Class Imbalance and Oversampling: A Formal Introduction

Let’s explore the class imbalance problem and how resampling methods such as random oversampling attempt to solve it.

Lately, I have been building a Julia package to address class imbalance, called Imbalance.jl. Building it took a lot of effort in terms of reading papers and studying existing implementations, so I thought it may be helpful to share what I’ve learned about the class imbalance problem in general, along with some of the most popular algorithms used to tackle it. These include naive random oversampling, Random Over-Sampling Examples (ROSE), Random Walk Oversampling (RWO), the Synthetic Minority Oversampling Technique (SMOTE), SMOTE-Nominal, SMOTE-Nominal Continuous, and many undersampling approaches. In this story, we will formally define the class imbalance problem and discuss how random oversampling is a valid solution. In later articles, we will rationally arrive at further solutions by considering other techniques.

Photo by Artem Kniaz on Unsplash

The Class Imbalance Problem

Most, if not all, machine learning algorithms can be viewed as a form of empirical risk minimization, where the objective is to find the parameters θ that, for some loss function L, minimize

$$\hat{R}(f_\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(f_\theta(x_i), y_i\big)$$

over a training set of N labeled examples (x_i, y_i).

For instance, linear regression takes L to be the squared loss, logistic regression takes it to be the cross-entropy loss, SVM takes it to be the hinge loss, and AdaBoost takes it to be the exponential loss.
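For reference, these losses take the following standard forms for a single prediction ŷ with label y (conventions differ slightly across textbooks, e.g., in the label encoding):

$$\begin{aligned}
L_{\text{squared}}(\hat{y}, y) &= (\hat{y} - y)^2 \\
L_{\text{cross-entropy}}(\hat{y}, y) &= -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big), && y \in \{0, 1\} \\
L_{\text{hinge}}(\hat{y}, y) &= \max(0,\, 1 - y\hat{y}), && y \in \{-1, +1\} \\
L_{\text{exponential}}(\hat{y}, y) &= e^{-y\hat{y}}, && y \in \{-1, +1\}
\end{aligned}$$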

The underlying assumption is that if f_θ minimizes this empirical risk over the dataset, which may be viewed as a random sample from the population, then it should be close enough to the target function f that we seek: the function that minimizes the same quantity not only over the dataset but over the whole population.

In a multi-class setting with K classes, we can group the terms by class and write the empirical risk as

$$\hat{R}(f_\theta) = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \,:\, y_i = k} L\big(f_\theta(x_i), y_i\big)$$

Class imbalance occurs when some classes have far fewer examples than others. In this case, the terms corresponding to those classes contribute little to the sum, which makes it possible for a learning algorithm to find an approximate solution that mostly minimizes only the significant sums. This yields a hypothesis f_θ that may be very different from the true target f on the minority classes, which may be the most important ones for the application in question.

In summary, we have a class imbalance problem when the following conditions hold:

1 — The points in the training set are not distributed “fairly” among the classes; some classes have far fewer points belonging to them than others.

2 — The model does not perform well on points belonging to such minority classes after training. That is, the learning algorithm has not minimized the loss appropriately for the minority classes as explained above.

How harmful this problem is depends on how important the minority classes are for the application. Most often, they are much more important than the majority class (e.g., when classifying fraudulent transactions or rare diseases).

Solving the Class Imbalance Problem

It becomes obvious from the description of the problem that one remedy is to weight the smaller sums (i.e., those belonging to the minority classes) so that a learning algorithm cannot easily settle for approximate solutions that exploit their insignificance. It is often easy to modify machine learning algorithms for that purpose, especially when they are explicitly a form of empirical risk minimization rather than merely being equivalent to one for some loss function.
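To make the weighting idea concrete, here is a minimal Julia sketch; the inverse-frequency weighting scheme and the toy 0–1 loss are illustrative assumptions, not part of Imbalance.jl:

```julia
using StatsBase: countmap

# Inverse-frequency weights: class k with N_k points gets weight N / (K * N_k)
function class_weights(y)
    counts = countmap(y)
    K, N = length(counts), length(y)
    Dict(k => N / (K * n_k) for (k, n_k) in counts)
end

# Weighted empirical risk: each term in the sum is scaled by its class weight
weighted_risk(loss, yhat, y, w) =
    sum(w[t] * loss(p, t) for (p, t) in zip(yhat, y)) / length(y)

# Toy example: a model that always predicts the majority class
y    = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # 80% class 0, 20% class 1
yhat = zeros(Int, 10)                    # predicts class 0 everywhere
w    = class_weights(y)
zero_one(p, t) = p == t ? 0.0 : 1.0
weighted_risk(zero_one, yhat, y, w)      # 0.5, versus 0.2 without weights
```

With the weights in place, ignoring the minority class is no longer cheap for the optimizer.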

Another approach that attempts to solve the problem without requiring any modifications to the learning algorithm is resampling the data. In its simplest form it can be viewed as equivalent to the cost-sensitive approach of assigning weights. Consider the following algorithm:

Given: An imbalanced dataset with K classes and an integer for each class
Wanted: A dataset where data from each class is replicated according to the integer
Operation: Repeat each point in class k, c times, where c is the integer associated with that class
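Here is what that operation might look like in Julia, as a minimal sketch assuming that X stores one observation per row and that the dictionary supplies an integer for every class (this is not Imbalance.jl's interface):

```julia
# Replicate the points of each class according to its integer.
# c[k] is the total number of times each point of class k should appear.
function replicate_classes(X, y, c::Dict)
    rows = reduce(vcat, [repeat(findall(==(k), y), c[k]) for k in keys(c)])
    X[rows, :], y[rows]
end

# Example: class 1 points appear twice, class 2 points appear three times
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
y = [1, 1, 2]
Xnew, ynew = replicate_classes(X, y, Dict(1 => 2, 2 => 3))
```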

It should be obvious by plugging this into the sum that it is equivalent to the cost-sensitive approach; recall that minimizing a function is equivalent to minimizing a scalar positive multiple of it.
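Concretely, if every point of class k is repeated c_k times, the new dataset has N′ = Σ_k c_k N_k points and its empirical risk becomes

$$\frac{1}{N'} \sum_{k=1}^{K} c_k \sum_{i \,:\, y_i = k} L\big(f_\theta(x_i), y_i\big)$$

which is, up to the positive constant N/N′, the class-weighted empirical risk with weights c_k, so the two objectives have the same minimizers.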

Random Oversampling

The aforementioned algorithm suffers from a small problem: if class A has 900 examples and class B has 600 examples, then there is no integer multiple that we can use to oversample class B to bring the dataset to exact balance. We can extend the algorithm to deal with replication ratios that aren’t integers by choosing the points to replicate randomly. For instance, if we want to oversample 300 examples for class B so that the two classes are brought to balance (equivalent to a ratio of 1.5), then we can do so by…

1 — Choosing 300 points randomly from class B
2 — Replicating those points

This algorithm is called naive random oversampling and what it formally does is:

1 — Compute the necessary number of points to generate for each class (computed from some given ratios)
2 — Suppose that for class k this number is N_k; then choose N_k points randomly with replacement from the points belonging to that class and add them to the dataset to form the new dataset
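As a concrete illustration, here is a minimal Julia sketch of these two steps under the assumption that the target count for every class is the size of the largest class; this is not Imbalance.jl's interface, and the same mechanics apply when the target counts are instead computed from given ratios:

```julia
using StatsBase: countmap, sample
using Random

# Naive random oversampling: bring every class up to the majority class size
# by sampling points with replacement and appending the replicas.
function random_oversample(X, y; rng = Random.default_rng())
    counts = countmap(y)
    n_max  = maximum(values(counts))
    keep   = collect(1:length(y))               # keep all original rows
    for (k, n_k) in counts
        idx   = findall(==(k), y)
        extra = sample(rng, idx, n_max - n_k)   # N_k replicas, drawn with replacement
        append!(keep, extra)
    end
    X[keep, :], y[keep]
end
```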

It should be obvious that this is on average equivalent to the aforementioned algorithm and hence also to class weighting. If the ratio for class k is 2.0, then on average each point will be picked once at random.

Here is a random imbalanced dataset that I have generated for three classes (0, 1, and 2); the figure shows a histogram of the classes and a scatter plot of the points before and after oversampling.

Figure by the Author

Notice that there is no visual difference between the two bottom figures because all generated points are replicas of existing ones.
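To give a flavor of that experiment in code, here is how one might generate a similar toy three-class dataset and compare the class counts before and after applying the random_oversample sketch from above (the class sizes and Gaussian clusters here are arbitrary choices, not the ones used for the figure):

```julia
using StatsBase: countmap
using Random

rng = Random.MersenneTwister(42)

# Three Gaussian clusters with imbalanced sizes
sizes   = Dict(0 => 500, 1 => 200, 2 => 60)
centers = Dict(0 => [0.0, 0.0], 1 => [3.0, 3.0], 2 => [-3.0, 3.0])
X = reduce(vcat, [randn(rng, n, 2) .+ centers[k]' for (k, n) in sizes])
y = reduce(vcat, [fill(k, n) for (k, n) in sizes])

# random_oversample is the sketch defined earlier in this story
Xover, yover = random_oversample(X, y; rng = rng)
println(countmap(y))      # imbalanced counts: 500, 200, 60
println(countmap(yover))  # every class now has 500 points
```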

Why Oversampling?

You may not yet be convinced that resampling deserves more attention than class weighting as a solution to the class imbalance problem; after all, we have shown that naive random oversampling is on average equivalent to class weighting, only with longer training times (due to more data). The catch in this argument is that class weighting is not always possible to apply, especially when the machine learning model does not explicitly minimize a loss.

It makes sense that we would achieve better results than with naive random oversampling (or class weighting) if we instead added points to each class naturally, by collecting more data for our dataset. For instance, suppose that…

1 — We want to detect whether a transaction is fraudulent
2 — We have collected a dataset of 1K fraudulent transactions and 999K valid transactions

It should be obvious that solving the imbalance problem by collecting another 998K fraudulent transactions is far superior to replicating the existing 1K transactions until they also number 999K. In particular, we run a high risk of overfitting to the particular data observed in the latter case.

The reality, of course, is that it is not generally feasible to collect more data for the minority classes; however, it may be possible to simulate the effect of doing so. If we synthetically generate data for the minority classes that is representative enough of real new examples, then this should ideally be much better than repeating examples or class weighting. This is the most common form of oversampling, and its relation to collecting real data may be the most plausible justification for why it can outperform random oversampling and class weighting, making it a worthwhile approach to solving the class imbalance problem.

Undersampling

Lastly, note that if, instead of randomly choosing examples to replicate in the minority classes, we randomly choose points to remove from the majority classes, then the algorithm becomes naive random undersampling. This obviously has the disadvantage of losing useful data, but sometimes eliminating data from “not so useful” majority classes solves the imbalance problem and leads to much better performance on the “much more useful” minority classes. There are also other undersampling approaches that choose which points to exclude more carefully, for example to preserve the structure of the data or to allow for better decision boundaries.
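Mirroring the earlier oversampling sketch, a minimal illustrative version of naive random undersampling (again, not Imbalance.jl's implementation) could look like this, cutting every class down to the size of the smallest one:

```julia
using StatsBase: countmap, sample
using Random

# Naive random undersampling: randomly keep only as many points per class
# as the smallest class has, sampled without replacement.
function random_undersample(X, y; rng = Random.default_rng())
    counts = countmap(y)
    n_min  = minimum(values(counts))
    keep   = Int[]
    for k in keys(counts)
        idx = findall(==(k), y)
        append!(keep, sample(rng, idx, n_min; replace = false))
    end
    X[keep, :], y[keep]
end
```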

In further stories, we will explain all the algorithms initially implemented in Imbalance.jl and how they solve the class imbalance problem.
