Imbalanced learning problems often stump those new to dealing with them. When the ratio between classes in your data is 1:100 or larger, early attempts to model the problem are rewarded with very high accuracy but very low sensitivity: the model almost never flags the minority class. You can address the sensitivity problem in imbalanced learning in a few different ways:

  • You can naively weight the classes, making your model preferential to the minority class.
  • You can use undersampling, oversampling or a combination of the two.
  • You can switch your goal from trying to balance the dataset to trying to predict the minority class using outlier detection techniques.

In this post, I’ll show you how, and more importantly when, to use the last of these methods and compare the results to the weighting and rebalancing approaches.

An example

Fraud detection is a common use case where imbalanced learning shows up. Here’s a sample of some fraud data, which has five features and a binary target column that tells us whether the account has been linked to fraudulent activity.

          x1            x2           x3         x4          x5  y
 1586.332656  -3175.556625   169.110872  -2.074710  347.898414  0
  576.724472  -1584.409291   -15.996214  -8.337449  196.573206  0
 1008.873704  -2512.834630   -49.949509   4.261361  346.487801  0
-3531.804947    842.087872   -57.086652  30.201235  170.465378  1
 1819.857735   -118.635258  -334.706060   6.570076  255.527310  0

Inspecting the dataset, we can see that of our 5000 observations, only 107 represent fraudulent accounts, a little over 2% of the dataset.
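A quick way to confirm this, assuming the sample above lives in a pandas DataFrame called data:

# `data` is assumed to be the DataFrame shown above, with the target in column 'y'
print(data['y'].value_counts())                 # absolute counts per class
print(data['y'].value_counts(normalize=True))   # proportions: roughly 98% vs 2%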

Training a logistic regression model on the data as is leads to the following results:

An area under the curve of 84% in the ROC chart makes it seem like the model is doing a good job, but the Precision-Recall curve and the Confusion Matrix tell a very different story:

The AUC of the PR Curve is only 53% and we can see that only 28% of the fraudulent accounts were identified as such by our model.
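For completeness, here’s a minimal sketch of how this baseline could be fitted and scored. The split into X_train, X_test, y_train and y_test is an assumption on my part (the same names the rest of the post uses), and the exact numbers will depend on the split:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score, confusion_matrix

# Split the five features and the target; `data` is the DataFrame shown above
X_train, X_test, y_train, y_test = train_test_split(
    data[['x1', 'x2', 'x3', 'x4', 'x5']], data['y'], test_size=0.2, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]           # probability of the positive class
print(roc_auc_score(y_test, probs))               # area under the ROC curve
print(average_precision_score(y_test, probs))     # area under the PR curve
print(confusion_matrix(y_test, clf.predict(X_test)))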

Weighting the classes

Let’s try weighting the classes when we set up the classifier and inspect how the confusion matrix changes:

clf = LogisticRegression(class_weight={0:1,1:5})

So here we’re putting 5x more weight on the positive class than on the negative one. Here’s the confusion matrix for this parameter:

It’s quite a lot better, now correctly classifying the fraudulent accounts 48% of the time, without sacrificing any true negatives. Let’s try a different weight:

clf = LogisticRegression(class_weight={0:1,1:20})

With 20x the preference towards the positive class, here’s the confusion matrix:

Again, the detection of fraudulent accounts has improved, this time to 60%, but now we’re incorrectly classifying a small percentage of negative instances.

Let’s try one more, weighting the positive class by 100 this time:

clf = LogisticRegression(class_weight={0:1,1:100})

Okay, so our fears have been confirmed: increasing the weight of the positive class significantly increases true-positive detection but causes a decline in true-negative detection.
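For reference, a compact way to reproduce this comparison is to loop over the candidate weights and print a normalised confusion matrix for each. A rough sketch, using the same X_train/X_test split as the rest of the post and scikit-learn’s normalize option:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

for w in [1, 5, 20, 100]:
    clf = LogisticRegression(class_weight={0: 1, 1: w})
    clf.fit(X_train, y_train)
    # Row-normalised confusion matrix: diagonal = true-negative and true-positive rates
    cm = confusion_matrix(y_test, clf.predict(X_test), normalize='true')
    print(f"positive-class weight {w}:\n{cm}\n")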

Balancing the dataset

Next we’ll use the excellent imbalanced-learn package to see how rebalancing the dataset affects our ability to detect fraud.

We’ll start with undersampling the negative class:

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop negative (majority-class) examples until the classes are balanced
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
clf = LogisticRegression()
clf.fit(X_resampled, y_resampled)

Note: it’s very important that you resample the data after splitting it into training and test sets. Resampling first leaks information, because copies (or synthetic neighbours) of test observations end up in the training set, so your evaluation will be optimistically biased and the model will disappoint out-of-sample.
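If you want to resample inside a cross-validation loop, imbalanced-learn also provides a Pipeline that applies the sampler to the training folds only, which sidesteps the leak automatically. A minimal sketch:

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The sampler runs only on each training fold; validation folds stay untouched
pipe = Pipeline([
    ('undersample', RandomUnderSampler()),
    ('model', LogisticRegression()),
])
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='average_precision')
print(scores.mean())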

Let’s take a look at the outcome from undersampling the negative class:

Undersampling in this case seems to do worse than just altering the weights (the case where we weighted the positive class 20x), providing only a marginal increase in true positives for significantly more false positives.

Next, let’s try oversampling the positive class:

from imblearn.over_sampling import ADASYN

# Generate synthetic positive (minority-class) examples until the classes are balanced
ada = ADASYN()
X_resampled, y_resampled = ada.fit_resample(X_train, y_train)
clf = LogisticRegression()
clf.fit(X_resampled, y_resampled)

ADASYN (Adaptive Synthetic Sampling) is one of the more advanced oversampling algorithms, and it provides us with easily the best model so far:

A 72% true-positive rate in exchange for only a 9% false-positive rate. Let’s take a look at the Precision-Recall curve:

Only a 3% increase in area under the curve compared to our much worse-performing first attempt. Now, let’s have a look at a method which combines over-sampling the positive class with cleaning up (under-sampling) the negative one:

from imblearn.combine import SMOTEENN

# SMOTE oversamples the positive class; ENN then removes noisy, overlapping samples
smo = SMOTEENN()
X_resampled, y_resampled = smo.fit_resample(X_train, y_train)
clf = LogisticRegression()
clf.fit(X_resampled, y_resampled)

SMOTEENN combines SMOTE (synthetic oversampling of the minority class) with Edited Nearest Neighbours, which removes samples whose class disagrees with the majority of their nearest neighbours, tidying up the overlap between the classes.
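A quick sanity check after any of these samplers is to compare the class counts before and after resampling:

from collections import Counter

print(Counter(y_train))      # original counts, heavily skewed towards the negative class
print(Counter(y_resampled))  # counts after resampling, roughly balanced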

Here’s the confusion matrix after resampling using SMOTEENN:

Not a terrible model, but not as good as ADASYN alone.

Why didn’t balancing the dataset work?

We had imbalanced data. We fixed that. So why didn’t our model improve when we had the same number of instances for both the negative and positive cases?

To get a better idea of why, we can take a look at some plots that compare the classes across a pair of features:

In the first chart, you can clearly see that the fraudulent accounts (in yellow) for the most part lie outside the negative cases (purple).

Oversampling algorithms like SMOTE and ADASYN generally work by connecting points of the minority class and creating new observations along the lines that connect those points. This works well when there is a clear spatial separation between the classes, but when there is not, you get something that looks very much like the second chart.

Is it any wonder that we saw a lot of false-positives in our balanced dataset?
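To make the “connect the points and sample along the line” idea concrete, here’s a toy sketch of the interpolation step that SMOTE-style oversamplers perform (minority_points is a made-up stand-in for the positive class):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in two dimensions
minority_points = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 1.8]])

# Take a point and one of its minority-class neighbours, then create a
# synthetic observation somewhere along the segment between them
a, b = minority_points[0], minority_points[1]
new_point = a + rng.random() * (b - a)
print(new_point)  # always lies on the line between a and b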

Finding the outliers

Let’s take a look at the feature by feature plots for each combination of features:
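One way these pairwise plots could be produced, iterating over every pair of features and colouring by the target (a rough sketch; the 2x5 grid just happens to fit the ten pairs):

import itertools
import matplotlib.pyplot as plt

features = ['x1', 'x2', 'x3', 'x4', 'x5']
pairs = list(itertools.combinations(features, 2))  # ten feature pairs

fig, axes = plt.subplots(2, 5, figsize=(20, 8))
for ax, (f1, f2) in zip(axes.ravel(), pairs):
    ax.scatter(data[f1], data[f2], alpha=0.7, c=data['y'])
    ax.set_xlabel(f1)
    ax.set_ylabel(f2)
plt.tight_layout()
plt.show()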

Eyeballing the graphs for outliers, we can see that the positive class sits outside the main cluster across a number of dimensions: most of these graphs have some yellow points outside the clump of purple points.

There is one exception, however, and that is the features x2 and x3 (the fifth chart).

This is a good indication that these features don’t add any predictive power when it comes to identifying outliers, so we should drop them before going any further.

The One-Class SVM

A One-Class Support Vector Machine is an unsupervised learning algorithm that is trained only on the ‘normal’ data, in our case the negative examples. It learns a boundary around those observations and flags anything falling outside it as an outlier.

from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

train, test = train_test_split(data, test_size=0.2)
train_normal = train[train['y'] == 0]
train_outliers = train[train['y'] == 1]
outlier_prop = len(train_outliers) / len(train_normal)

# Train only on the normal (non-fraud) rows, using the features where fraud stood out
algorithm = OneClassSVM(kernel='rbf', nu=outlier_prop, gamma=0.000001)
algorithm.fit(train_normal[['x1', 'x4', 'x5']])

Notice that we’re only using the x1, x4 and x5 features, as these are the areas where the fraudulent accounts were most obviously outliers.

Training any kind of unsupervised learning algorithm can be difficult, and the One-Class SVM is no exception. The nu parameter should be roughly the proportion of outliers you expect to observe (in our case around 2%), while the gamma parameter controls how tightly the RBF decision boundary wraps around the training data: smaller values give smoother contours.
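Before reaching for plots, a quick numeric sweep over candidate gamma values can narrow the search. Here’s a rough sketch (reusing outlier_prop, train_normal and test from above) that reports what fraction of the test set each model flags as an outlier, which should end up near the 2% we expect:

from sklearn.svm import OneClassSVM

features = ['x1', 'x4', 'x5']
for gamma in [1e-3, 1e-4, 1e-5, 1e-6]:
    model = OneClassSVM(kernel='rbf', nu=outlier_prop, gamma=gamma)
    model.fit(train_normal[features])
    flagged = (model.predict(test[features]) == -1).mean()
    print(f"gamma={gamma}: {flagged:.1%} of test points flagged as outliers")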

I find it helps to have a plot like those above side-by-side with our model’s output, so that we can see how things change as we tweak the gamma parameter:

First, the plot of the actual classes:

import matplotlib.pyplot as plt

x = test['x1']
y = test['x4']

# Colour each test point by its true label (fraud shows up in yellow)
plt.scatter(x, y, alpha=0.7, c=test['y'])
plt.xlabel('x1')
plt.ylabel('x4')

And the model’s predictions:

import numpy as np

x = test['x1']
y = test['x4']

# OneClassSVM.predict returns +1 for inliers and -1 for outliers;
# (y_pred + 1) // 2 maps these to 1 and 0 so they can index the colour array
y_pred = algorithm.fit(train_normal[['x1', 'x4', 'x5']]).predict(test[['x1', 'x4', 'x5']])
colors = np.array(['#377eb8', '#ff7f00'])
plt.scatter(x, y, alpha=0.7, c=colors[(y_pred + 1) // 2])
plt.xlabel('x1')
plt.ylabel('x4')

So we have this, for gamma = 0.001:

Not quite accurate. Let’s try gamma = 0.0001:

It’s getting better. Let’s skip ahead until we get to gamma = 0.000001:

This looks like a really close match, fantastic!

Let’s have another look at our confusion matrix:
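One detail to be aware of: OneClassSVM.predict returns +1 for inliers and -1 for outliers, so the predictions need mapping onto our 0/1 fraud labels before a confusion matrix can be computed. A sketch:

from sklearn.metrics import confusion_matrix

# Map -1 (outlier) to 1 (fraud) and +1 (inlier) to 0 (normal)
raw_pred = algorithm.predict(test[['x1', 'x4', 'x5']])
fraud_pred = (raw_pred == -1).astype(int)

print(confusion_matrix(test['y'], fraud_pred, normalize='true'))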

A perfect 100% true-positive rate in exchange for only a 3% false-positive rate.

Success!

I hope that you’ve found this exploration of One-Class SVMs helpful, please don’t hesitate to get in touch if you have any questions.