When you undertake feature engineering for a new project, two outcomes are most likely:

  1. You capture the right features for your problem, but accidentally include highly correlated or duplicated features that add no new predictive value.

  2. You engineer all the wrong features.

Assuming you have domain expertise (as crucial to a successful predictive analytics project as technical chops, if not more so), it’s more likely that you’ll fall into the first group - you won’t know which features are the most beneficial.

The most common way I’ve seen data scientists proceed is to use Recursive Feature Elimination (RFE) or something else from the feature_selection module of scikit-learn.

The problem with this, however, is that you’re basing your feature selection on how much it helps the model, and not on anything related to the ‘ground truth’ of the particular phenomenon you’re trying to model.

This is one of the key mistakes that will lead the statistically-inclined to launch into finger-wagging sermons. But what can you do to find out more about the ‘ground truth’ of your features?

And what, specifically, should you be on the lookout for?

An introduction to multicollinearity

Confusingly enough, multicollinearity is also sometimes called collinearity. Fortunately, though, both of these terms can be understood syllable by syllable. Multi-co-linearity = many shared lines.

A more technical definition would be something like this: given two (or more) features, you could fit a linear model that predicts one of them from the other(s) to a good approximation.

As a simple example, imagine that you were a property developer building a block of apartments in a major city. You would like to forecast the purchasing propensity of local individuals. You know from your research that your ideal customer works in a professional capacity, has no children and has previously been unable to purchase a property due to restrictive lending requirements.

You collect lots of data on these individuals and you build a complex model that takes into account factors like their time in employment, their commute time, their salaries, their hobbies and their disposable income.

Did you see where multicollinearity crept in?

If you used both salary and disposable income in your model, you’d very likely end up with multicollinearity. After all, one of the key things we know about these potential customers is that they have not experienced any substantial changes in living expenses (marriage, child-rearing, property ownership etc.), so their disposable income is, to a good approximation, just their salary minus a roughly constant set of living costs.
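
To see why, here’s a tiny illustration with invented numbers: if living costs are roughly constant across the group, disposable income is essentially salary shifted by a constant, which is exactly the kind of linear relationship that produces multicollinearity.

import numpy as np

rng = np.random.default_rng(0)

# Invented figures: salaries vary, living costs are roughly constant for this group
salary = rng.uniform(40_000, 90_000, size=1_000)
living_costs = rng.normal(24_000, 1_000, size=1_000)
disposable_income = salary - living_costs

# The two 'features' end up almost perfectly correlated (~0.99)
print(np.corrcoef(salary, disposable_income)[0, 1])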

The problem with multicollinearity

It’s not easy to see why this should be an issue for a model. It’s certainly not the case that this will reduce the predictive accuracy of your model on the test set. So what’s the problem?

The problem is that the correlation structure behind your data can, and often does, change. Features that were once highly correlated can become less so, and the features in your model that bolstered one another and helped you make those crucial sales can later end up disagreeing with one another, sending your model’s accuracy into a tailspin.

Multicollinearity has some interesting side-effects:

  • It can lead to radical changes in coefficients when features are added to or removed from a model.
  • It can lead to one explanatory variable having a near-zero coefficient, even though a simple regression using just that variable and the target gives it a large, clearly non-zero coefficient.
  • It can lead to significant changes in the coefficients of a model when it is re-trained on slightly perturbed data (that is, data with small amounts of noise added).

All of this is a long way of saying that multicollinearity is the death-knell of a robust predictive model.

Variance inflation factors to the rescue

A variance inflation factor (VIF) is a number (a ratio, to be more precise) that describes how much the variance of a feature’s coefficient is inflated when the model includes the other features, compared with a model using that feature alone.

In the case of our property development scenario from above, we would expect the variance of the salary coefficient in the model using all of the features to be some large (>5) multiple of its variance in a model using salary alone as a predictor.

The reason that this variance is present can be understood with a little bit of linear algebra intuition. If two columns of a matrix are linearly dependent, you get no ‘new space’ from the extra column, and no new solutions. A 2D plane in 3 dimensions will have a larger set of coherent solutions as its coefficients are tweaked than a line will in the same space. This ‘incoherence’ is the variance. (If this seems like mystical hand-waving, please check out Gilbert Strang’s excellent Linear Algebra course on MIT OCW.)
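
You rarely compute a VIF by hand, but a sketch helps pin the idea down. The VIF for a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all of the other features (this mirrors what the statsmodels helper we’ll use later computes). Here, data is assumed to be a pandas DataFrame holding just the feature columns:

from sklearn.linear_model import LinearRegression

def vif(data, feature):
    # Regress the chosen feature on all of the other features...
    X = data.drop(columns=[feature])
    y = data[feature]
    r_squared = LinearRegression().fit(X, y).score(X, y)
    # ...the better they predict it, the higher the VIF
    return 1.0 / (1.0 - r_squared)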

An Example

Enough theory! Let’s see multicollinearity and VIF in action.

Here’s our data:

commute_time_hours  disposable_income  salary  years_in_employment  max_deposit
                 2              16000   45300                    9        99000
                 2              14000   55600                    8       105000
                 0              22000   76400                    6       128000
                 0              20000   70400                   19       146000

Our goal here is to build a robust linear regression model that predicts an individual’s maximum house deposit based on their other attributes.

Using the off-the-shelf LinearRegression from scikit-learn, we can achieve an RMSE (root mean squared error) of 316.85. That’s very good considering we’re dealing with a target variable that takes values of ~10^5 and up.
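
The full dataset and train/test split aren’t reproduced here, but the modelling step is straightforward. A minimal sketch, assuming the data lives in a CSV with the column names from the table above (the file name and split parameters are placeholders):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("deposits.csv")   # hypothetical file containing the table above
features = ["commute_time_hours", "disposable_income", "salary", "years_in_employment"]
data = df[features]                # feature columns only; reused later for the VIF check

X_train, X_test, y_train, y_test = train_test_split(
    data, df["max_deposit"], test_size=0.2, random_state=42
)

regr = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, regr.predict(X_test)))
print(rmse)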

But let’s take a look at the coefficients of the model:

print(regr.coef_)
  RMSE  commute_time_hours  disposable_income  salary  years_in_employment
316.85             703.701              0.997   1.000             2001.831

One of the side effects of multicollinearity that we discussed above is instability in the coefficients of the explanatory variables. If we investigate the correlation between the target variable and our disposable income feature, we’ll observe a high degree of correlation.

The feature has a positive .906 correlation with our target variable. Let’s try training another model, this time removing the salary feature that we suspect is introducing multicollinearity.
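
Carrying on with the sketch from above (the variable names are carried over), the correlation check and the retrain without salary are both short:

# Correlation between disposable income and the target (~0.906 on this data)
print(df["disposable_income"].corr(df["max_deposit"]))

# Retrain with the salary column removed
reduced = ["commute_time_hours", "disposable_income", "years_in_employment"]
regr_no_salary = LinearRegression().fit(X_train[reduced], y_train)
print(regr_no_salary.coef_)

Retraining on the reduced feature set gives: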

   RMSE  commute_time_hours  disposable_income  years_in_employment
8453.45             1011.35              3.585             1967.442

This time, our RMSE is much larger at 8453.45. But the RMSE isn’t what we’re interested in here: the coefficient of the disposable income variable has more than tripled, to 3.585. Yikes! The coefficient of the commute_time_hours variable also increased by over 40%.

Another side effect of multicollinearity is that perturbations in the data will also dramatically affect the coefficients. Let’s take a look:
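
The exact noise isn’t reproduced here; as a sketch, adding Gaussian noise on the order of 1% of each value (the scale is an assumption) looks like this:

# Add noise of roughly 1% of each value's size to every feature
rng = np.random.default_rng(1)
X_noisy = X_train + X_train * rng.normal(0.0, 0.01, size=X_train.shape)

regr_noisy = LinearRegression().fit(X_noisy, y_train)
print(regr_noisy.coef_)

The retrained coefficients come out like this: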

commute_time_hours  disposable_income  salary  years_in_employment
           445.838              0.423   1.016             1935.815

After introducing small amounts of noise into the data, we can see that two of the coefficients in our model have nearly halved! Not very robust.

Variance inflation factors to the rescue (again)

Let’s go back to our original model, the high-performing one. We can build a DataFrame of VIF scores like so:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# data holds just the four feature columns
vif = pd.DataFrame()
vif["VIF Factor"] = [
  variance_inflation_factor(data.values, i) for i in range(data.shape[1])
]
vif["features"] = data.columns

Inspecting the vif DataFrame, we can see the following:

Features               VIF Factor
commute_time_hours       2.752078
disposable_income       68.084610
salary                 267.006743
years_in_employment     32.609734

Those are some very big numbers. Ideally, each VIF should be below 5, or at the very least below 10, so we definitely have some multicollinearity here. Perhaps surprisingly, even the years_in_employment variable has a high VIF, which we did not expect, but which might be explained by the fact that salary can be partly predicted from career length.

What to do next very much depends on your problem domain. Do you think that disposable income is likely to significantly increase or decrease due to upcoming macro-economic changes? If you don’t think that’s likely, you might not have to reduce multicollinearity in your model at all.

If, on the other hand, you see a reasonable chance of that happening and are also concerned about the high VIF of the years_in_employment variable, it might be worth retraining the model with one of the offending variables removed.

But which one to choose?

In practice, you should prefer to drop the feature that takes the most effort to collect. In this case, that’s probably going to be disposable_income. People know their salaries off the top of their heads; their disposable income? Not so much.

Let’s try a final model, using just commute_time_hours, salary and years_in_employment.
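
This is just the earlier sketch again with the disposable_income column left out:

final_features = ["commute_time_hours", "salary", "years_in_employment"]
regr_final = LinearRegression().fit(X_train[final_features], y_train)

rmse_final = np.sqrt(
    mean_squared_error(y_test, regr_final.predict(X_test[final_features]))
)
print(rmse_final, regr_final.coef_)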

Our RMSE this time is higher (roughly ten times higher, in fact) at 3233.08, but that’s still reasonable given the order of magnitude of our target variable.

The coefficients look like this:

   RMSE  commute_time_hours  salary  years_in_employment
3233.08             616.568   1.332             2007.461

Let’s reintroduce the noise, to the same degree as before, and inspect the coefficients again:

   RMSE  commute_time_hours  salary  years_in_employment
3403.61             664.043   1.336             2006.040

Those look much more like small perturbations to me!

And here is our VIF DataFrame reconstructed on the data minus the disposable_income feature:

Features             VIF Factor
commute_time_hours        2.272
salary                    3.726
years_in_employment       3.154

Everything is comfortably below 5, including the years_in_employment variable, which previously had a somewhat mysteriously high VIF. Excellent - we’ve removed the multicollinearity.

The final analysis

Obviously, the point of all of this was not to just reduce the VIFs for their own sake. We wanted to make our model more robust.

Remember our RMSE from the first model? It was an exceptional 316.85. Now let’s simulate a fairly radical change in disposable_income by randomly increasing or decreasing each out-of-sample disposable income by an integer between 1 and 4999.
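
Here’s a sketch of that shock, reusing the names from the earlier snippets (the exact mechanics of how the shifts are applied are an assumption):

# Shift each test-set disposable income up or down by a random integer in [1, 4999]
rng = np.random.default_rng(2)
shift = rng.integers(1, 5000, size=len(X_test)) * rng.choice([-1, 1], size=len(X_test))

X_shocked = X_test.copy()
X_shocked["disposable_income"] = X_shocked["disposable_income"] + shift

rmse_shocked = np.sqrt(mean_squared_error(y_test, regr.predict(X_shocked)))
print(rmse_shocked)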

When we do this, the RMSE skyrockets to 20376.79 - an increase of more than 6000%.

If we apply the same kind of radical change to the model that excludes disposable_income (and, more importantly, has no multicollinearity), we see the RMSE increase from 3233.08 to 11472.38, an increase of around 250%.

I hope that this deep dive into multicollinearity has helped you understand an important way to make your model more robust. If you have any questions, please get in touch.