A lot of the projects I work on are time-bound in one way or another. My clients need to know the churn rate next week, the risk of fraud next month, their anticipated revenue next quarter. But what features does a model need to do this well?
Feature engineering is one of the most creatively challenging aspects of a data science project. When you follow a tutorial or read a book, it’s easy to forget that someone had to go through the difficult work of creating the features you use in your model.
In practice, creating features from raw data requires a great deal of foresight and some intuition about what will help your model do its job best. One of the best ways I’ve found to increase the accuracy of a time-based predictive model (especially one trained on an imbalanced data set) is to use slopes.
What are slopes?
Simply put, slopes are the numerical features that describe a general trend in your data.
To calculate them, you take the values of some feature across fixed time intervals (days, weeks, months, etc.), subtract the first value from the second, the second from the third, and so on, then average the resulting array of differences. That’s the average slope of the line through your data.
Fortunately, you can do this in 1 line of Python:
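Assuming your weekly values live in a list or NumPy array, a minimal sketch looks like this (the numbers are made up for illustration):

```python
import numpy as np

# Four weeks of hypothetical values for some feature, e.g. a churn rate
values = [0.10, 0.15, 0.22, 0.31]

# np.diff gives the week-over-week changes; their mean is the average slope
slope = np.mean(np.diff(values))
```

`np.diff` returns the successive differences (here 0.05, 0.07, 0.09), so `slope` comes out at 0.07.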
Say you had your customers clustered into 5 groups (maybe they’ve come from different referral sources or are on different paid plans).
You’ve noticed that 2 groups in particular are likely to churn, but on different time horizons: the average lifetime of one group is a month, whereas the other takes 3 months on average.
Here’s our example data:
| Group | 1 Month Churn | 3 Month Churn |
Doing some exploratory analysis, you notice that Groups 1 and 2 are far more likely to churn over a 3 month period:
So you go ahead and train a logistic regression model to predict 3 month churn, with the customer’s group as the only feature. It ends up being a very solid model:
Your ROC AUC and PR AUC are both over 90%. That’s an excellent model!
You’re enthused about your results, but it turns out that your boss would really like a 1 month churn model built. You go right ahead and do the same thing as before, but this time, you get a significant decline in your results:
The AUC of your ROC curve declined by 5%, and the AUC of your PR curve fared much worse: a 14% reduction.
You decide to go back and have a look at the distributions, by group, for the 1 month churn column. Can you see where the problem is?
The churn rates are not so clear-cut this time round. The chance of a Group 2 customer churning is just over half, and anything near a coin flip is difficult to predict.
This is traditionally where you’d go back to the drawing board and try to work out which additional features you could use, and that’s exactly what we’re going to do.
The first thing we have to do to calculate slope is find the week-by-week (in this case) churn proportions for each group. Something like this:
| Group | Week 1 | Week 2 | Week 3 |
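If your raw data is one row per customer, recording their group and the week in which they churned, one way to get those proportions is a groupby. This is a sketch with hypothetical column names (`group`, `churn_week`):

```python
import pandas as pd

# Hypothetical raw data: one row per customer, with the group they belong
# to and the week they churned in (None if they haven't churned yet)
raw = pd.DataFrame({
    "group": ["Group 1", "Group 1", "Group 1", "Group 2", "Group 2"],
    "churn_week": [1, 1, 2, 1, None],
})

# Count churns per group and week, then divide by each group's size
# to get the week-by-week churn proportions
counts = raw.groupby(["group", "churn_week"]).size().unstack(fill_value=0)
history = counts.div(raw.groupby("group").size(), axis=0)
```

Each row of `history` is a group, each column a week, and each cell the fraction of that group that churned in that week.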
Then, we create a function to use with Pandas apply:
import numpy as np

def calculate_slope(row):
    return np.mean(np.diff(row))

history_slopes = history.apply(calculate_slope, axis=1)
And we end up with something like this:
print(history_slopes)
Group 1    5.967922
Group 2    0.657376
Group 3    0.024669
Group 4    0.123684
Group 5   -0.164833
We can join this new DataFrame to our original data, and then retrain the LogisticRegression model, using this slope as a new feature. Here’s what happens:
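The join-and-retrain step might look something like this sketch. The column names and labels here are hypothetical, and the slope values are the ones printed above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical customer-level data: group membership and a 1 month churn label
customers = pd.DataFrame({
    "group": ["Group 1", "Group 1", "Group 2", "Group 2", "Group 3", "Group 3"],
    "churned_1m": [1, 1, 1, 0, 0, 0],
})

# Per-group slopes, as computed with np.diff above
history_slopes = pd.Series(
    {"Group 1": 5.97, "Group 2": 0.66, "Group 3": 0.02}, name="slope"
)

# Join the slope back onto each customer row as a new feature
customers = customers.join(history_slopes, on="group")

# Train on the group (one-hot encoded) plus the new slope feature
X = pd.get_dummies(customers[["group"]]).assign(slope=customers["slope"])
y = customers["churned_1m"]
model = LogisticRegression().fit(X, y)
```

`DataFrame.join(..., on="group")` maps each group’s slope onto its customers, so every row carries the trend of its own cluster.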
Using just this 1 additional feature has allowed us to recapture nearly all of the AUC for the Receiver Operating Characteristic curve, and over half of the AUC for the Precision-Recall curve.
In this model I used the average slope for the churn per group over the last 12 weeks, but you can and should try different time periods and test these slope features in different combinations.
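Varying the window is as simple as slicing the series before taking the differences. A sketch with 12 weeks of made-up rates:

```python
import numpy as np

# Hypothetical 12 weeks of churn rates for one group
weekly = np.array([0.05, 0.06, 0.06, 0.08, 0.09, 0.11,
                   0.12, 0.15, 0.17, 0.20, 0.24, 0.29])

# Average slope over the full 12 weeks versus just the last 4:
# shorter windows react faster to recent shifts in behaviour
slope_12w = np.mean(np.diff(weekly))
slope_4w = np.mean(np.diff(weekly[-4:]))
```

For this series the 4 week slope is steeper than the 12 week one, so each window tells the model something slightly different about the trend.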
I hope I’ve shown you how powerful trend analysis and slope calculations can be as features. If you have any questions, get in touch!