For the past 10 years, I’ve been working with businesses of all shapes and sizes. I’ve worked on problems that ran the gamut from simple to incredibly complex. All that time, I’ve been trying to extract a framework for achieving results in analytics.
Of course, machine learning is an amorphous beast - it’s always growing and changing. And though my work has been resistant to systematisation, there is a methodology that I’ve adopted with recent clients that has proven itself useful, and I’m going to present it here.
Before we get to that, though, I just want to highlight some guiding principles of my work which say a little more about why I do what I do, and what the usual settings are.
I refuse to do any predictive work that reduces people to mere numbers. The thought that keeps me up at night is the fear that I’m contributing to a group of technologies that will accelerate the race to the bottom and rob honest, hard-working people of their individual sovereignty, rights, and their ability to act in their own best interest.
What I mean by this is that I will not build any system that preys on lowest-common-denominator emotions. Fear, addiction, lust, you know the type. I love to help businesses meet their goals, but I will not spend any fraction of my career capturing eyeballs or decimating self-confidence.
Data science, and AI more generally, is supposed to make things better. It’s supposed to make the laborious aspects of our lives easier so that we can focus on more important and creative endeavours. It is not supposed to put up arbitrary barriers to gainful employment, generate addictively watchable videos for children, or be used to manufacture politically dangerous spoofs.
And so, I don’t do anything like that.
Some people may have the ability to work on a data science problem without understanding where the data came from and what the end goal is. I am not one of those people.
I prefer my machine learning in a soup-to-nuts, end-to-end fashion. I can’t fit algorithms to tidy datasets, report my results and call it a day. That’s just not how my brain works.
I understand that it’s difficult to have very large teams work in anything but a piecemeal fashion. I know that the separation of tasks leads to meaningful gains in efficiency. Nevertheless, for me to get really excited about solving a problem, I have to know why the problem exists.
I have to peek behind the curtain, I have to see the data sausage being made.
I firmly believe that anything is interesting if you go deep enough into it. And that’s what I aim to do. I’ll work with (and probably annoy) senior stakeholders, I’ll read research and find out how others have solved similar problems, and I’ll reformulate the question from every angle I can imagine before I begin work.
It’s all an experiment
For my money, the scientific method is one of the all-time greatest inventions. Analytics work can, and should, be a scientific undertaking at heart.
There are many people out there who are trying to retrofit Agile methodologies to data science projects. But I think the iterative approach can coexist more harmoniously with machine learning if you think in terms of building on results.
Scientists stand on the shoulders of giants - data scientists should do the same. This is not a rallying cry to ‘change the world’, but it is a request to use the paradigm of experimentation in your work.
I work iteratively - I do my best to build something that serves its purpose, and then I try harder next time.
Now we’re on to the main topic - how I do data science. This framework has been developed to serve my needs as an analytics practitioner, and it has done that job very well. However, your mileage may vary.
It’ll come as no surprise to anyone who’s read the above that this basic outline includes two elements that may not apply to everyone: the end-to-end focus and an iterative approach. If these do not apply to you and how you work, I hope that you’ll still get some benefit from seeing how I work, and I’d be interested in hearing about your chosen methodologies.
Step One - What is the problem?
We’re not saving the best for last - this is the most crucial step of the whole framework. A lot of articles, tutorials, and even books ‘teach to the algorithm’ and assume that the right algorithm, applied indiscriminately, will solve your issue.
I can’t tell you how many risk-analysis projects I’ve worked on: fraud risk, credit risk, churn risk, theft risk. No two of them have been the same. Some companies want to mitigate risk; others want to account for it. In some businesses the risk will make or break them; in others the project is a footnote in their AI explorations.
These are all meaningful distinctions that the articles, tutorials, and books won’t tell you about.
The only way I’ve found of working through these options is to talk about them. You have to talk to the people who will be using the system and the people who will be affected by the system (yes, including customers).
Only once you have an idea of the people involved can you go any amount of the way to solving the problem. All business problems are people problems, no matter how ‘technically advanced’ they are.
Step Two - Data
So much of the online chatter focusses on feature engineering and data cleaning. No shit, you have to have data. Our problem now, though, is the sheer abundance of data, and working out which of it actually bears on the question.
This step of the project is about statistics. It’s about significance testing, hypothesis testing, factor analysis, power analysis and all that stuff which answers the question - will this project work?
If you haven’t captured (and cleansed) the right data, you’ll never make progress. This step may start and end in SQL (or Spark, or Hadoop etc.) but it takes a very important digression through statistical inference.
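To make the ‘will this project work?’ question concrete, here is a minimal sketch of a power analysis using only the Python standard library. It assumes a two-sided test comparing two proportions under the usual normal approximation. The function name `required_sample_size` and the example rates are hypothetical and chosen purely for illustration; they are not figures from any real project.

```python
import math
from statistics import NormalDist


def required_sample_size(p_baseline, p_expected, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    Uses the standard normal-approximation formula. All inputs are
    assumptions supplied by the analyst, not measured values.
    """
    if p_baseline == p_expected:
        raise ValueError("The two rates must differ for an effect to be detectable.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    p_bar = (p_baseline + p_expected) / 2          # pooled rate under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(
            p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
        )
    ) ** 2
    return math.ceil(numerator / (p_baseline - p_expected) ** 2)


# Hypothetical question: how many customers per group would we need in order
# to detect a churn rate moving from 10% to 12%?
n_per_group = required_sample_size(0.10, 0.12)
print("customers needed per group:", n_per_group)
```

If the answer is larger than the data you can realistically collect, that is worth knowing before any model gets fitted.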
Step Three - It’s alive
This is the part where all the data work comes together. You have a hypothesis and some statistically-sound data to back it up. Now comes fitting a model.
Fitting a model is the fun part of data science, I suppose. It’s the downhill rush to a working product. It’s the terminus of your data gathering, cleaning, and analysis efforts, and it rewards you with (hopefully) mildly accurate predictions.
Out of a few candidate algorithms, I’ll select the one that seems to have the most promise. I’ll pick it up and look at it from every possible angle: I’ll do cross-validation and grid search, I’ll try different implementations, and I’ll make metrics charts until I’m satisfied that the model is working.
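For readers who haven’t seen cross-validation and grid search spelled out, here is a dependency-free sketch of the mechanics. It is an illustration, not my production code: the toy dataset, the k-nearest-neighbours model, and every function name (`make_toy_data`, `knn_predict`, `cross_val_accuracy`, `grid_search`) are assumptions made up for this example. In practice you would more likely reach for a library such as scikit-learn, whose `GridSearchCV` wraps the same idea.

```python
import random
from statistics import mean


def make_toy_data(n=200, seed=0):
    """Hypothetical two-class dataset: two noisy clusters in 2D."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 if label else 0.0
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return data


def knn_predict(train, point, k):
    """Majority vote among the k nearest training points (use odd k)."""
    nearest = sorted(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def cross_val_accuracy(data, k, n_folds=5):
    """Mean held-out accuracy over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]  # every n_folds-th row is held out
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return mean(scores)


def grid_search(data, candidate_ks, n_folds=5):
    """Score every candidate hyperparameter and return the best one."""
    results = {k: cross_val_accuracy(data, k, n_folds) for k in candidate_ks}
    best_k = max(results, key=results.get)
    return best_k, results


data = make_toy_data()
best_k, results = grid_search(data, candidate_ks=[1, 3, 5, 9, 15])
for k, score in results.items():
    print(f"k={k:>2}  mean accuracy={score:.3f}")
print("best k:", best_k)
```

The point of the held-out folds is that each candidate is judged only on rows it never saw during fitting, which is what makes the comparison between candidates fair.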
Step Four - Communication
Very frequently, clients (or managers, or coworkers) will want to know what you’re going to do before you’ve done it. They want to know which methods you’ll use, what kind of results they can expect, and they’d like to be warned about any potential pitfalls.
Until I’ve done the previous step, I can never say for sure what method I’ll end up using. I won’t be able to say with any level of accuracy what the results will be. And I won’t be able to foresee any problems.
You might think that this sounds amateurish. Let me clarify: I do know what types of things can go wrong. I communicate frequently with my clients, and together we will have worked out a rough approach and settled on the performance benchmarks that they’re expecting to hit.
But, in my experience, all of the estimates you make before a machine learning project will be wrong. And so this stage is a communicative phase where I discuss everything I tried and, together with the client, analyse everything that went wrong.
I’ll tie my efforts back in to their core business metrics, to show the upside. And we’ll make a plan for implementation.
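One way to tie a model back to a business metric is to price each kind of prediction outcome. The sketch below is a hedged illustration only: the function `net_value`, the two hypothetical models, their confusion-matrix counts, and the monetary figures are all invented for the example and would need to be replaced with numbers agreed with the client.

```python
def net_value(tp, fp, fn, value_per_catch, cost_per_false_alarm, cost_per_miss):
    """Net monetary value of a classifier, given priced outcomes.

    tp, fp, fn are confusion-matrix counts; the three prices are
    assumptions that must come from the business, not from the data.
    """
    return tp * value_per_catch - fp * cost_per_false_alarm - fn * cost_per_miss


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Two hypothetical fraud models scored on the same 1,000 made-up transactions.
model_a = {"tp": 60, "fp": 5, "fn": 40, "tn": 895}
model_b = {"tp": 80, "fp": 40, "fn": 20, "tn": 860}

# Hypothetical prices: a caught fraud saves 100, a false alarm costs 10,
# and a missed fraud costs 100.
prices = {"value_per_catch": 100.0, "cost_per_false_alarm": 10.0, "cost_per_miss": 100.0}

for name, m in (("A", model_a), ("B", model_b)):
    value = net_value(m["tp"], m["fp"], m["fn"], **prices)
    print(f"model {name}: accuracy={accuracy(**m):.3f}  net value={value:.0f}")
```

With these made-up numbers, the less accurate model comes out ahead on net value, which is exactly the kind of result that is easier to discuss with a client than a raw accuracy score.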
Step Five - Implementation
The final phase of the project is implementation. At this point, we’ve accomplished the following things:
- We’ve understood the business problem from first principles and discussed it at length
- We’ve sourced, cleaned, and rigorously tested our dataset
- We’ve built several models, selected a candidate and optimised its performance
- We’ve discussed how it works, why it works, and how well it is going to work in the wild
Any of those previous steps can find their way into this last stage. That’s a lesson I’ve had to learn the hard way.
You don’t want to tune your hyperparameters just before going live, and you definitely do not want to have to explain the value of your predictive model to the CTO, just one more time, before you borrow one of their developers for a few hours to tie everything together.
This step is about implementation only. It’s about delivering a solution that works.
Using this process, I’ve spent a decade delivering solutions that work.
These systems work because the problem was well understood, the solution had buy-in, the data was sufficiently examined, and the model was optimised for business performance.
If you’re thinking about carrying out a machine learning project, you might be interested in my Deep Dive - a one-week training and consulting engagement that follows the above framework and puts you on the road to AI project success.