Whether I’m speaking to a client, a student, or a distant family member, people always ask me for examples of how I’ve applied machine learning in the real world. It seems that even though we’re bombarded by articles and tutorials, some context is missing. In this article I’m going to discuss three (semi-)recent projects of mine so that you can better understand how machine learning and data science work in practice.

1. Scanning eBay for counterfeit / stolen goods

I can’t name my clients (I take NDAs very seriously), but let’s just say that this client is in the entertainment industry and has a dedicated team of people scouring listing sites (like eBay) for stolen or counterfeit goods that bear the client’s branding.

Of all the millions and millions of listings, only a very small fraction are of interest to the client. The system they have in place scrapes the webpages containing relevant keywords and stores the pictures in a particular folder. Each day, the team logs in to the custom-built web portal and looks at these pictures one after the other to determine whether they infringe on the business’ copyright.

This is a very expensive (tangible) and boring (intangible) process that could be much improved with a little machine learning.

A lot of the tutorials that discuss computer vision focus on quite un-businesslike problems. Cats vs Dogs, Cars vs Motorcycles, Hand-drawn digits, etc. But in the case of this project, we have to look for something in the images that signals copyright infringement.

Choosing a machine learning paradigm for a problem is often complex, and this project was no different. On the one hand, we have a large amount of data that’s been labelled as either infringement or non-infringement, which makes it very easy to see how this could become a classification problem.

On the other hand, because of the design of the web portal, retrieving negative examples is time consuming (it doesn’t store the images, just a link) and so we have a restricted dataset. In addition, the client’s logos often appear in the images alongside the logos of other businesses (copyright infringers are rarely subtle). There’s a strong chance that a sufficiently deep neural net would learn to recognise the presence of anything that looks like a logo, rather than the client’s logo specifically (in the same way that neural nets learn to see all dog breeds rather than just German Shepherds).

In this instance the key was training a shallower convolutional neural network for object detection instead of classification. This prevented the network from learning logo-general features and forced it to learn logo-specific features.

The training set took a lot of effort to produce. It was made up of a huge number of partial images cut from the original labelled images at various skews and alignments, each with a bounding box drawn around the object that we wanted to detect (the logo).
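
To make the idea of partial images with bounding boxes concrete, here is a minimal NumPy sketch of one augmentation step: cutting a random crop that still contains the logo and shifting the box into the crop's coordinates. The function name, the box format `(x_min, y_min, x_max, y_max)`, and the sizes are assumptions for illustration, not the code used on the project, and skewing is left out to keep it short.

```python
import numpy as np

def random_crop_with_box(image, box, crop_size, rng):
    """Return a random crop that fully contains `box`, plus the box in crop coordinates.

    image: H x W x C array; box: (x_min, y_min, x_max, y_max) in pixels;
    crop_size: (crop_h, crop_w).
    """
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    x_min, y_min, x_max, y_max = box

    # Valid positions for the crop's top-left corner: the crop must stay
    # inside the image and the whole box must stay inside the crop.
    left_lo = max(0, x_max - crop_w)
    left_hi = min(x_min, width - crop_w)
    top_lo = max(0, y_max - crop_h)
    top_hi = min(y_min, height - crop_h)
    if left_lo > left_hi or top_lo > top_hi:
        raise ValueError("no crop of this size contains the box")

    left = int(rng.integers(left_lo, left_hi + 1))
    top = int(rng.integers(top_lo, top_hi + 1))

    crop = image[top:top + crop_h, left:left + crop_w]
    new_box = (x_min - left, y_min - top, x_max - left, y_max - top)
    return crop, new_box

# Hypothetical blank image and logo position, purely for illustration.
rng = np.random.default_rng(0)
image = np.zeros((200, 300, 3), dtype=np.uint8)
logo_box = (120, 80, 180, 110)
crop, crop_box = random_crop_with_box(image, logo_box, (128, 128), rng)
```

Calling this many times per source image yields the same logo at many different positions, which is roughly what "various alignments" means in practice.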

2. Predicting if a retail store will be burglarised

A multinational retail chain hired my company to build a web application that was able to generate burglary risk scores based on historical instances of burglary.

Because of the nature of the business, the client felt that the risk of burglary increased with crime in the surrounding area more than it did because of any specific features of the store (easily breakable windows, etc.).

The purpose of the tool was not to prevent burglaries directly but to assess how to apportion a fixed budget over the entire portfolio of stores to reduce burglary losses in total. Because of this, the real metric the system had to deliver was the benefit over replacement of each potential security enhancement, which made it behave rather like a recommendation engine.

As in the previous example, working out how to frame the problem so it could best be solved was one of the biggest challenges. We tried survival analysis at first, customising the Cox Proportional Hazards model to accept multiple events. While it was great at providing insight into how protective the various security enhancements were, it handled geographic trends poorly, and those were an important factor the client wanted to understand.

Finally, our team settled on building individual classifiers (using XGBoost) for each of the target prediction periods the client wanted us to cover, using a fixed training set date range for each and with the understanding that the longer-period predictions were more likely to overestimate long-term risk.
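
One way to picture the one-classifier-per-period setup is to look at how the targets could be built. The sketch below is an assumption about the labelling, not the project's code: for a single store it marks each prediction period with 1 if any burglary falls inside it, counting from a hypothetical cutoff date that ends the fixed training range. Each resulting label column would then train its own classifier (the XGBoost step itself is not shown).

```python
from datetime import date, timedelta

def horizon_labels(burglary_dates, cutoff, horizons_days):
    """Label each horizon 1 if any burglary falls in (cutoff, cutoff + horizon]."""
    labels = {}
    for days in horizons_days:
        window_end = cutoff + timedelta(days=days)
        labels[days] = int(any(cutoff < d <= window_end for d in burglary_dates))
    return labels

# Hypothetical store history and horizons, purely for illustration.
store_burglaries = [date(2019, 3, 14), date(2019, 11, 2)]
cutoff = date(2019, 1, 1)
labels = horizon_labels(store_burglaries, cutoff, horizons_days=(30, 90, 365))
# labels == {30: 0, 90: 1, 365: 1}
```

Because a longer window always contains the shorter ones, the positive rate can only rise with the horizon, which is worth keeping in mind when comparing scores across periods.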

For the recommendation engine we used linear regression to calculate an estimated loss per retail store and used that metric (combined with geographic metrics) to generate specific recommendations.
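
As a rough illustration of turning estimated losses into recommendations under a fixed budget, the sketch below ranks candidate enhancements by estimated saving per unit of cost and buys them greedily. The greedy rule, the tuple layout, and all the numbers are assumptions made for this example; the real system also used geographic metrics, which are omitted here. Costs are assumed to be positive.

```python
def allocate_budget(candidates, budget):
    """Greedily pick enhancements with the highest estimated saving per unit cost.

    candidates: list of (store_id, enhancement, cost, estimated_saving) tuples.
    Returns the chosen (store_id, enhancement) pairs and the unspent budget.
    """
    ranked = sorted(candidates, key=lambda c: c[3] / c[2], reverse=True)
    chosen, remaining = [], budget
    for store_id, enhancement, cost, saving in ranked:
        if cost <= remaining:
            chosen.append((store_id, enhancement))
            remaining -= cost
    return chosen, remaining

# Made-up numbers for illustration only.
candidates = [
    ("store_a", "shutters", 4000, 9000),
    ("store_b", "alarm", 1500, 6000),
    ("store_c", "cctv", 3000, 3000),
]
chosen, remaining = allocate_budget(candidates, budget=6000)
# chosen == [("store_b", "alarm"), ("store_a", "shutters")], remaining == 500
```

A greedy pass like this is not guaranteed to be optimal, but it is easy to explain to a client, which matters when the output is a spending recommendation.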

3. Recommending content to 150 million monthly visitors

A very popular content website moved to an infinite scroll experience and had trouble working out which article to show next. The approach they came up with was very similar to reinforcement learning: they chose to present either one of the 10 most popular articles or a random selection.
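
The popular-or-random rule can be sketched in a few lines. This is only a guess at the shape of the logic: the function name and the `explore_prob` parameter are hypothetical, and the article does not say how often the site chose the random option.

```python
import random

def pick_next_article(popular_articles, all_articles, explore_prob, rng):
    """With probability explore_prob pick any article at random;
    otherwise pick one of the popular articles."""
    if rng.random() < explore_prob:
        return rng.choice(all_articles)
    return rng.choice(popular_articles)

# Hypothetical article IDs, purely for illustration.
rng = random.Random(0)
all_articles = [f"article_{i}" for i in range(100)]
popular_articles = all_articles[:10]
next_article = pick_next_article(popular_articles, all_articles, 0.2, rng)
```

Note that nothing in this rule depends on who the reader is, which is the gap the embedding-based approach was meant to close.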

Unfortunately this idea didn’t provide them with the conversion rate they’d hoped for and the infinite scroll significantly decreased their ad clicks (even though it was a better experience for the user).

Using word embeddings and user embeddings, we were able to build a collaborative filtering recommendation engine (in pure NumPy) that gave users articles tailored to their interests rather than just the most popular ones.

The biggest challenge with this project was not the “which paradigm?” question but rather the sheer volume of the data. We had to work very hard to make sure the system returned recommendations very quickly, because that’s the essence of the infinite scroll experience.
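
The scoring step of an embedding-based recommender can be sketched in NumPy as a cosine-similarity ranking between one user vector and every article vector. This is not the full collaborative filtering engine: how the embeddings were trained is not described in the article, and building the user vector as the mean of the articles already read is an assumption made for this toy example.

```python
import numpy as np

def recommend(user_vector, article_matrix, already_read, top_k):
    """Rank articles by cosine similarity to the user's embedding,
    skipping articles the user has already read."""
    article_norms = np.linalg.norm(article_matrix, axis=1)
    user_norm = np.linalg.norm(user_vector)
    # Small constant avoids division by zero for all-zero vectors.
    scores = article_matrix @ user_vector / (article_norms * user_norm + 1e-12)
    read_idx = np.array(sorted(already_read), dtype=int)
    scores[read_idx] = -np.inf
    return np.argsort(-scores)[:top_k]

# Toy 2-dimensional embeddings, purely for illustration.
article_matrix = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
read = {0}
user_vector = article_matrix[sorted(read)].mean(axis=0)
top = recommend(user_vector, article_matrix, read, top_k=2)
# top.tolist() == [1, 3]
```

Because the whole ranking is one matrix-vector product, it stays fast as long as the article matrix fits in memory.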

Another challenge was storing the diagnostic and auditing data for the model, as each input and output was many thousands of features long (typical of embeddings). Having a pessimistic outlook on whether or not our model would behave once deployed (an attitude I encourage everyone to share), we needed to work out a way of storing this information without storage costs spiralling out of control.

We decided to split the model by site (they have multiple sites), category, and user info to reduce the size of the embedding matrices. We also applied a variety of dimensionality reduction techniques to make this more manageable, and monitored the system closely to determine the ideal cutoff date for the backups.

I wanted to write this article to show that model selection is rarely the most important stage of a machine learning project (in the real world) and that client requests, data types and velocities, prediction usages, and even database optimisation all factor into the success of a project.
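
The article doesn't say which dimensionality reduction techniques were used, so the following is only one plausible example: PCA computed with NumPy's SVD, which stores a short vector per record plus a shared set of components instead of the full-length embedding. The function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pca_reduce(embeddings, n_components):
    """Project rows onto their top principal components.

    Returns the reduced rows plus the components and mean needed to
    approximately reconstruct the originals.
    """
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    reduced = centered @ components.T
    return reduced, components, mean

def pca_restore(reduced, components, mean):
    """Map reduced rows back to the original embedding space."""
    return reduced @ components + mean

# Synthetic data with only 3 underlying dimensions, purely for illustration.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
reduced, components, mean = pca_reduce(embeddings, n_components=3)
restored = pca_restore(reduced, components, mean)
```

On real embeddings the reconstruction is lossy, so the number of components becomes a trade-off between storage cost and how faithfully past inputs and outputs can be audited.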

I hope this was helpful - if you have any questions, please get in touch.