Something that comes up surprisingly often in my data work is the idea of capturing local (in the geographical sense) patterns, whether that's modelling an individual's likelihood to make a purchase based on their neighbours' activity (the classic Keeping Up With the Joneses!) or predicting crime risk using local crime history.
In the UK, this often means working with postcodes, as clients very rarely have longitudes and latitudes to hand. Luckily, there is a website called Freemaptools, which has a mapping of all (or very nearly all) UK postcodes to longitude and latitude pairs.
But once you have the mappings, what should you do?
The first thing I will try when I'm doing this kind of modelling is a COUNT() or AVG() over a few different radii surrounding a point of interest.
More often than not, though, there isn't just one point of interest: ideally you want to compare every point to every other point, which is a very computationally expensive query when it's done the wrong way!
Here’s how I will usually tackle this kind of thing:
Get the ukpostcodes.csv file into the database (bulk insert, pandas.to_sql, or some other method)
Create an index on the inserted table:
create index uk_post_codes_index on uk_post_codes(post_code(12))
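As a concrete sketch of these first two steps, here's a stdlib-only version using an in-memory SQLite database (a pandas.to_sql or bulk insert against MySQL works the same way). The three rows below are made up, but they mirror the id/postcode/latitude/longitude layout of the Freemaptools file:

```python
import csv
import io
import sqlite3

# Three made-up rows standing in for ukpostcodes.csv (same column layout).
UK_POSTCODES_CSV = """id,postcode,latitude,longitude
1,SW1A 1AA,51.5010,-0.1416
2,EC1A 1BB,51.5201,-0.0979
3,B1 1AA,52.4791,-1.9028
"""

con = sqlite3.connect(":memory:")
con.execute(
    "create table uk_post_codes (id integer, post_code text, latitude real, longitude real)"
)

# Step 1: bulk insert the postcode lookup table.
rows = list(csv.DictReader(io.StringIO(UK_POSTCODES_CSV)))
con.executemany(
    "insert into uk_post_codes values (:id, :postcode, :latitude, :longitude)",
    rows,
)

# Step 2: index the postcode column so later joins are lookups, not scans.
# (SQLite doesn't support MySQL-style prefix lengths like post_code(12),
# so this index simply covers the whole column.)
con.execute("create index uk_post_codes_index on uk_post_codes(post_code)")

# Sanity check: the lookup now resolves a postcode to coordinates.
lat, lon = con.execute(
    "select latitude, longitude from uk_post_codes where post_code = ?",
    ("SW1A 1AA",),
).fetchone()
```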
Join the postcode laden data to that to get the long/lats for the points of interest, making a new table:
create table points as select p1.id, p1.post_code, p2.latitude, p2.longitude from post_codes as p1 join uk_post_codes as p2 on p1.post_code = p2.post_code
Use a self-join to find the distance from each point to every other:
create table point_distances as select p1.id, p2.id as neighbour_id, p1.post_code, p2.post_code as neighbour_post_code, ( 3959 * acos ( cos ( radians(p1.latitude) ) * cos( radians( p2.latitude ) ) * cos( radians( p2.longitude ) - radians(p1.longitude) ) + sin ( radians(p1.latitude) ) * sin( radians( p2.latitude ) ) ) ) AS distance from points as p1 join points as p2 having distance between 0.001 and 25
(Aliasing the second id and post_code matters: without it, the duplicate column names would make the create table fail. The bare join with no on clause is deliberate, as we want the full cross join of points.)
That long formula is the spherical law of cosines, a rough approximation to the great-circle distance between the points in miles (3959 is the Earth's mean radius in miles). I have found it to be accurate enough over the distances I've needed, but YMMV.
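If you'd rather compute the distances outside the database, the same formula translates directly to Python. This is a sketch; the clamp is my addition to guard acos() against floating-point error, and the city-centre coordinates below are approximate values I'm using purely for illustration:

```python
import math

def spherical_distance_miles(lat1, lon1, lat2, lon2):
    """Rough great-circle distance via the spherical law of cosines."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    cos_angle = (math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1)
                 + math.sin(lat1) * math.sin(lat2))
    # Clamp so floating-point error on near-identical points can't push
    # acos() outside its [-1, 1] domain.
    return 3959 * math.acos(max(-1.0, min(1.0, cos_angle)))

# Approximate city-centre coordinates, purely for illustration.
london = (51.5074, -0.1278)
birmingham = (52.4862, -1.8904)
distance = spherical_distance_miles(*london, *birmingham)  # roughly 100 miles
```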
It's important to exclude points with a distance below 0.001, as the cross join matches each point with itself, which is usually an undesirable consequence.
You should set the maximum distance in the having clause to the maximum distance you want to use in your features. The benefit of this is that you only have to run this expensive query once, and can then use a simple where distance < X when necessary.
For example, using this point_distances table, you could join to a table describing purchases and find the most recent purchase within some distance:
select p1.id, datediff(@base_date, max(p2.purchase_date)) as most_recent_within_15_miles from point_distances as p1 join purchases as p2 on p1.neighbour_id = p2.id where p1.distance < 15 group by p1.id
select p1.id, datediff(@base_date, max(p2.purchase_date)) as most_recent_within_1_mile from point_distances as p1 join purchases as p2 on p1.neighbour_id = p2.id where p1.distance < 1 group by p1.id
Note the join is on the neighbour's id, not the point's own id: we want the purchases made around each point, not by it.
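To make the feature concrete end to end, here's a runnable SQLite sketch with made-up ids, distances, and dates. SQLite has no datediff, so I'm subtracting julianday values instead, and @base_date becomes a bound parameter; everything else follows the queries above:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Toy point_distances: point 1 has one near neighbour and one farther one.
con.execute(
    "create table point_distances (id integer, neighbour_id integer, distance real)"
)
con.executemany("insert into point_distances values (?, ?, ?)",
                [(1, 2, 0.4), (1, 3, 12.0)])

# Toy purchases, keyed by the purchasing point's id.
con.execute("create table purchases (id integer, purchase_date text)")
con.executemany("insert into purchases values (?, ?)",
                [(2, "2024-01-10"), (3, "2024-02-01")])

base_date = "2024-03-01"

def most_recent_within(miles):
    # SQLite equivalent of datediff(@base_date, max(p2.purchase_date)),
    # joining on the neighbour's id and filtering on the stored distance.
    return dict(con.execute("""
        select p1.id,
               cast(julianday(?) - julianday(max(p2.purchase_date)) as integer)
        from point_distances as p1
        join purchases as p2 on p1.neighbour_id = p2.id
        where p1.distance < ?
        group by p1.id
    """, (base_date, miles)))

wide = most_recent_within(15)  # both neighbours' purchases are candidates
narrow = most_recent_within(1)  # only the 0.4-mile neighbour counts
```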
I have found these geographic features to be of great use when predicting events that are heavily influenced by their surrounding environment. If you ever have access to data that has postcodes, I suggest you give calculating them a go and see how well they correlate with your target variable; you may have a network effect on your hands without even realising it!