# Geographic features in SQL

Something that comes up surprisingly often in my data work is the idea of capturing local (in the geographical sense) patterns. Whether it be modelling an individual’s likelihood to make a purchase based on their neighbours activity (classical Keeping Up With the Joneses!) or predicting crime risk using local crime history.

In the UK, this often means working with postcodes, as clients very rarely have longitudes and latitudes to hand. Luckily, there is a website called Freemaptools, which has a mapping of all (or very nearly all) UK postcodes to longitude and latitude pairs.

But once you have the mappings, what should you do?

The first thing I will try when I’m doing this kind of modelling is a COUNT() or AVG() over some different radii surrounding a point of interest.

Most often than not though, there’s not just one point of interest, you ideally want to compare every point to each other. A very computationally expensive query when it’s done the wrong way!

Here’s how I will usually tackle this kind of thing:

### Step One

Get the ukpostcodes.csv file into the database (bulk insert, pandas.to_sql, or some other method)

### Step Two

Create an index on the inserted table:

```
create index post_codes_index on post_codes(postcode(12))
```

### Step Three

Join the postcode laden data to that to get the long/lats for the points of interest, making a new table:

```
create table points
as
select p1.id, p1.post_code, p2.latitude, p2.longitude
from post_codes as p1
join uk_post_codes as p2 on p1.post_code = p2.post_code
```

### Step Four

Use a self-join to find the distance from each point to every other:

```
create table point_distances
as
select p1.id, p2.id, p1.post_code, p2.post_code, (
3959 * acos (
cos ( radians(p1.latitude) )
* cos( radians( p2.latitude ) )
* cos( radians( p2.longitude ) - radians(p1.longitude) )
+ sin ( radians(p1.latitude) )
* sin( radians( p2.latitude ) )
)
) AS distance
from points as p1
join points as p2
having distance between 0.001 and 25
```

That long formula is a rough approximation to the distance between the points in
miles. I have found it to be accurate *enough* over the distances I’ve needed but
YMMV.

It’s important to exclude points with a distance of <0.001, as this will match each point with itself, which is usually an undesirable consequence.

You should set the maximum distance in the having clause to be the maximum distance
you want to use in your features. The benefit of this is that you only have to run
this query once, and you can use a simple `where distance < X`

when necessary.

For example, using this `point_distances`

table, you could join to a table
describing purchases and find the most recent purchase within some distance:

```
select p1.id,
datediff(@base_date, max(p2.purchase_date)) as most_recent_within_15_miles
from point_distances as p1
join purchases as p2 on p1.id = p2.id
where p1.distance < 15
group by p1.id
```

And:

```
select p1.id,
datediff(@base_date, max(p2.purchase_date)) as most_recent_within_1_mile
from point_distances as p1
join purchases as p2 on p1.id = p2.id
where p1.distance < 1
group by p1.id
```

I have found these geographic features to be of great use when predicting events that are heavily influenced by their surrounding environment. If you ever have access to data that has postcodes, I suggest you give calculating them a go and see how well they correlate with your target variable, you may have a network effect on your hands without even realising it!