OpenStreetMap case study for Austin, Texas

First Steps

Our first step, after downloading and unzipping the data, which resulted in a 1.3 GB file named austin_texas.osm, is to take a quick peek at it. We do this with a quick head command:

λ head -n 15 austin_texas.osm 

We see lots of XML that we're about to start exploring. The most common tag looks to be node, and I'm starting to see some structure to it. It appears some of the child tags denote traffic signals and highways (via the "k" and "v" attributes).

<node id="26546028" lat="30.4773514" lon="-97.800246" version="26" timestamp="2015-07-02T23:17:00Z" changeset="32376887" uid="2936384" user="OneLeggedOne"/>
<node id="26546029" lat="30.4757513" lon="-97.8006601" version="12" timestamp="2011-07-26T17:24:44Z" changeset="8837814" uid="207745" user="NE2"/>
<node id="26546030" lat="30.4752677" lon="-97.8003939" version="6" timestamp="2012-10-09T01:13:38Z" changeset="13420621" uid="119881" user="claysmalley"/>
<node id="26546035" lat="30.4736773" lon="-97.7989442" version="5" timestamp="2009-03-09T08:46:52Z" changeset="767484" uid="105002" user="APD"/>
<node id="26546038" lat="30.4752704" lon="-97.7996651" version="10" timestamp="2012-03-14T05:28:59Z" changeset="10974544" uid="119881" user="claysmalley"/>
<node id="26546039" lat="30.472807" lon="-97.7989177" version="26" timestamp="2011-07-26T17:24:44Z" changeset="8837814" uid="207745" user="NE2"/>
<node id="26546041" lat="30.4794276" lon="-97.8033397" version="25" timestamp="2012-10-09T01:16:50Z" changeset="13420621" uid="119881" user="claysmalley">

        <tag k="highway" v="traffic_signals"/>

So we're starting to get a picture of what our data looks like and what it contains. Let's dive right in.

Looking at the breakdown of unique tag keys (the full table appears in the "Looking at the data" section below), the top result is addr:housenumber, representing about 85K house numbers in the Austin area, followed by streets, postcodes, and highway markers. Interestingly, we also see a lot of power-related features, around 6.5K of them. Further down the list we see almost 1,000 addr:city entries, which is interesting because most addresses here use Austin proper as the city name; there aren't many actual municipalities around the area. So let's take a look at how those cities break down with

select DISTINCT _value, count(id) FROM nodes_tags WHERE _key = 'addr:city' GROUP BY _value ORDER BY count(id) DESC LIMIT 25
 
 
City          Count
Austin          874
Round Rock       26
Kyle             20
Cedar Park       12
Buda             10

(The original, pre-cleanup result set contained values like "Austin, TX" and ran to 44 rows.)

In that original output, many city names also included ', TX'. Let's update those so we only get plain city names, and run the query again.

UPDATE nodes_tags SET _value = SUBSTRING_INDEX(_value,',',1) WHERE _key = 'addr:city' AND LOWER(_value) LIKE '%, tx'
 
 
City               Count
Austin               874
Round Rock            26
Kyle                  20
Cedar Park            12
Buda                  10
Pflugerville           8
Dripping Springs       5
Bastrop                5
Leander                4
Georgetown             3
Del Valle              2
Elgin                  2
Taylor                 2
Spicewood              2
Lakeway                2
Lago Vista             2
West Lake Hills        2
Wimberley              1
Manchaca               1
Cedar Creek            1
Hutto                  1
Westlake Hills         1
Barton Creek           1
Smithville             1
Maxwell                1

As we see, the vast majority are in Austin proper, with a relative handful of entries spread across the surrounding cities. Let's move on and see what kinds of stores are listed in these cities.

select DISTINCT _value, count(id) FROM nodes_tags WHERE _key = 'shop' GROUP BY _value ORDER BY count(id) DESC LIMIT 10
 

We see lots of convenience stores listed; let's take a look at the most popular ones.

SELECT _value, count(_value) from nodes_tags where _key = 'name' AND nodeId in ( select DISTINCT nodeId FROM nodes_tags WHERE _value = 'convenience' ) GROUP BY _value ORDER BY count(id) DESC
  • Note: When performing this query initially, we saw duplicates in the form of 7-11 and 7-Eleven. I have updated all the duplicates with MySQL UPDATE queries.
 

Restaurants

Curiously absent from the shop types above are restaurants. After some searching in the .osm file, it turns out amenity is the key used for things like restaurants, place_of_worship, schools, and so on. Let's rework our SQL to search for restaurants.

SELECT _value FROM nodes_tags where _key='name' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='restaurant') GROUP BY _value LIMIT 15
 
 
Restaurant
1431 Cafe
24
888 FUSION ASIAN CUISINE
888 Restaurant
A+A Sichuan China
ABGB
Ajishin
Antonio's Tex-Mex
Applebee's
Arandas
Arirang
Art of Tacos
Asia Cafe Sichuan
Aster's Ethiopian Restaurant
Asti

OK! We now have an alphabetical list of restaurants. Let's see which has the most locations in Austin.

SELECT DISTINCT _value, count(_value) FROM nodes_tags where _key='name' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='restaurant') GROUP BY _value ORDER BY count(_value) DESC LIMIT 15
 

Now let's take a look at the most popular cuisine types with

SELECT DISTINCT _value, count(_value) FROM nodes_tags where _key='cuisine' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='restaurant') GROUP BY _value ORDER BY count(_value) DESC LIMIT 5
 

No surprise here: you can't throw a rock in this town without hitting a Mexican food place.

Let's take one more look at our data, this time at churches. We'll get the number of churches as well as the number of different religions practiced here in Austin.

SELECT count(id)  FROM nodes_tags WHERE _key='name' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='place_of_worship') ;

And we see we have 409(!) churches in the Austin area. With a population of 885,400 in Austin, Texas, that works out to roughly one church for every 2,165 people.

Now let's take a look at the different religions these churches adhere to, following the same pattern as the queries above, this time on the religion key:

SELECT DISTINCT _value, count(_value) FROM nodes_tags where _key='religion' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='place_of_worship') GROUP BY _value ORDER BY count(_value) DESC

Religion                 Count
christian                  383
buddhist                     6
muslim                       2
jewish                       2
unitarian_universalist       1
bahai                        1
eckankar                     1
hindu                        1

We see an overwhelming number of Christian churches, with the second most popular religion being Buddhism. That might come as a surprise to people who don't live here, but there is a sizable Buddhist population in Austin, Texas.

Additional Ideas

Through our SQL exploration we have looked at a great many of the places that have been mapped by OpenStreetMap contributors.

One other neat idea might be to overlay a population map on the data (using a radius around a set of lat/long pairs) to see what kinds of features appear more often where populations cluster, and what lies in the more rural regions.

A similar idea would be to map traffic accidents against the OSM data set and see whether there is any correlation between areas dense in a given feature (traffic lights, bridge crossings, railway tracks) and traffic accidents.

The biggest challenge with this would be writing the radius code for a given latitude and longitude: does a given lat/long pair fall within a 1-mile circle of another lat/long pair? This would take some research to get the algorithm exactly right; a rough sketch of one approach is shown below.
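As a starting point, here is a minimal sketch of such a check using the haversine formula (the 3,959-mile Earth radius is an approximation, and the function names are just illustrative):

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3959.0  # mean Earth radius, an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long pairs, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def within_one_mile(lat1, lon1, lat2, lon2):
    return haversine_miles(lat1, lon1, lat2, lon2) <= 1.0

# Two points in downtown Austin, roughly half a mile apart
print(within_one_mile(30.2672, -97.7431, 30.2747, -97.7404))  # True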

Wrapping it up

Well, we've done a lot of exploring. Let's wrap it up by looking at a breakdown of contributions made to the OSM dataset.

Total Nodes: 6,363,868
Total Tags: 2,360,459

Most Contributions by a single UID: 2,473,754
Most Contributions by year: 2015 --- 5,608,684
Most Contributions by month: November --- 4,425,948
Most Contributions by postcode: 78645 --- 10,808
Most Contributions by user: patisilva_atxbuildings --- 2,473,754
Most Contributions by version: 1 --- 5,907,984
Most Contributions by changeset: 2674995 --- 57826

Number of highway markers: 13,952
Number of traffic_signals: 2,088
Number of bus stops: 1,039
Number of electrical towers: 3,517

Top tag counts per node (results shown in the final table at the end of this write-up), with:

SELECT _value, count(nodes_tags.id) as cnt FROM nodes LEFT JOIN nodes_tags ON nodes_tags.nodeId =nodes.id  GROUP BY nodes.id ORDER BY cnt DESC    LIMIT 20

Wrangling the data

First, knowing that street names are likely to be a problem area, we grep for a known abbreviation like 'St.' with

λ grep -i 'v="st\."' austin_texas.osm 
Right away I notice that some names _start_ with St., indicating Saint rather than Street (such as Saint Edwards University). So let's see whether St. appears inside the addr:street values:
λ grep -i 'k="addr:street" v=".*st\."' austin_texas.osm

Now we see that, within addr:street tags, St. only ever appears as an abbreviation for Street. Running the same check for Rd. and Pkwy, I find no other problems, so street endings are looking pretty consistent already. However, directions seem to be abbreviated as N, S, E & W, so we will expand those to full cardinal directions as well; a sketch of that cleanup follows.
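A minimal sketch of that direction cleanup in Python (the mapping and the update_street_name helper are illustrative, not the exact code used):

import re

# Abbreviated cardinal directions we want to expand (assumed mapping)
DIRECTION_MAPPING = {
    "N": "North",
    "S": "South",
    "E": "East",
    "W": "West",
}

# Matches a single leading N/S/E/W token, with or without a trailing dot
DIRECTION_RE = re.compile(r'^([NSEW])\.?\s+')

def update_street_name(name):
    """Expand a leading abbreviated direction, e.g. 'N Lamar Blvd' -> 'North Lamar Blvd'."""
    match = DIRECTION_RE.match(name)
    if match:
        full = DIRECTION_MAPPING[match.group(1)]
        name = DIRECTION_RE.sub(full + " ", name, 1)
    return name

print(update_street_name("N Lamar Blvd"))   # North Lamar Blvd
print(update_street_name("W. 6th Street"))  # West 6th Street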

Zipcodes

While inserting, we also notice inconsistent postal codes, with many containing the state abbreviation TX as a prefix. We will use regular expressions and substrings to clean up the postal codes as we iterate, by

  • Removing any non-numeric characters
  • Restricting zipcodes to only 5 digits

This consolidated the zip codes into one format, updating things like 78758-7008 to 78758 and "TX, 78613" to 78613.
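A minimal sketch of that cleanup (clean_postcode is an illustrative helper name, not necessarily the exact function used):

import re

def clean_postcode(raw):
    """Strip anything non-numeric and keep the first five digits."""
    digits = re.sub(r'[^\d]+', '', raw)
    return digits[:5]

print(clean_postcode("78758-7008"))  # 78758
print(clean_postcode("TX, 78613"))   # 78613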

Phone Numbers

Also present is a wide variety of phone number formats, ranging from parentheses () to dashes (-) to dots (.). We will consolidate all of these by removing anything non-numeric from the phone number, so the only thing left is a 10-digit number representing the area code, exchange, and line number.

We did this by:

  • Using a regular expression to replace anything non-numeric ( [^\d]+ ) with an empty string.
  • Removing the country code (1), if present.

This consolidated the phone numbers into one format, updating things like +1 512 322-9168 to 5123229168 and 512-442-0655 to 5124420655.
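A minimal sketch of that cleanup (clean_phone is an illustrative helper name):

import re

def clean_phone(raw):
    """Strip formatting and a leading US country code, leaving a 10-digit number."""
    digits = re.sub(r'[^\d]+', '', raw)       # drop parentheses, dashes, dots, spaces
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]                   # drop the country code if present
    return digits

print(clean_phone("+1 512 322-9168"))  # 5123229168
print(clean_phone("512-442-0655"))     # 5124420655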

Now we are ready to start iterating through our data and inserting it into a MySQL database we will call austin_osm.

Creating the Schema

Now that we have functions to clean the data, let's set up MySQL. First we need to create the database and its tables.

λ mysqladmin create austin_osm  -uroot -pm
λ mysql -uroot -pm austin_osm < data_wrangling_schema.sql

Right away we see an error: ERROR 1064 (42000) at line 12: You have an error in your SQL syntax;

It seems we need to adjust our creation script to work with mysql, so we open the schema file and update our definitions.

Since we will be recreating this several times as we fix errors, we combine all the commands into one line so we can run them over and over (using the fish shell):

λ  yes | mysqladmin drop austin_osm  -uroot -pm; mysqladmin create austin_osm -uroot -pm; mysql -uroot -pm austin_osm < data_wrangling_schema.sql

After some investigation, it appears 'key', 'type', 'value', and 'position' are all keywords in MySQL, so we prepend them with _ to avoid collisions. In addition, the primary key syntax is different for MySQL: we need to add PRIMARY KEY(id) to the end of the table creation definition to get it to work. Finally our table definitions are correct, and we can begin inserting data.
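For reference, roughly what the adjusted nodes_tags definition ends up looking like, kept here as a Python string for consistency with the rest of the code in this write-up (the column names mirror the queries used throughout; the types and sizes are assumptions):

# Illustrative only: column names follow the queries in this write-up
# (id, nodeId, _key, _value); the exact types in data_wrangling_schema.sql may differ.
NODES_TAGS_DDL = """
CREATE TABLE nodes_tags (
    id      INT NOT NULL AUTO_INCREMENT,
    nodeId  BIGINT,
    _key    VARCHAR(255),
    _value  VARCHAR(255),
    _type   VARCHAR(255),
    PRIMARY KEY (id)
)
"""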

Inserting Data

Since we are using MySQL, our first task is to get a Python MySQL driver. I am using an Anaconda distribution, so I will install MySQLdb with

λ pip install mysql

I get an error about mysql_config not being found, so we install libmysqlclient-dev from Ubuntu in order to build the driver.

λ sudo apt-get install libmysqlclient-dev 

With that in place and our Python MySQL driver installed, we get to the code.
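The full insertion script isn't reproduced here, but in outline it looks something like the sketch below. The nodes column names are assumptions based on the node attributes we saw in the head output, and this is where the cleaning helpers sketched above would be applied:

import xml.etree.ElementTree as ET
import MySQLdb

conn = MySQLdb.connect(user='root', passwd='m', db='austin_osm')  # -pm above means the password is 'm'
cur = conn.cursor()

# Stream the large OSM file instead of loading it all into memory at once
for event, elem in ET.iterparse('austin_texas.osm', events=('end',)):
    if elem.tag == 'node':
        cur.execute(
            "INSERT INTO nodes (id, lat, lon, `user`, uid, version, changeset, `timestamp`) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
            (elem.get('id'), elem.get('lat'), elem.get('lon'), elem.get('user'),
             elem.get('uid'), elem.get('version'), elem.get('changeset'),
             elem.get('timestamp')))
        for tag in elem.findall('tag'):
            # clean_postcode / clean_phone style helpers would run on tag.get('v') here
            cur.execute(
                "INSERT INTO nodes_tags (nodeId, _key, _value) VALUES (%s, %s, %s)",
                (elem.get('id'), tag.get('k'), tag.get('v')))
        elem.clear()  # release the element so memory use stays bounded

conn.commit()
conn.close()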

The first error we hit when inserting data comes when we try to escape the data in the OSM file. It's an unusual one:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0119' in position 16: ordinal not in range(128)

My initial idea was to just wrap this in a try/except block and ignore anything with Unicode characters, since my system was not set up to handle them.

What I wound up doing instead was changing the Python code and the database schema to handle Unicode, so we could use the full data set.
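The general shape of that fix is to open the connection with a UTF-8 character set and convert the tables to match; a sketch, with connection details and table names as assumed above:

import MySQLdb

# Open the connection in UTF-8 so Python unicode strings round-trip cleanly
conn = MySQLdb.connect(user='root', passwd='m', db='austin_osm',
                       charset='utf8', use_unicode=True)
cur = conn.cursor()

# Convert the existing tables so their text columns can store non-ASCII characters
cur.execute("ALTER TABLE nodes      CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci")
cur.execute("ALTER TABLE nodes_tags CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci")
conn.close()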

-SIGKILL- Problems

Another problem I kept hitting was that my long-running Python program kept dying with an unexplained SIGKILL. After some research I found that this is a common OOM (out of memory) symptom: the Python process runs until it exhausts all the memory on the system and is force-killed by the kernel, rather than being allowed to take the entire system down.

After manually adding more swap space to my system (20 GB!), the program was able to run to completion.

λ sudo swapoff -a
λ sudo dd if=/dev/zero of=/swapfile bs=1M count=20000
λ sudo mkswap /swapfile; sudo swapon /swapfile

Final sizes

592M    ./austin_osm_data.sql
136M    ./austin_sample.osm
1.4G    ./austin_texas.osm

And the size of the resulting MySQL database:

+-----------------------------+--------------+
| db_name                     |       total  |
+-----------------------------+--------------+
| austin_osm                  |  811,892,736 |
+-----------------------------+--------------+

Looking at the data

First, let's take a top-level look at the unique tag keys in the system with:

select DISTINCT _key, count(id) FROM nodes_tags GROUP BY _key ORDER BY count(id) DESC LIMIT 25;
 
 
Key Name             Count
addr:housenumber     85054
addr:street          85032
addr:postcode        80957
highway              15616
power                 6505
created_by            5051
name                  4705
amenity               3797
coa:place_id          3730
public_transport      1745
ele                   1412
source                1358
gnis:feature_id       1258
gnis:created          1033
gnis:state_id          994
gnis:county_id         994
addr:city              992
addr:state             897
bus                    810
shop                   692
 
 
And these are the results of the "top tag counts per node" query from the Wrapping it up section:

Node Name                        Tag Count
Texas Disposal Systems (MRF)            17
Java Noodles                            17
Conscious Cravings                      15
McDonald's                              15