# OpenStreetMap case study for Austin Texas

Austin street map data, built from the OpenStreetMap austin_texas.osm extract.

## First Steps

Our first step, after downloading and unzipping the data (which resulted in a 1.3 GB file named austin_texas.osm), is to take a quick peek at it with the head command:

λ head -n 15 austin_texas.osm

We see lots of XML that we're about to start exploring. The most common tag looks to be node, and I'm starting to see some structure to it. It appears some of the tags denote traffic signals and highways (via the "k" and "v" attributes).

<node id="26546028" lat="30.4773514" lon="-97.800246" version="26" timestamp="2015-07-02T23:17:00Z" changeset="32376887" uid="2936384" user="OneLeggedOne"/>
<node id="26546029" lat="30.4757513" lon="-97.8006601" version="12" timestamp="2011-07-26T17:24:44Z" changeset="8837814" uid="207745" user="NE2"/>
<node id="26546030" lat="30.4752677" lon="-97.8003939" version="6" timestamp="2012-10-09T01:13:38Z" changeset="13420621" uid="119881" user="claysmalley"/>
<node id="26546035" lat="30.4736773" lon="-97.7989442" version="5" timestamp="2009-03-09T08:46:52Z" changeset="767484" uid="105002" user="APD"/>
<node id="26546038" lat="30.4752704" lon="-97.7996651" version="10" timestamp="2012-03-14T05:28:59Z" changeset="10974544" uid="119881" user="claysmalley"/>
<node id="26546039" lat="30.472807" lon="-97.7989177" version="26" timestamp="2011-07-26T17:24:44Z" changeset="8837814" uid="207745" user="NE2"/>
<node id="26546041" lat="30.4794276" lon="-97.8033397" version="25" timestamp="2012-10-09T01:16:50Z" changeset="13420621" uid="119881" user="claysmalley">

<tag k="highway" v="traffic_signals"/>


So we're starting to get a picture of what our data looks like and what it contains. Let's dive right in.

As we see, the top result is addr:housenumber, representing 85K houses in the Austin area, followed by streets, postcodes, and highway markers. Interestingly, we also see a lot of power lines, almost 6K of them. Further down the list we see almost 1,000 city items, which is interesting because Austin typically uses only Austin proper as its city name; there aren't many actual municipalities around the area. So let's take a look at how those cities break down with

select DISTINCT _value, count(id) FROM nodes_tags WHERE _key = 'addr:city' GROUP BY _value ORDER BY count(id) DESC LIMIT 25


| City | Count |
|------|-------|
| Austin | 874 |
| Round Rock | 26 |
| Kyle | 20 |
| Cedar Park | 12 |
| Buda | 10 |

(The original output contained 44 rows, including entries like "Austin, TX".)

Many of those city names included a trailing ", TX". Let's update these so we only get pure city names and run the query again:

UPDATE nodes_tags SET _value = SUBSTRING_INDEX(_value,',',1) WHERE _key = 'addr:city' AND LOWER(_value) LIKE '%, tx'


| City | Count |
|------|-------|
| Austin | 874 |
| Round Rock | 26 |
| Kyle | 20 |
| Cedar Park | 12 |
| Buda | 10 |
| Pflugerville | 8 |
| Dripping Springs | 5 |
| Bastrop | 5 |
| Leander | 4 |
| Georgetown | 3 |
| Del Valle | 2 |
| Elgin | 2 |
| Taylor | 2 |
| Spicewood | 2 |
| Lakeway | 2 |
| Lago Vista | 2 |
| West Lake Hills | 2 |
| Wimberley | 1 |
| Manchaca | 1 |
| Cedar Creek | 1 |
| Hutto | 1 |
| Westlake Hills | 1 |
| Barton Creek | 1 |
| Smithville | 1 |
| Maxwell | 1 |
As we see, the vast majority are in Austin proper, with the remaining ~115 nodes spread across the surrounding cities. Let's move on and see what kinds of stores are listed in these cities.

select DISTINCT _value, count(id) FROM nodes_tags WHERE _key = 'shop' GROUP BY _value ORDER BY count(id) DESC LIMIT 10


We see lots of convenience stores listed; let's take a look at the most popular ones.

SELECT _value, count(_value) FROM nodes_tags WHERE _key = 'name' AND nodeId IN ( SELECT DISTINCT nodeId FROM nodes_tags WHERE _key = 'shop' AND _value = 'convenience' ) GROUP BY _value ORDER BY count(_value) DESC

• Note: When performing this query initially, we saw duplicates in the form of 7-11 and 7-Eleven. I have updated all duplicates with MySQL UPDATE queries.

# Restaurants

Curiously absent from the types of stores above are restaurants. After some searching in the .osm file, it turns out amenity is the category for things like restaurant, place_of_worship, school, and so on. Let's rerun our SQL to search for restaurants.

SELECT _value FROM nodes_tags where _key='name' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='restaurant') GROUP BY _value LIMIT 15


| Value |
|-------|
| 1431 Cafe |
| 24 |
| 888 FUSION ASIAN CUISINE |
| 888 Restaurant |
| A+A Sichuan China |
| ABGB |
| Ajishin |
| Antonio's Tex-Mex |
| Applebee's |
| Arandas |
| Arirang |
| Art of Tacos |
| Asia Cafe Sichuan |
| Aster's Ethiopian Restaurant |
| Asti |

Ok! We now have an alphabetical list of restaurants. Let's see which have the most locations in Austin.

SELECT DISTINCT _value, count(_value) FROM nodes_tags where _key='name' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='restaurant') GROUP BY _value ORDER BY count(_value) DESC LIMIT 15


Now let's take a look at the most popular cuisine types with

SELECT DISTINCT _value, count(_value) FROM nodes_tags where _key='cuisine' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='restaurant') GROUP BY _value ORDER BY count(_value) DESC LIMIT 5


No surprise here, you can't throw a rock in this town without hitting a Mexican food place.

Let's take one more look at our data, this time at churches. We'll get the number of churches as well as the number of different religions practiced here in Austin.

SELECT count(id)  FROM nodes_tags WHERE _key='name' AND nodeId in ( SELECT DISTINCT nodeId from nodes_tags where _key = 'amenity' and _value='place_of_worship') ;


And we see we have 409(!) places of worship in the Austin area. With a population of 885,400 in Austin, Texas, that means roughly one for every 2,165 people.

Now let's take a look at the different religions these churches adhere to.
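The query behind this breakdown wasn't recorded above; a sketch that mirrors the restaurant queries, assuming OSM's standard religion key (SQLite stands in for MySQL here so the example is runnable, and the sample rows are made up):

```python
import sqlite3

# Hypothetical miniature nodes_tags table; the 'religion' key is an
# assumption based on OSM tagging conventions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes_tags (id INTEGER, nodeId INTEGER, _key TEXT, _value TEXT)")
conn.executemany("INSERT INTO nodes_tags VALUES (?,?,?,?)", [
    (1, 100, "amenity", "place_of_worship"),
    (2, 100, "religion", "christian"),
    (3, 101, "amenity", "place_of_worship"),
    (4, 101, "religion", "buddhist"),
    (5, 102, "amenity", "place_of_worship"),
    (6, 102, "religion", "christian"),
])

# Same shape as the restaurant queries: filter by key, restrict to
# nodes tagged amenity=place_of_worship, group and count.
result = conn.execute("""
    SELECT _value, COUNT(_value) FROM nodes_tags
    WHERE _key = 'religion'
      AND nodeId IN (SELECT DISTINCT nodeId FROM nodes_tags
                     WHERE _key = 'amenity' AND _value = 'place_of_worship')
    GROUP BY _value ORDER BY COUNT(_value) DESC
""").fetchall()
print(result)  # [('christian', 2), ('buddhist', 1)]
```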

| Value | Count |
|-------|-------|
| christian | 383 |
| buddhist | 6 |
| muslim | 2 |
| jewish | 2 |
| unitarian_universalist | 1 |
| bahai | 1 |
| eckankar | 1 |
| hindu | 1 |
We see an overwhelming number of Christian churches, with the second most popular religion being Buddhism. This might come as a surprise to people who don't live here, but there is a very large Buddhist population in Austin, Texas.

Through our SQL exploration we have looked at a great many of the places that have been mapped by OpenStreetMap contributors.

Some other neat ideas might be to overlay a population map on the data (using a radius from a set of lat/long pairs) to see what kinds of things appear more often where populations cluster, and what lies in more rural regions.

Another similar idea would be to map traffic accidents against the OSM data set and see if there is any correlation between areas dense in some feature X (traffic lights, bridge crossings, railway tracks) and traffic accidents.

The biggest challenge with this would be the radius code for a given latitude and longitude: does a given lat/long pair fall within a 1-mile circle of another lat/long pair? This would take some research to find out exactly how the algorithm should work.
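The radius test described above is a standard great-circle computation; a minimal sketch using the haversine formula (function names are illustrative):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long pairs."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def within_radius(lat1, lon1, lat2, lon2, miles=1.0):
    """Does the second point fall within `miles` of the first?"""
    return haversine_miles(lat1, lon1, lat2, lon2) <= miles

# Two of the sample nodes from the head output are well under a mile apart:
print(within_radius(30.4773514, -97.800246, 30.4757513, -97.8006601))  # True
```

For millions of nodes, a cheap bounding-box prefilter on lat/long before the exact haversine test would keep this tractable.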

# Wrapping it up

Well, we've done a lot of exploring; let's wrap it up by looking at a breakdown of contributions made to the OSM dataset.

Total Nodes: 6,363,868
Total Tags: 2,360,459

Most Contributions by a single UID: 2,473,754
Most Contributions by year: 2015 --- 5,608,684
Most Contributions by month: November --- 4,425,948
Most Contributions by postcode: 78645 --- 10,808
Most Contributions by user: patisilva_atxbuildings --- 2,473,754
Most Contributions by version: 1 --- 5,907,984
Most Contributions by changeset: 2674995 --- 57,826

Number of highway markers: 13,952
Number of traffic_signals: 2,088
Number of bus stops: 1,039
Number of electrical towers: 3,517

Top tag counts per node with:

SELECT _value, count(nodes_tags.id) as cnt FROM nodes LEFT JOIN nodes_tags ON nodes_tags.nodeId =nodes.id  GROUP BY nodes.id ORDER BY cnt DESC    LIMIT 20


## Wrangling the data

First, knowing that street names are likely to be a problem, we grep for a known abbreviation like 'St.' with

λ grep -i 'v="st\."' austin_texas.osm
Right away I notice that some entries _start_ with St., indicating Saint rather than Street (such as Saint Edwards University). So let's see whether these abbreviations occur in actual street names:
λ grep -i 'k="addr:street" v=".*st\."' austin_texas.osm

Now we see St. is used only as an abbreviation for Street in addr:street tags. Executing this for Rd. and Pkwy., I find no other problems, so street endings are looking pretty consistent already. However, all directions seem to be abbreviated as N, S, E & W, so we will clean those up as well to be full cardinal directions.
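The direction cleanup can be sketched as follows; the mapping and regex here are illustrative assumptions, not the original wrangling code:

```python
import re

DIRECTIONS = {"N": "North", "S": "South", "E": "East", "W": "West"}

def expand_directions(name):
    """Expand standalone single-letter directions in a street name."""
    return re.sub(
        r"(?<![\w.])([NSEW])(?![\w.])",   # a lone N/S/E/W, not part of a word or 'St.'
        lambda m: DIRECTIONS[m.group(1)],
        name,
    )

print(expand_directions("N Lamar Blvd"))    # North Lamar Blvd
print(expand_directions("Congress Ave S"))  # Congress Ave South
print(expand_directions("Main St"))         # Main St (unchanged)
```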

#### Zipcodes

While inserting, we also notice inconsistent postal codes, many containing the state abbreviation TX as a prefix. We will use regular expressions and substrings to clean up the postal codes as we iterate, by:

• Removing any non-numeric characters
• Restricting zipcodes to only 5 digits

This consolidated the zipcodes into one format, updating things like 78758-7008 to 78758 and "TX, 78613" to 78613.
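A minimal sketch of the two-step cleanup described above (clean_zipcode is an illustrative name, not the original code):

```python
import re

def clean_zipcode(raw):
    """Normalize a postal code to a bare five-digit zipcode."""
    digits = re.sub(r"\D", "", raw)  # 1. remove any non-numeric characters
    return digits[:5]                # 2. restrict to the first five digits

print(clean_zipcode("78758-7008"))  # 78758
print(clean_zipcode("TX, 78613"))   # 78613
```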

#### Phone Numbers

Also present is a wide variety of phone number formats, ranging from parentheses () to dashes (-) to dots (.). We will consolidate all these styles by removing anything non-numeric from the phone number, so the only thing left is a 10-digit number representing the area code, exchange, and subscriber number.

We did this by:

• Using a regular expression to replace anything non-numeric ( [^\d]+ ) with blanks.
• Removing the country code (1), if present.

This consolidated the phone numbers into one format, updating things like +1 512 322-9168 to 5123229168 and 512-442-0655 to 5124420655.
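The two steps above can be sketched like so (clean_phone is an illustrative name, not the original code):

```python
import re

def clean_phone(raw):
    """Strip a phone number down to a bare 10-digit string."""
    digits = re.sub(r"[^\d]+", "", raw)           # drop anything non-numeric
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                        # remove the country code
    return digits

print(clean_phone("+1 512 322-9168"))  # 5123229168
print(clean_phone("512-442-0655"))     # 5124420655
```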

Now we are ready to start iterating through our data and inserting it into our MySQL database, which we will call austin_osm.

## Creating the Schema

Now that we have functions to clean the data, let's move to MySQL. First we need to create the database and the tables.

λ mysqladmin create austin_osm  -uroot -pm
λ mysql -uroot -pm austin_osm < data_wrangling_schema.sql


Right away we see an error: ERROR 1064 (42000) at line 12: You have an error in your SQL syntax;

It seems we need to adjust our creation script to work with MySQL, so we open the schema file and update our definitions.

Since we will be re-creating this several times as we fix errors, we combine all the commands into one line so we can execute it over and over (using the fish shell):

λ  yes | mysqladmin drop austin_osm  -uroot -pm; mysqladmin create austin_osm -uroot -pm; mysql -uroot -pm austin_osm < data_wrangling_schema.sql


After some investigation, it appears 'key', 'type', 'value', and 'position' are all reserved words in MySQL, so we will prepend them with _ to avoid collisions. In addition, the primary-key syntax is different in MySQL; we need to add PRIMARY KEY(id) to the end of the table-creation definition to get it to work. Finally our table definitions are correct, and we can begin inserting data.
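The corrected schema isn't reproduced above, so here is a guess at the adjusted nodes_tags definition with the _ prefixes and trailing PRIMARY KEY(id). SQLite stands in for MySQL so the snippet can run anywhere, and the column types are assumptions:

```python
import sqlite3

# A guess at the adjusted nodes_tags definition after the fixes above:
# reserved words prefixed with _, PRIMARY KEY(id) appended at the end.
SCHEMA = """
CREATE TABLE nodes_tags (
    id INTEGER,
    nodeId INTEGER,
    _key TEXT,
    _value TEXT,
    _type TEXT,
    PRIMARY KEY (id)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)  # SQLite accepts the same trailing PRIMARY KEY(id) syntax
names = [row[1] for row in conn.execute("PRAGMA table_info(nodes_tags)")]
print(names)  # ['id', 'nodeId', '_key', '_value', '_type']
```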

## Inserting Data

Since we are using MySQL, our first task is to get a MySQL driver. I am using an Anaconda distribution, so I will install MySQLdb with

λ pip install mysql

We get an error that mysql_config was not found, so we install libmysqlclient-dev from Ubuntu in order to build the driver.

λ sudo apt-get install libmysqlclient-dev

With the system library in place and our Python MySQL driver installed, we get to the code.
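The insertion code itself isn't shown above; a sketch of the parsing half, turning one node element into the parameter tuples a driver's executemany() would take (parse_node and the column choices are my assumptions, not the original code):

```python
import xml.etree.ElementTree as ET

# One <node> from the head output earlier, trimmed for brevity.
sample = '''<node id="26546041" lat="30.4794276" lon="-97.8033397"
  version="25" user="claysmalley">
  <tag k="highway" v="traffic_signals"/>
</node>'''

def parse_node(xml_text):
    """Split a node element into a nodes row and its nodes_tags rows."""
    elem = ET.fromstring(xml_text)
    node_id = int(elem.get("id"))
    node_row = (node_id, float(elem.get("lat")), float(elem.get("lon")))
    tag_rows = [(node_id, t.get("k"), t.get("v")) for t in elem.findall("tag")]
    return node_row, tag_rows

node_row, tag_rows = parse_node(sample)
print(node_row)  # (26546041, 30.4794276, -97.8033397)
print(tag_rows)  # [(26546041, 'highway', 'traffic_signals')]
# With a driver connection, something like:
# cur.executemany("INSERT INTO nodes_tags (nodeId, _key, _value) VALUES (%s,%s,%s)", tag_rows)
```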

The first error we hit when inserting data comes when we try to escape the data in the OSM file. It's an unusual one:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0119' in position 16: ordinal not in range(128)


My initial idea was to just wrap this in a try/except block and ignore anything with Unicode characters, since my system was not set up to handle them.

What I wound up doing instead was changing the Python code and the database schema to handle Unicode encodings, so we could use the full data set.
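The exact fix isn't shown above; on the Python side it amounts to handing the driver consistent UTF-8, and on the schema side declaring utf8 character sets on the tables. A sketch of the Python half (the sample string is made up):

```python
def to_utf8(value):
    """Encode text to UTF-8 bytes; pass bytes through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")

# u'\u0119' is the character from the error message above.
print(to_utf8(u"caf\u0119"))  # b'caf\xc4\x99'
```

With MySQLdb specifically, passing charset='utf8' and use_unicode=True to connect() covers much of this at the connection level.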

#### SIGKILL Problems

Another problem I kept hitting: my long-running Python program kept dying with an unexplained SIGKILL. After some research, I found this is a common OOM (out of memory) kill; the Python process runs until it exhausts all the memory on the system and is force-killed by the kernel, rather than taking down the entire system.

After manually adding more swap space to my system (20 GB!), it was able to run to completion.

λ sudo swapoff -a
λ sudo dd if=/dev/zero of=/swapfile bs=1M count=20000
λ sudo mkswap /swapfile; sudo swapon /swapfile

### Final sizes

592M    ./austin_osm_data.sql
136M    ./austin_sample.osm
1.4G    ./austin_texas.osm

+-----------------------------+--------------+
| db_name                     |       total  |
+-----------------------------+--------------+
| austin_osm                  | 811,892,736  |
+-----------------------------+--------------+



## Looking at the data

First, let's take a top-level look at the unique tag names in the system with:

select DISTINCT _key, count(id) FROM nodes_tags GROUP BY _key ORDER BY count(id) DESC LIMIT 25;


| Key Name | Count |
|----------|-------|
| highway | 15616 |
| power | 6505 |
| created_by | 5051 |
| name | 4705 |
| amenity | 3797 |
| coa:place_id | 3730 |
| public_transport | 1745 |
| ele | 1412 |
| source | 1358 |
| gnis:feature_id | 1258 |
| gnis:created | 1033 |
| gnis:state_id | 994 |
| gnis:county_id | 994 |

(The top rows of the original output, the addr:* keys discussed in the exploration above, are not shown here.)