Creating a punch card of MapLesotho Mappers

So today I decided I would dig into the details of the number of changesets committed by each user to OSM in covering Lesotho. I was reading an interesting post over on the development seed blog on a new command line tool they had developed called osm-meta-tool to extract information from the OSM minutely changeset files.

So I promptly went off and downloaded there code from Github (osm-meta-util) to see could I repurpose it to process the daily diff files for Lesotho provided by geofabrik. I have forked the osm-meta-util project and modified it for use with the Lesotho data. You can get hold of the source at github.com/rustyb/osm-meta-util.

I should point out i got blocked from the geofabrik server for too many requests like an idiot, make sure you set a suitable time delay between requests.

Anyway enough about my sillyness. What tools are we going to use today?

Ok so the first thing was to decide for what days to prepare the simple punchcard. GeoFabrik have been creating daily diffs for Lesotho since the 30th of January this year. I want to have all the days we have so far. To do this we can go to our osm-meta-util folder and run the cmd.js example with some switches:

node examples/cmd.js 1 49 10000 | jq -c '{user:.user, changeset: .changeset, version: .version, timestamp: .timestamp}' > lesotho.json

This query will then go and stream all the .osc.gz files starting with 001.osc.gz (30/01/15) and finishing with 049.osc.gz (19/03/15). It will wait 10 second between each file request so we do not overload the server and get blocked as I did.

The contents of a .osc.gz file look like the following. Between create, modify and delete tags are the nodes or ways which have been created, modified or deleted. In some cases as with the one below tag information will be present between the <node></node> tags. I am only interested in getting the username, the timestamp and the changeset number.

<modify>
  <node id="2946016397" version="2" timestamp="2015-03-16T21:56:17Z" uid="1932826" user="DeBigC" changeset="29528496" lat="-30.481495" lon="27.6335444">
    <tag k="ford" v="yes"/>
</node>

What happens when we run the previous command is that each XML file is read all the lines which represent a node or way are then extracted and converted into JSON and piped to jq where the attributes we want are extracted.

We now have a the JSON file lesotho.json which contains all of our JSON objects. However you’ll notice that this is not valid json as our objects need to be contained in a JSON array.

{"user":"bjoernchr74","changeset":"6837525","version":"1","timestamp":"2015-02-14T01:35:09Z"}
{"user":"tshedy","changeset":"28823174","version":"2","timestamp":"2015-02-13T15:10:28Z"}
{"user":"tshedy","changeset":"28823174","version":"2","timestamp":"2015-02-13T15:10:28Z"}
{"user":"tshedy","changeset":"28823174","version":"2","timestamp":"2015-02-13T15:10:28Z"}
{"user":"Moseme","changeset":"8785233","version":"1","timestamp":"2015-02-14T01:35:09Z"}

To fix this we will first use awk to add commas to the end of all our lines.

awk '{print $0","}' lesotho.json > lesotho.json

Next we need to open the file and add a [ to the first line of the file.

Then at the last line we can delete the very last , and then add at ] to close our array. I really must find a more programatic way of doing this. Suggestions welcome!

Pandas Time

Now that we have some valide JSON we can then import this into pandas using df.read_json()

import pandas as pd
import numpy as np

les = pd.read_json('lesotho.json') # import the json file to a dataframe
les.timestamp = pd.to_datetime(les['timestamp']) # convert timestamp to date time index
les.set_index(les.timestamp, inplace=True) # make time the index

You can see above that we have setup a time index a very powerful feature of pandas that will allow us to group all our results by the days of the week as we desire to create punch card for each user. Next we add a new column to the database called day which contains an integer for the weekday 0=Monday and 6=Sunday.

>>> les['day'] = les.index.weekday # add weekday column with integer for day of the week
>>> les.head()
                     changeset           timestamp         user  version  day
timestamp                                                                    
2015-02-14 01:35:09    6837525 2015-02-14 01:35:09  bjoernchr74        1    5
2015-02-13 15:10:28   28823174 2015-02-13 15:10:28       tshedy        2    4
2015-02-13 15:10:28   28823174 2015-02-13 15:10:28       tshedy        2    4
2015-02-13 15:10:28   28823174 2015-02-13 15:10:28       tshedy        2    4
2015-02-14 01:35:09    8785233 2015-02-14 01:35:09       Moseme        1    5

Next we will group our results by the weekday first and then the user, which will then allow us to count the number of unique change sets made by each user. We make use of pandas .groupby() function here:

res = les.groupby(['day', 'user']).changeset.nunique().unstack()

res.ix[:,0:5].head()
user  4rch  AE35  Adrian Frith  Alecs01  Alex24223
day                                               
0      NaN    11           NaN      NaN        NaN
1      NaN     4           NaN      NaN          1
2      NaN     1             1      NaN        NaN
3        9     3           NaN      NaN        NaN
4      NaN     1           NaN      NaN        NaN

We can see now we have a dataframe where the days are on the X-Axis meanwhile the usernames are on the y axis. It would look better and be easier to interpret if we have the names on the x-axis and the days as the columns. .unstack turns our multi-index list returned by .groupby into a table which we can easily interpret.

Luckily this is really easy with pandas we can use the .transpose() to flip our matrix.

res1 = res.transpose()

Also it would look better to have mon-sun instead of 0-6.

res1.columns = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']

Now finally we would like a grand total of changesets for each user. This again is easy we use .sum() and add a new column total. The use of axis=1 is telling sum to use rows rather than columns (axis=0)

res1['total'] = res1.sum(axis=1)

res1.head()
              mon  tue  wed  thu  fri  sat  sun  total
user                                                  
4rch          NaN  NaN  NaN    9  NaN  NaN  NaN      9
AE35           11    4    1    3    1    1    1     22
Adrian Frith  NaN  NaN    1  NaN  NaN    1  NaN      2
Alecs01       NaN  NaN  NaN  NaN  NaN    1  NaN      1
Alex24223     NaN    1  NaN  NaN  NaN  NaN  NaN      1

Now we can do some nice things like quickly find out what are the total number of changesets on each day of the week. As you can see Monday is the day with the most changesets since January.

res1.count().order(ascending=False)

total    207
mon      110
tue       79
sat       60
wed       52
fri       49
sun       46
thu       43
dtype: int64

However if we look at the mean we can see that sunday in fact has a higher average number of changes than the other days.

res1.mean().order(ascending=False)

total    14.497585
sun       8.804348
fri       7.938776
sat       7.133333
tue       6.822785
mon       6.427273
wed       5.961538
thu       5.186047
dtype: float64

Now we will export this as a csv file so that we can visualize the users punchcards for each day of the week. This is another simple action. simply use .to_csv to do so.

res1.reset_index().to_csv('les_data.csv', encoding='utf-8', index=False, na_rep=0)

The csv file containing all of the counts is available from github - LesothoUserChanges.csv

The Lesotho OSM PunchCard

A quick seach on google came up with a simple d3.js script called D3.js Punch Card which will produce a punch card just like those in your github repositories.

View Fullscreen

This still needs to be improved to sort out the total column to show a number all the time rather than another circle.

So there you have it an extremely long winded way of make a punchcard of the OSM users in your country.