Exploring a UK Open Government Dataset with Neo4j

In my first job I was working for a company that developed a management information system for UK Police Forces; this system produced the statutory HMIC (Her Majesty’s Inspectorate of Constabulary) reports and allowed OLAP exploration of the datasets loaded into cubes from the data warehouse tables.

One of the areas that I implemented was the key performance indicators for Road Traffic Collisions, so I was intrigued to discover that the fuller, anonymised STATS19 dataset was now available on data.gov.uk. If you’re interested in the STATS19 form you can see it here.

Any data set exploration task generally involves the following stages:

Obtain the data set
Understand the data format / types / constraints
Understand the lookup data ranges
Identify any obvious quality issues
Load the lookup values
Load the data (this usually throws up referential data quality issues)
Write & run experimental queries :)

Neo4j is a mature native graph database that implements the Property Graph model. Version 2.0 added node labels and constraints that enable more rigorous data modelling, and version 2.1 added the LOAD CSV import capability, making it incredibly easy to get data into Neo4j and explore it using the built-in browser.

The source data is published by the Department for Transport and is available under the Open Government Licence (OGL2) from http://data.gov.uk/dataset/road-accidents-safety-data – for this example I’ve used the “Road Safety – Vehicles 2011” dataset.

This sample is based on the first 250 rows of data, with lookup data extracted as separate CSV files from Road-Accident-Safety-Data-Guide.xls (with extra -1 “Data missing” values added to the Age Band and Propulsion lookup table files).

If you wish to follow along or play with the data, then see this Graph Gist. I originally started the Graph Gist in May 2014 with the intention of writing this blog post; in the intervening period it even got submitted to the GraphGist winter challenge…

Sample model

The following image shows the sample entity relationship model used.

LOAD CSV

Cypher is Neo4j’s declarative graph query language. Its LOAD CSV feature allows you to pull data in from a CSV file and address each row in subsequent clauses, either positionally or by a header alias. The following example is used to load one of the sets of lookup data:

LOAD CSV WITH HEADERS FROM "age_band.csv" AS csvLine CREATE (:AgeBand { name:csvLine.label, code:csvLine.code});
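Because the main load query looks up these nodes by their code, it may be worth declaring uniqueness constraints on the lookup labels before loading anything. The following is a minimal sketch in the Neo4j 2.x syntax, assuming each code appears only once per lookup file; the constraints are my own addition rather than part of the gist, and later Neo4j versions use a different constraint syntax.

```cypher
// Assumption: codes are unique within each lookup file.
// A uniqueness constraint also creates an index on the property,
// which helps the code lookups in the main load query.
CREATE CONSTRAINT ON (a:AgeBand) ASSERT a.code IS UNIQUE;
CREATE CONSTRAINT ON (g:Gender) ASSERT g.code IS UNIQUE;
CREATE CONSTRAINT ON (vt:VehicleType) ASSERT vt.code IS UNIQUE;
CREATE CONSTRAINT ON (p:Propulsion) ASSERT p.code IS UNIQUE;
```

If a lookup file did contain a duplicate code, the load would fail on that row, which is a cheap way of surfacing a data quality issue early.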

Loading a subset of the STATS19 data requires a more involved query to establish relationships to lookup data and prevent duplicate nodes:

LOAD CSV FROM "STATS19_250.csv" AS csvLine
// Find the pre-loaded lookup nodes referenced by this row
MATCH (g:Gender), (a:AgeBand), (vt:VehicleType), (p:Propulsion)
WHERE g.code = csvLine[15]
  AND a.code = csvLine[16]
  AND vt.code = csvLine[3]
  AND p.code = csvLine[18]
// Reuse primary entities if an earlier row already created them
MERGE (mf:Manufacturer { name : RTRIM(csvLine[22]) })
MERGE (m:Model { name : csvLine[23] })
MERGE (mf)-[:MAKES]->(m)
MERGE (i:Incident { ref : csvLine[0] })
// Create one vehicle per row and link it to everything above
CREATE (v:Vehicle { index : csvLine[2], age : csvLine[19] })
CREATE (v)-[:MADE_BY]->(mf)
CREATE (v)-[:IS_A]->(m)
CREATE (v)-[:OF_TYPE]->(vt)
CREATE (v)-[:PROPULSION]->(p)
CREATE (v)-[:INVOLVED_IN]->(i)
CREATE (v)<-[:INVOLVED]-(i)
CREATE (v)-[:DRIVER_AGE]->(a)
CREATE (v)-[:DRIVER_GENDER]->(g);

Whilst you could remove some of the keywords, they have been included for clarity. Note that the pre-loaded lookup data is MATCHed against the CSV values, and the primary entities are MERGEd as they may already exist from a previously loaded record. Lastly, the relationships are CREATEd.

Large datasets

If you’re going to be loading a large amount of data, you can prefix the LOAD CSV statement with USING PERIODIC COMMIT, which commits every 1000 rows by default (you can also specify the commit interval should you require).
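As a rough sketch of what that looks like in the Neo4j 2.x syntax, the statement below reuses the sample file name from earlier with an explicit interval of 500 rows. The interval and the simplified body (incidents only) are my own choices for illustration; for a real load you would substitute the full file and the full load query.

```cypher
// Sketch only: commit after every 500 rows instead of the default 1000.
// The body is cut down to the Incident MERGE to keep the example short.
USING PERIODIC COMMIT 500
LOAD CSV FROM "STATS19_250.csv" AS csvLine
MERGE (i:Incident { ref : csvLine[0] });
```

Note that recent Neo4j releases have replaced USING PERIODIC COMMIT with a different batching mechanism, so check the documentation for the version you are running.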

Querying and visualisation

With the data loaded, you are then free to query it and explore the relationships within the data. Visualising a large dataset is tricky, and you may encounter some limitations of the Neo4j Browser if you try to look at the full dataset, as it is better suited to a constrained result set (the featured image above contained 508 nodes). In that case, you might consider a tool such as Gephi, which has a Neo4j plugin (last updated for Neo4j 2.1.3).

The gist has half a dozen sample queries, and there are many others that could be performed, such as statistics relating to the age of the vehicles involved in accidents. Query 12 in the gist produces a visualisation of a specific incident.
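As an example of the vehicle age statistics mentioned above, a query along the following lines could be a starting point. It is not one of the gist queries; it assumes that the VehicleType lookup nodes carry a name property in the same way as AgeBand, that a vehicle age of -1 means the value is missing, and that Vehicle.age needs converting because LOAD CSV stores values as strings. toInt is the 2.x function name (later versions call it toInteger).

```cypher
// Average vehicle age per vehicle type, ignoring missing ages.
// Assumptions: vt.name exists, and -1 marks a missing vehicle age.
MATCH (v:Vehicle)-[:OF_TYPE]->(vt:VehicleType)
WHERE toInt(v.age) >= 0
RETURN vt.name AS vehicleType,
       count(v) AS vehicles,
       avg(toInt(v.age)) AS averageAge
ORDER BY vehicles DESC;
```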

Summary

Since the improvements in versions 2.0 and 2.1, specifically labels and LOAD CSV, Neo4j has moved on from being a highly capable graph database to being a very valuable tool in the arsenal available to a data scientist. What’s more, the latest version 2.2 release includes many browser enhancements, such as the ability to tailor the graph visualisation (colour, size, property) and to export to PNG.
As we’ve seen, rapidly loading and dissecting a data set is both easy and impressive – well done Neo Technology.
