In my first job I was working for a company that developed a management information system for UK Police Forces; this system produced the statutory HMIC (Her Majesty’s Inspectorate of Constabulary) reports and allowed OLAP exploration of the datasets loaded into cubes from the data warehouse tables.
One of the areas that I implemented was the key performance indicators for Road Traffic Collisions, so I was intrigued to discover that the fuller, anonymised STATS19 dataset was now available on data.gov.uk. If you’re interested in the STATS19 form you can see it here.
Any data set exploration task generally involves the following stages:
- Obtain the data set
- Understand the data format / types / constraints
- Understand the lookup data ranges
- Identify any obvious quality issues
- Load the lookup values
- Load the data (this usually throws up referential data quality issues)
- Write & run experimental queries :)
Neo4j is a mature native graph database that implements the Property Graph model; version 2.0 added type labels and constraints that enable more rigorous data modelling; version 2.1 added the LOAD CSV import capability making it incredibly easy to get data into Neo4j and explore using the built-in browser.
The source data is published by the Department of Transport available under the Open Government OGL2 licence from http://data.gov.uk/dataset/road-accidents-safety-data – for this example I’ve used the “Road Safety – Vehicles 2011″ dataset.
This sample is based on the first 250 rows of data, with lookup data extracted as separate CSV files from Road-Accident-Safety-Data-Guide.xls (plus additional -1 Data missing
values added to the Age Band and Propulsion lookup table files).
If you wish to follow along or play with the data, then see this Graph Gist. I originally started the Graph Gist in May 2014 with the intention of writing this blog post, in the intervening period it even got submitted to the GraphGist winter challenge…
Sample model
The following image shows the sample entity relationship model used.
LOAD CSV
Cypher is Neo4j’s declarative graph query language, the LOAD CSV
feature allows you to pull data in from a CSV file and address it in subsequent clauses positionally or by a header alias. The following example is used to load one of the sets of lookup data:
LOAD CSV WITH HEADERS FROM "age_band.csv" AS csvLine
CREATE (:AgeBand { name:csvLine.label, code:csvLine.code});
Loading a subset of the STATS19 requires a more involved query to establish relationships to lookup data and prevent duplicate nodes:
LOAD CSV FROM "STATS19_250.csv" AS csvLine
MATCH (g:Gender),(a:AgeBand),(vt:VehicleType),(p:Propulsion)
WHERE g.code = csvLine[15]
AND a.code = csvLine[16]
AND vt.code = csvLine[3]
AND p.code = csvLine[18]
MERGE (mf:Manufacturer { name : RTRIM(csvLine[22]) })
MERGE (m:Model { name : csvLine[23] })
MERGE (mf)-[:MAKES]->(m)
MERGE (i:Incident { ref : csvLine[0] })
CREATE (v:Vehicle { index : csvLine[2], age : csvLine[19] } )
CREATE (v)-[:MADE_BY]->(mf)
CREATE (v)-[:IS_A]->(m)
CREATE (v)-[:OF_TYPE]->(vt)
CREATE (v)-[:PROPULSION]->(p)
CREATE (v)-[:INVOLVED_IN]->(i)
CREATE (v)<-[:INVOLVED]-(i)
CREATE (v)-[:DRIVER_AGE]->(a)
CREATE (v)-[:DRIVER_GENDER]->(g);
Whilst you could remove some of the keywords, they have been included for clarity. Note that the pre-loaded lookup data is MATCHed against the CSV properties and the primary entities are MERGEd as they may already exist from a previously loaded record, lastly the relationships are CREATEd.
Large datasets
If you’re going to be loading a large amount of data you can specify USING PERIODIC COMMIT
to commit every 1000 rows (you can also specify the commit interval should you require).
Querying and visualisation
With the data loaded, you are then free to query it and explore the relationships within the data. Visualising a large dataset is tricky and you may encounter some limitations of the Neo4j Browser if you are trying to look at the full dataset as it is more suited for a constrained result set (the featured image above contained 508 nodes). In which case, you might consider a tool such as Gephi, which has a Neo4j plugin (last updated for Neo4j 2.1.3).
The gist has half a dozen sample queries, there are many others that could be performed such as statistics relating to the age of the vehicle involved in accidents etc., the following is the visualisation of a specific incident (query 12 in the gist):
Summary
Since the improvements of version 2.0 and 2.1, specifically labels and LOAD CSV, Neo4j has moved on from being a highly capable graph database to being a very valuable tool in the arsenal available to a data scientist. What’s more, the latest version 2.2 release includes many browser enhancements such as the ability to tailor the graph visualisation (colour, size, property) and you are now able to export to PNG.
As we’ve seen, the ability to rapidly load and dissect a data set is very easy and impressive – well done Neo Technology.
