Scanning CompanyHouse.gov.uk with Neo4j

It is now almost a year since I worked on this very interesting graph analytics project, and even though I wanted to blog about it sooner I never got the chance to. Luckily, I finally did, while at the recent GraphConnect conference, which takes place annually in London at the QE II Centre. Allow me a short parenthesis here: Neo Technology, as always, put a lot of effort into making sure that the conference is a memorable experience for all attendees. Neo4j 3.2.0 was released with a variety of new features and, finally, the company reports that scalability issues are being addressed (or so the product engineers claim; it remains to be tested!). The atmosphere was more than inspiring, and so, between following some interesting talks and having stimulating discussions with other attendees during breaks, I was able to squeeze in some time to share my experience.

So what is it all about?

CompanyHouse is the United Kingdom's registrar of companies. All forms of UK companies are obliged to be incorporated and registered with CompanyHouse and to file specific details, as required by the Companies Act 2006. These details are digitally recorded.

CompanyHouse is a member of the Public Data Group, which was formed in 2011 to increase the amount and improve the quality of publicly released data, with the objective of increasing economic activity. Thus, CompanyHouse data are publicly available through their website and, as of 2016, also through a RESTful API.

Obviously, we are talking about a very rich dataset which can be analysed in a multitude of ways. One of those ways is through graph analytics; but how can one construct a graph of companies using these data?

Constructing Company Graphs

One of the requirements that companies need to satisfy when registering with CompanyHouse is submitting the full list of their board of directors. Since directors can participate in more than one company, it makes sense to link companies based on whether they share the same directors. And there you have it: a graph is born.
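
To make the linking idea concrete, here is a minimal sketch in plain Python. It is not taken from the package: the function name, the input format and the ids are all made up for illustration, and it assumes the boards are already available in memory as sets of director ids.

```python
from collections import defaultdict
from itertools import combinations


def link_companies_by_director(boards):
    """Return {(company_a, company_b): shared_directors} for every pair of
    companies that have at least one director in common.

    `boards` (hypothetical format) maps a company id to the set of
    director ids on its board.
    """
    # Invert the mapping: director id -> companies they sit on.
    companies_of = defaultdict(set)
    for company, directors in boards.items():
        for director in directors:
            companies_of[director].add(company)

    # Every pair of companies sharing a director becomes an edge.
    edges = defaultdict(set)
    for director, companies in companies_of.items():
        for a, b in combinations(sorted(companies), 2):
            edges[(a, b)].add(director)
    return dict(edges)


# Made-up ids, for illustration only.
boards = {
    "C1": {"D1", "D2"},
    "C2": {"D2", "D3"},
    "C3": {"D4"},
}
edges = link_companies_by_director(boards)
```

Here C1 and C2 end up linked through D2, while C3 stays isolated because its only director sits on no other board.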

Our objective was simple:

  • get access to the data;
  • construct a graph out of it; and
  • apply a number of different graph analytics to see whether we could extract additional insights from the graph structure.

At the moment, the full CompanyHouse database contains more than 10.5 million companies (nodes) and more than 11.5 million directors (nodes), linked by close to 20 million edges. Rather than build the whole graph straight away, we began with a small POC project, intended simply to investigate how easy it would be to generate such graphs when focussing on a single company as a starting point, and whether they would be meaningful at all. As one would expect, Neo4j naturally lends itself to this kind of analysis.

Using the CompanyHouse API

To use the CompanyHouse API you will need to generate access credentials. These are generally composed of a username and password, but in this case one only needs a username. Specifically, though this is not listed in the CompanyHouse documentation, requests made using the Python requests library cannot omit the password section but need to replace it with an empty string, as shown below:

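import requests
from requests.auth import HTTPBasicAuth

username = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
password = ''  # must be passed, but left empty
response = requests.get(url=call, auth=HTTPBasicAuth(username, password))  # call: the resource URL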

The package

Constructing a neighbourhood using Breadth First Search.

You can find the package for the POC described in this article on our open git repo here. The package interacts with the CompanyHouse database through its dedicated API, and what it does is very simple. Being a RESTful service, each record (company, director) is assigned a globally unique URI (simply put, a company/director id), and the operations available on each resource are directly mapped to HTTP verbs. Data can, therefore, be easily consumed by issuing simple GET requests on the required resource URI. Thus, the code takes a company id as input and issues such a request with the objective of collecting the list of directors for that company (along with other company attributes). In turn, similar GET requests are issued for each and every one of the returned directors, for which a corresponding list of companies is returned. This process is iterated for as many cycles as the user wants, effectively performing a BFS around the input company (source) and collecting data to create that company's neighbourhood of connections. Specifically, the process goes through the following steps:

  1. Get the information for a company A.
  2. Create a company node for that company.
  3. Get all the officers associated with company A.
  4. For each officer get the list of companies (N) they currently have active roles at.
  5. Create the company's 1-hop neighbourhood by concatenating all Ns collected for each officer in company A, having filtered out duplicates as well as company A itself if found. Let's call this set M.
  6. For every company in M repeat steps 1 to 5.
  7. End data collection if:
    1. Maximum number of hops has been reached.
    2. No more new neighbouring companies can be found.
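
The steps above can be sketched roughly as follows. This is my own condensed illustration and not the package's code: the two lookup helpers (`get_officers`, `get_companies`) are hypothetical stand-ins for the API calls, and here they are backed by small made-up dictionaries so the example runs without network access.

```python
def collect_neighbourhood(source, get_officers, get_companies, max_hops):
    """Breadth-first search around the company `source`.

    get_officers(company_id)  -> iterable of officer ids (hypothetical helper)
    get_companies(officer_id) -> iterable of company ids (hypothetical helper)
    Returns (companies, officers, edges); edges are (officer, company) pairs.
    """
    seen_companies = {source}
    seen_officers = set()
    edges = set()
    frontier = [source]

    for _ in range(max_hops):
        next_frontier = []
        for company in frontier:
            for officer in get_officers(company):
                edges.add((officer, company))
                if officer in seen_officers:
                    continue  # this officer has already been expanded
                seen_officers.add(officer)
                for other in get_companies(officer):
                    edges.add((officer, other))
                    if other not in seen_companies:
                        seen_companies.add(other)
                        next_frontier.append(other)
        if not next_frontier:
            break  # no new neighbouring companies were found
        frontier = next_frontier

    return seen_companies, seen_officers, edges


# Made-up data standing in for the API responses.
officers_of = {"A": ["d1", "d2"], "B": ["d1", "d3"], "C": ["d2"], "D": ["d3"]}
companies_of = {"d1": ["A", "B"], "d2": ["A", "C"], "d3": ["B", "D"]}


def lookup_officers(company_id):
    return officers_of.get(company_id, [])


def lookup_companies(officer_id):
    return companies_of.get(officer_id, [])


one_hop = collect_neighbourhood("A", lookup_officers, lookup_companies, 1)
two_hop = collect_neighbourhood("A", lookup_officers, lookup_companies, 2)
```

With this toy data, one hop from A reaches B and C, and a second hop adds D through officer d3. The two `seen_*` sets are what stop the search from looping back over companies and officers it has already visited.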

Note that duplicates may appear in both the list of directors and the list of connected companies. Each duplicate is effectively a loop in our search. As this is not a large-scale application, I am resolving this issue by maintaining an in-memory list of the traversed companies/directors. However, as I will explain later, this turned out to be inadequate as far as directors are concerned.

The Package Modules

The package contains three modules: module_company.py, module_officer.py and module_neo4j.py, the latter being the most interesting amongst them. The company module is where the company profile is saved, and it also includes the routine for collecting a company's active officers (directors). The officer module is where the officer profile is saved, and it also includes a routine for collecting an officer's active participation in other companies.

The neo4j module is responsible for instantiating nodes and the relationships between them, as well as loading everything into Neo4j. It’s worth noting that the first time this code is run you will need to uncomment the code under CREATE UNIQUENESS CONSTRAINTS, which only needs to be executed once.

# Execute once, on the first run only.
graph.schema.create_uniqueness_constraint("Officer", "id")
graph.schema.create_uniqueness_constraint("Company", "id")

This is a py2neo driver issue that hasn’t been addressed yet and thus needs to be dealt with manually.

Once data are collected, ingesting them into Neo4j is very straightforward. All you need is an active Neo4j session while the code is run. Then, for creating company nodes you can just call:

def create_company_node(company):

    node = Node("Company",
                id=company.id,
                name=company.name,
                type=company.type,
                status=company.status,
                effective_from=company.effective_from[2],
                postal_code=company.postal_code,
                jurisdiction=company.jurisdiction,
                sic_codes=company.sic_codes)

    tx = graph.begin()
    tx.create(node)
    tx.commit()
    return node

Similarly, for officer nodes:

def create_officer_node(officer):

    node = Node("Officer",
                id=officer.id,
                name=officer.name,
                DoB=officer.DoB[1],
                nationality=officer.nationality,
                CoR=officer.CoR)

    tx = graph.begin()
    tx.create(node)
    tx.commit()
    return node

and, finally, for creating a relationship:

def create_relationship(officer_node, company_node, officer, company):

    relation = Relationship(officer_node,
                            officer.roles[company.id],
                            company_node,
                            active=officer.active_roles[company.id])
    tx = graph.begin()
    tx.merge(relation)
    tx.commit()

Note that additional precautions are in place throughout the package to manage rate violation exceptions and missing page results.
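
As a rough idea of what such precautions can look like, here is a small retry sketch. It is not the package's actual implementation: `fetch` is a hypothetical callable returning a (status code, payload) pair, and I am assuming that rate violations are signalled with HTTP 429 and missing pages with HTTP 404. The URL in the usage lines is a placeholder too.

```python
import time


def get_with_retry(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or the attempts run out.

    `fetch` is a hypothetical callable returning (status_code, payload).
    Returns the payload, or None when the resource is missing (404).
    """
    for attempt in range(max_attempts):
        status, payload = fetch(url)
        if status == 200:
            return payload
        if status == 404:
            return None  # missing page: let the caller skip this record
        if status == 429:
            # Assumed rate-limit signal: wait, doubling the delay each time.
            sleep(base_delay * 2 ** attempt)
            continue
        raise RuntimeError("unexpected status %d for %s" % (status, url))
    raise RuntimeError("gave up on %s after %d attempts" % (url, max_attempts))


# Simulated responses: two rate-limit hits, then success.
responses = iter([(429, None), (429, None), (200, {"name": "EXAMPLE LTD"})])
waits = []  # records the delays instead of actually sleeping
result = get_with_retry(lambda url: next(responses), "/company/00000000",
                        sleep=waits.append)
```

Passing `sleep` as a parameter keeps the example quick to run and easy to test. In real use you would leave the default so that the code actually pauses between attempts.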

An example run

Below you can see an example of the Reply Ltd network (5 hops), which can be constructed in under 5 minutes.

The 5-hop network of Reply Ltd.

Gaps to be addressed

One of the main issues that needs to be addressed (although it is already partly addressed in the code) is that of duplicate entries. It seems that a single director with multiple placements might appear in the database under many unique ids while being the same person. In addition, some directors register themselves with what may be minor typos in their names, added hyphens, etc. This makes it even more difficult to link people based on just having the same name.

At the time that this code was implemented (11/08/2016), I chose to instantiate one node per distinct name (i.e. merge entries that have exactly the same name, since in a small network of companies it is highly unlikely that two entities with the same name will concern distinct people) and to resort to manually merging nodes where that is still necessary.
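
One possible way to reduce the manual merging, which I have not implemented in the package, would be to compare names on a normalised key rather than on the raw string. The sketch below is only an illustration: `name_key` is a hypothetical helper, it copes with case, punctuation, hyphens, accents and word order, and it does nothing about genuine typos.

```python
import re
import unicodedata


def name_key(name):
    """Reduce a director's name to a crude comparison key (hypothetical helper)."""
    # Strip accents, e.g. "José" -> "Jose".
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Treat hyphens and other punctuation as spaces, and ignore case.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # Sort the tokens so that "SMITH, John" and "John Smith" agree.
    return " ".join(sorted(text.split()))


# Made-up names, for illustration only.
key_a = name_key("SMITH-JONES, Anne")
key_b = name_key("Anne Smith Jones")
```

Both made-up names above collapse to the same key, so they would be treated as one person. The key is deliberately aggressive, which is only defensible under the same small-network assumption as before. On the full dataset it would merge distinct people.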

That's it...

I really hope you liked this blog post. If you did, clone the code and run it yourselves. Ideally, you can even fork it and commit your own improvements/features. Until next time... ta.