These words couldn’t be stressed enough during the opening keynote by Emil Eifrem (co-founder of the Neo4j project and CEO of Neo Technology) at GraphConnect this April. And in all honesty, I couldn’t agree more. From social networks to the internet, from how diseases spread to fraud detection, from semantic ontologies to genome sequence graphs, from road networks to the neural networks in our very own brains: everywhere we look, there are hidden or explicit relationships between all kinds of entities. Everything is somehow interconnected.
For many companies, identifying these connections and leveraging them to gain additional insights appears to be a real game changer (with Google’s use of the PageRank algorithm being one of the most original examples). In fact, realising this, more and more companies are shifting their perspective and looking at their data from a different angle. This year’s GraphConnect was, at the very least, a testament to this shift.
The conference was structured around Neo Technology’s Graph Database Management System, Neo4j, the new version of which was released the same day. As such, everyone wanted to find out more about the new features, software integrations and upgrades the new version comes with, as well as about use-cases to inspire how one can best utilise its full potential.
Luckily enough, a training day preceded the conference on Monday the 25th at SkillsMatter, allowing conference attendees an early peek into the new version...
Day 1: Training with Neo4j
Four different sessions were offered on the day:
- Neo4j Fundamentals,
- Graph Data Modelling with Neo4j,
- Advanced Neo4j Deployment, and,
- Build a Recommendation Engine with Neo4j.
From a data science perspective, the last one seemed the most interesting - so it's the one I opted for.
As soon as we arrived we were provided with Neo4j v3.0 to use for the training. At first sight, at least as far as the interface is concerned, everything seemed comfortably familiar. After playing a bit with it I discovered that one can now better interact with the visualisations, by locking/unlocking the node position in a graph, removing nodes, and expanding a node’s child relationships, which I found to be quite convenient.
Though I will elaborate on Neo4j’s new features later in this article, a detailed list of all the changes that come with the new version can be found here. As a side note, and as far as details go, do note that importing files is now done through a new folder that comes with the new distribution, reasonably named “import”. Knowing this beforehand will save you a good 15 minutes of vainly trying to load a CSV using the full file path as you used to (trust me).
In any case, the main idea was to use Cypher queries in order to extract useful information in a direct way, so as to make useful recommendations to users of Meetup (an online community building platform). For the purpose of the course, the real Meetup database was provided and loaded into Neo4j, following—what else—a graph database model. Members of groups are interested in topics and attend events, which take place at venues. Groups have different topics.
In this context, recommendations were based on a very intuitive philosophy which, essentially, relies on measuring a node’s participation in a sub-graph of interest. For example, in order to recommend new groups to members one would rely on a score computed in the following way:
- For every topic a member is interested in, count the number of the member’s current groups in which that topic appears. For example, if a member is interested in topic “A” and “A” appears in 12 of the groups the member participates in, the score for that topic is 12.
- Then, find the groups the member is not currently participating in and assign each one a score by summing the scores of the topics that the member is interested in and that are also associated with that group.
All of this can be done with ease, using the intuitive syntax of Cypher queries.
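To make the scoring concrete, here is a minimal sketch of the two steps in plain Python rather than Cypher. It is not the query used in the training: the function name `recommend_groups`, the dictionary-based data layout and the toy groups and topics are all made up for illustration.

```python
from collections import Counter


def recommend_groups(member_topics, member_groups, group_topics):
    """Score groups the member has not joined, following the two steps above.

    member_topics: set of topics the member is interested in
    member_groups: set of groups the member currently belongs to
    group_topics:  dict mapping each group name to its set of topics
    (This layout is an assumption made for the sketch.)
    """
    # Step 1: for each topic of interest, count how many of the member's
    # current groups carry that topic.
    topic_score = Counter()
    for group in member_groups:
        for topic in group_topics.get(group, set()):
            if topic in member_topics:
                topic_score[topic] += 1

    # Step 2: score every group the member is not in by summing the scores
    # of the topics it shares with the member's interests.
    group_score = {}
    for group, topics in group_topics.items():
        if group in member_groups:
            continue
        score = sum(topic_score[t] for t in topics if t in member_topics)
        if score > 0:
            group_score[group] = score

    # Highest score first; ties broken by name for a stable order.
    return sorted(group_score.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for illustration (not the Meetup dataset).
group_topics = {
    "Graph DB London": {"graphs", "databases"},
    "Data Science Meetup": {"graphs", "machine learning"},
    "NoSQL Nights": {"databases"},
    "ML Study Group": {"machine learning", "graphs"},
    "Knitting Circle": {"crafts"},
}
member_topics = {"graphs", "databases", "machine learning"}
member_groups = {"Graph DB London", "Data Science Meetup"}

recommendations = recommend_groups(member_topics, member_groups, group_topics)
print(recommendations)
```

In this toy example "graphs" appears in two of the member's groups, so any unjoined group tagged with "graphs" starts with a score of 2, which is why the machine learning group outranks the databases-only one.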
Though the approach is simplistic, one cannot but appreciate its effectiveness and its efficiency (given the limitations of using a declarative language) in directly providing intuitive choices that a member is very likely to find interesting. Similar approaches were employed later on for other objectives, such as using queries to compute the distances of various venues from a certain location for booking recommendations, or suggesting users one should 'friend', based on whether they have RSVP'd to a relatively large number of the same meetups.
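For the venue case, the idea of ranking venues by distance from a location can be sketched as follows. This is a plain-Python illustration, not what was run in the course (there the distances were computed with Cypher queries). It assumes great-circle distance via the haversine formula, and the helper names, venue names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_venues(origin, venues, limit=3):
    """Return the names of the `limit` venues closest to `origin` (lat, lon)."""
    lat, lon = origin
    ranked = sorted(
        venues,
        key=lambda v: haversine_km(lat, lon, v["lat"], v["lon"]),
    )
    return [v["name"] for v in ranked[:limit]]


# Hypothetical venues with approximate coordinates, for illustration only.
venues = [
    {"name": "Venue C", "lat": 48.8566, "lon": 2.3522},   # roughly Paris
    {"name": "Venue A", "lat": 51.5155, "lon": -0.0922},  # roughly central London
    {"name": "Venue B", "lat": 51.4545, "lon": -2.5879},  # roughly Bristol
]
central_london = (51.5074, -0.1278)

closest = nearest_venues(central_london, venues, limit=2)
print(closest)
```

Sorting by distance and cutting off at `limit` mirrors the "order by, then limit" shape such a recommendation query would typically take.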
For more complicated recommendation approaches, involving supervised or unsupervised learning techniques that require iterative queries (random walks, for example), one traditionally had to resort to the Neo4j REST API, which provides a way to query the database with Cypher. To our surprise, however, this is not so much the case anymore.
Enter the scene: procedures. Procedures are a new feature in Neo4j that allows one to write custom code which can be invoked directly from Cypher. A number of basic procedures come out of the box with the new distribution, allowing, amongst other things, the use of query-like commands for identifying the node labels, property keys, relationship types or active constraints in a database. But these only scratch the surface of what one can do with them.
At the heart of this initiative is Neo Technology’s Michael Hunger, who has already created a library named apoc. It is packed with procedures for loading and exporting data in JSON format, loading data from web APIs, returning virtual nodes that add to the graph’s visual expressiveness, job management, conversion between formatted dates and timestamps, various graph algorithms and much more. As a matter of fact, Michael committed to delivering 100 distinct procedures by the release date of v3, and deliver he did. The potential of this new endeavour is immense, since anyone can contribute to the development of a procedure. If you are interested in doing so you can start here.
A rather interesting procedure I came across while browsing at the end of the training implements the SLM clustering algorithm, which is used to cluster very large networks with millions of nodes and edges. It was implemented by Mark Needham, and you can find it here.
Overall, I left the training session with lots of things to digest and a heightened sense of curiosity, looking forward to the conference day and eager to find out more about, and contribute to, this ‘procedures initiative’, which I am sure will be welcomed by the Neo4j community. All in all, the training event on Monday was very well organised and set the stage for the main event the following day.
I'll review the conference in Part 2 of this blog post as soon as I get the chance…