Seven Databases in Seven Weeks, 2nd Edition

Wrap-Up

Neo4j is a top open source implementation of the (relatively rare) class of graph databases. Graph databases focus on the relationships between data, rather than the commonalities among values. Modeling graph data is simple. You just create nodes and relationships between them and optionally hang key-value pairs from them. Querying is as easy as declaring how to walk the graph from a starting node.
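To make that model concrete, here is a minimal sketch in plain Python rather than Neo4j itself. The names `Node`, `Relationship`, `relate`, and `walk` are our own illustrative inventions, not part of any Neo4j API, and the sample people are hypothetical. It shows nodes, typed relationships, key-value properties on both, and a query expressed as a walk from a starting node.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(eq=False)
class Node:
    """A graph node with optional key-value properties."""
    properties: Dict[str, Any] = field(default_factory=dict)
    outgoing: List["Relationship"] = field(default_factory=list)


@dataclass(eq=False)
class Relationship:
    """A typed, directed link between two nodes, with its own properties."""
    rel_type: str
    start: Node
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


def relate(start: Node, rel_type: str, end: Node, **properties: Any) -> Relationship:
    """Create a relationship and attach it to its start node."""
    rel = Relationship(rel_type, start, end, dict(properties))
    start.outgoing.append(rel)
    return rel


def walk(start: Node, *rel_types: str) -> List[Node]:
    """Follow one relationship type per step, beginning at `start`."""
    frontier = [start]
    for rel_type in rel_types:
        frontier = [
            rel.end
            for node in frontier
            for rel in node.outgoing
            if rel.rel_type == rel_type
        ]
    return frontier


# Hypothetical sample data.
alice = Node({"name": "Alice"})
bob = Node({"name": "Bob"})
carol = Node({"name": "Carol"})
relate(alice, "FRIENDS", bob, since=2010)
relate(bob, "FRIENDS", carol)

# "Friends of friends" is two FRIENDS steps from the starting node.
friends_of_friends = [n.properties["name"] for n in walk(alice, "FRIENDS", "FRIENDS")]
```

The query is nothing more than a description of which relationships to follow, which is the same idea a Cypher pattern expresses declaratively.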

Neo4j’s Strengths

Neo4j is one of the finest examples of open source graph databases. Graph databases are perfect for unstructured data, in many ways even more so than document databases. Not only is Neo4j typeless and schemaless, but it puts no constraints on how data is related. It is, in the best sense, a free-for-all. Currently, Neo4j can support 34.4 billion nodes and 34.4 billion relationships, which is more than enough for most use cases. (Neo4j could hold more than 15 nodes for each of Facebook’s 2.2 billion users in a single graph.)
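The parenthetical claim is easy to check with the two figures quoted above. This is only the arithmetic, with the constants named by us for readability.

```python
# Figures quoted in the text above.
MAX_NODES = 34_400_000_000       # 34.4 billion nodes
FACEBOOK_USERS = 2_200_000_000   # 2.2 billion users

# Nodes available per user if every user shared a single graph.
nodes_per_user = MAX_NODES / FACEBOOK_USERS
print(f"{nodes_per_user:.1f} nodes per user")
```

The result lands between 15 and 16, which matches "more than 15 nodes" per user.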

The Neo4j distributions provide several tools: fast lookups with Lucene, the Cypher query language, and the REST interface. Beyond ease of use, Neo4j is fast. Unlike join operations in relational databases or map-reduce operations in other databases, following a relationship from one node to the next takes constant time. Related data is only one step away, so there is no need to join values in bulk and then filter for the desired results, which is how most of the databases we’ve seen operate. It doesn’t matter how large the graph becomes; moving from node A to node B is always one step if they share a relationship. Finally, the Enterprise edition provides for highly available and high read-traffic sites by way of Neo4j HA.
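The difference is easier to see in a toy comparison. The sketch below is our own illustration in plain Python, not how Neo4j or any relational engine is implemented. It finds a node's neighbors two ways: by scanning a table of edges, and by following references stored with the node. The scan is a worst case; a relational database with a suitable index would do better than a full scan, though its lookups would still depend on table size.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical edge table: (from_id, to_id) rows, as a join table might store them.
EDGES: List[Tuple[int, int]] = [(i, i + 1) for i in range(100_000)]


def neighbors_by_scan(node_id: int) -> List[int]:
    """Join-style lookup: inspect every row, so cost grows with the table."""
    return [dst for src, dst in EDGES if src == node_id]


# Graph-style storage: each node keeps direct references to its neighbors.
ADJACENCY: Dict[int, List[int]] = defaultdict(list)
for src, dst in EDGES:
    ADJACENCY[src].append(dst)


def neighbors_by_adjacency(node_id: int) -> List[int]:
    """Traversal-style lookup: cost depends on this node's degree, not graph size."""
    return ADJACENCY.get(node_id, [])


scan_result = neighbors_by_scan(42)
hop_result = neighbors_by_adjacency(42)
```

Both calls return the same neighbors. Only the second stays equally cheap when the edge list grows from a hundred thousand rows to a hundred million.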

Neo4j’s Weaknesses

Neo4j does have a few shortcomings. We found that its choice of nomenclature (node rather than vertex and relationship rather than edge) added complexity when communicating. Although HA is excellent at replication, it can only replicate a full graph to other servers. It cannot currently shard subgraphs, which still places a limit on graph size (though, to be fair, that limit measures in the tens of billions). Finally, if you are looking for a business-friendly open source license (like MIT), Neo4j may not be for you. Although the Community edition (everything we used in the first two days) is GPL, you’ll probably need to purchase a license if you want to run a production environment using the Enterprise tools (which include HA and backups).

Neo4j on CAP

The term “high availability cluster” should be enough to give away Neo4j’s strategy. Neo4j HA is available and partition tolerant (AP). Each slave will return only what it currently has, which may be out of sync with the master node temporarily. Although you can reduce the update latency by shortening a slave’s pull interval, it’s still technically eventually consistent. This is why Neo4j HA is recommended for read-mostly requirements.
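The following toy simulation shows why pull-based replication is eventually consistent. It is a sketch under our own assumptions, not Neo4j HA's actual protocol: the `Master` and `Slave` classes, the `tick` method, and the integer clock are all hypothetical. How often the slave pulls determines how long its reads can stay stale.

```python
from typing import Dict, Optional


class Master:
    """Accepts writes; the single source of truth in this toy model."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self.data[key] = value


class Slave:
    """Serves reads from a local copy that is refreshed only on a pull."""

    def __init__(self, master: Master, pull_interval: int) -> None:
        self.master = master
        self.pull_interval = pull_interval
        self.data: Dict[str, str] = {}
        self._last_pull = 0

    def tick(self, now: int) -> None:
        """Copy the master's state if the pull interval has elapsed."""
        if now - self._last_pull >= self.pull_interval:
            self.data = dict(self.master.data)
            self._last_pull = now

    def read(self, key: str) -> Optional[str]:
        """Return whatever the slave currently has, stale or not."""
        return self.data.get(key)


master = Master()
slave = Slave(master, pull_interval=10)

master.write("wine", "Riesling")

slave.tick(now=5)            # too early: no pull yet
stale = slave.read("wine")   # the slave has not seen the write

slave.tick(now=10)           # interval elapsed: the slave pulls
fresh = slave.read("wine")   # the slave has now caught up
```

Between the write and the next pull, the slave answers without error but with old data. That is the availability-over-consistency trade described above.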

Parting Thoughts

Neo4j’s simplicity can be off-putting if you’re not used to modeling graph data. It provides a powerful open source API with years of production use, and yet it hasn’t gotten the same traction as other databases in this book. We chalk this up to a lack of knowledge, since graph databases mesh so naturally with how humans tend to conceptualize data. We imagine our families as trees, or our friends as graphs; most of us don’t imagine personal relationships as self-referential datatypes. For certain classes of problems, such as social networks, Neo4j is an obvious choice. But you should give it some serious consideration for non-obvious problems as well. It just may surprise you how powerful and easy it is.
