So far I’ve covered key-value, document and column-family databases. All that’s left now are graph databases. Surprisingly, they’re nothing like what I’ve covered so far. They’re used in completely different scenarios to other NoSQL stores and are structured like no other database out there.
When thinking about graph databases you can almost forget everything you know about other NoSQL stores. You can forget about tables, columns, rows and aggregates. Instead you need to know about entities (aka nodes), edges (aka relationships) and properties. Common graph databases include Neo4J, Infinite Graph, OrientDB and FlockDB. In this article I am going to focus on the most popular of the bunch, Neo4J.
Graph databases are commonly used in social applications. Their main goal is to map huge amounts of connected data and to find patterns in it. For example, this could be to find people you may know based on who your friends know, or to find bands you might like based on what your friends like.
Graph database glossary
- Entities – Think of these like objects in your favourite OO language or rows in a table. Entities can contain properties like name or description.
- Edges – These are relationships between two entities. Edges can be bi-directional, or have directional significance. Edges can also contain properties. Edge properties are extremely useful for creating intelligent queries.
- Properties – These are key-value pairs, much like fields on an object. They are mainly used to give us extra data about an entity or about the relationship between two entities.
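To make the glossary a bit more concrete, here is a minimal in-memory sketch of the three concepts in Python. It is not how Neo4J or any other graph database is implemented; the `Entity`, `Edge` and `Graph` classes and all the sample data are names I have made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Entity:
    """A node in the graph; properties hold arbitrary key-value data."""
    entity_id: int
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed relationship from one entity to another."""
    source: int
    target: int
    edge_type: str
    properties: dict[str, Any] = field(default_factory=dict)


class Graph:
    """Hypothetical toy graph: a dict of entities plus a list of edges."""

    def __init__(self) -> None:
        self.entities: dict[int, Entity] = {}
        self.edges: list[Edge] = []

    def add_entity(self, entity_id: int, **properties: Any) -> Entity:
        entity = Entity(entity_id, properties)
        self.entities[entity_id] = entity
        return entity

    def add_edge(self, source: int, target: int, edge_type: str,
                 **properties: Any) -> Edge:
        # Both endpoints must already exist before they can be joined.
        if source not in self.entities or target not in self.entities:
            raise KeyError("both entities must exist before adding an edge")
        edge = Edge(source, target, edge_type, properties)
        self.edges.append(edge)
        return edge

    def outgoing(self, source: int, edge_type: str) -> list[Entity]:
        """Entities reached from `source` by following edges of one type."""
        return [self.entities[e.target] for e in self.edges
                if e.source == source and e.edge_type == edge_type]


graph = Graph()
graph.add_entity(1, name="Anna")
graph.add_entity(2, name="Steven")
graph.add_entity(3, name="Graph Databases", kind="book")
graph.add_edge(1, 2, "FRIEND", since=2012)  # an edge carrying a property
graph.add_edge(1, 3, "LIKES")               # one direction only

friends_of_anna = [e.properties["name"] for e in graph.outgoing(1, "FRIEND")]
print(friends_of_anna)
```

Because the `FRIEND` edge is only stored from Anna to Steven, asking for Steven's outgoing friends returns nothing. A real graph database indexes its edges rather than scanning a list, but the shape of the data is the same.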
Differences from relational databases.
- There are no such things as tables, columns or rows. Instead we have a database which contains a large number of individual entities, and these entities are joined together using edges.
Differences from other NoSQL databases.
- Most graph databases are not built to run on a cluster. Neo4J does support running on a cluster and, much like document databases, it uses master-slave replication.
- Many graph databases, Neo4J included, fully support ACID transactions.
Differences from relational databases.
|Relational database|Graph database|
|---|---|
|Database instance or MySQL instance|n/a|
|Row|Entities that contain properties|
|Join between tables|Edges between entities|
- It is common to assign a property named “since” to an edge. This is useful to state when the relationship was created.
- Relationships have directional significance. If you want to join two users together under a friend relationship, the relationship has to be created in both directions (or queried without regard to direction, which Neo4J allows). An example of a single-direction relationship would be: a user likes a comment, but the comment doesn’t like the user.
- Extremely complicated queries can be run against your entities, e.g. get all users who own a CD, get all users who have a friend called Steven, get all books which have over 500 likes, or get all users who are friends with at least 3 of my friends. You can query for almost anything, and quickly. Running these types of queries is known as traversing the graph.
- Edges between entities are persisted to the graph. This means that writes are slightly slower, but it is what allows for the lightning fast reads.
- Entities cannot be deleted when an edge between them still exists.
- Most graph databases support a language called Gremlin for traversing graphs.
- Often there are multiple paths between two entities. It’s possible to select all paths or just the shortest path. It’s also possible to use different algorithms to find the shortest path.
- As with other NoSQL databases, sharding is possible and is a good technique for write scaling. Be aware that graph databases are NOT aggregate oriented, so sharding can cause issues when a traversal has to cross shards. A common technique is to place entities on servers based on the geographical location of the data, e.g. the customer.
- Any write operation performed without first starting a transaction will cause an exception to be thrown.
- High availability is achieved by providing replicated slaves. Data can be written to a slave, but it is then immediately written to the master node as well. It is then replicated to the other slaves under the eventual consistency model.
- Apache ZooKeeper is used to keep track of transactions, and this data is then used to mirror transactions across nodes.
- When the master node goes down, Neo4J automatically elects a new master.
- Neo4J has its own query language called Cypher for traversing graphs. Cypher also supports ordering (ORDER BY), aggregation, SKIP and LIMIT.
- Indexes can be created against properties of an entity or edge. These indexes can then be added to incrementally or as part of a batch process. Neo4J uses Lucene as its indexing service.
- Neo4J can run entirely in memory, and there are also graph databases built specifically to run in memory, like imGraph. This is a great way to optimise performance even further, as long as you have enough memory.
- In a relational database, these graph traversal type queries could take huge amounts of time to run, depending on how many table joins are required. In a graph database they can be near-instantaneous.
- Adding relationships in a relational database can involve schema changes and slow data movement, which is far from ideal. In a graph database, adding a relationship is as simple as adding an edge between two entities.
- Creating a graph is as simple as creating two entities and a relationship between them.
- ACID transactions are fully supported.
- Social networking – Graph databases can dive numerous levels deep into relationships, so suggesting friends to people based on who their friends know is easily doable.
- Recommendation engines – Let’s take Spotify as an example. It recommends bands to its users that they might like. This is done by finding bands which people with similar listening habits also like. Another common example is suggesting items to buy on an e-commerce site.
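To give a feel for the kind of traversal behind friend suggestions and shortest paths, here is a rough Python sketch over a plain adjacency dictionary. It is not Cypher, Gremlin or anything Neo4J ships; the `FRIENDS` data and the `suggest_friends` and `shortest_path` functions are hypothetical names used only for illustration, and breadth-first search is just one of several algorithms that could be used for the shortest path.

```python
from collections import Counter, deque

# Hypothetical friendship data; each friendship is stored in both directions.
FRIENDS = {
    "anna": {"ben", "cara"},
    "ben": {"anna", "dave", "erin"},
    "cara": {"anna", "dave"},
    "dave": {"ben", "cara", "fred"},
    "erin": {"ben"},
    "fred": {"dave"},
}


def suggest_friends(graph, user):
    """Rank friends-of-friends by the number of mutual friends with `user`."""
    mutual = Counter()
    for friend in graph[user]:
        for candidate in graph[friend]:
            # Skip the user themselves and people they already know.
            if candidate != user and candidate not in graph[user]:
                mutual[candidate] += 1
    # Most mutual friends first; ties broken alphabetically for stable output.
    return sorted(mutual.items(), key=lambda item: (-item[1], item[0]))


def shortest_path(graph, start, goal):
    """Breadth-first search: one shortest path, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorting the neighbours only makes the chosen path deterministic.
        for neighbour in sorted(graph[path[-1]]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


suggestions = suggest_friends(FRIENDS, "anna")
path = shortest_path(FRIENDS, "anna", "fred")
print(suggestions)  # [('dave', 2), ('erin', 1)]
print(path)         # ['anna', 'ben', 'dave', 'fred']
```

Dave is suggested first because he shares two friends with Anna. In a graph database you would express the same idea as a query and let the engine follow the persisted edges, rather than looping over dictionaries yourself.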