NoSQL Essentials – Part 3 – Document databases

Document Databases

Document databases are arguably the most talked-about type of NoSQL database at the time of writing. They’re easy to understand and really easy to work with. They offer (nearly) all the benefits of a key-value database, plus a lot more. Some of the most popular document databases currently include MongoDB, CouchDB, Terrastore, OrientDB and RavenDB. In this article I am going to focus on MongoDB.

Some people love Mongo, some people hate it. I’ve only played with it, but given that it’s the most widely used NoSQL database and the most valued by employers, it’s certainly worth knowing a little about.

Differences from relational databases

Relational databases contain tables, which in turn contain rows of data. Document databases contain collections, and collections contain documents. Documents are stored as JSON, or technically BSON (Binary JSON).
Document databases are aggregate oriented, like key-value databases. This means that a document might contain a user record with everything related to that user inside the single document, e.g. the user's recent history, recent transactions and list of friends. In a relational database these additional entities would be stored in separate tables and linked via foreign keys.
Document databases are essentially key-value stores in which the value part is fully examinable during reads. If you only wanted to find users who have a first name of Sammi, this would be possible. In a plain key-value store it is not, because the value is opaque to the database and can only be fetched by its key.
In relational databases, each table has a schema which its rows must follow. In a MongoDB collection, documents do not have to follow a schema of any kind; each one can store whatever it wants. New fields can be added at any time without any changes to the collection being required. This has the benefit of not having to store endless null values.
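To make the points above a bit more concrete, here is a minimal sketch in plain Python that models a collection as a list of dictionaries. It is not MongoDB's API: the `find` helper, the field names and the sample users are all made up for illustration. It only shows the idea of documents with different shapes living side by side and being queried by their contents.

```python
# A "collection" modelled as a plain list of dict "documents".
# Illustrative only: names and data are hypothetical, and this is not MongoDB's API.
users = [
    {
        "_id": 1,
        "first_name": "Sammi",
        # Related entities are embedded in the aggregate instead of joined in.
        "friends": ["Alex", "Jo"],
        "recent_transactions": [{"item": "book", "amount": 12.5}],
    },
    # A document with a different shape: no friends, but an extra field.
    {"_id": 2, "first_name": "Alex", "nickname": "Lex"},
    {"_id": 3, "first_name": "Sammi"},
]


def find(collection, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


sammis = find(users, first_name="Sammi")
print([doc["_id"] for doc in sammis])  # prints [1, 3]
```

Note that the documents without a `nickname` or `friends` field simply omit it, so nothing has to be stored as null.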

Naming conventions between MySQL and MongoDB

MySQL	MongoDB
Database Instance or MySQL instance	MongoDB instance
Database	Database
Table	Collection
Row	Document
Join	DBRef

Useful tips

One of the main goals of a document database is to improve availability by replicating data across nodes.
MongoDB has a special field called _id which is present in every document and represents the document ID. If you do not supply one yourself, the ID is generated automatically.
Each MongoDB instance can have multiple databases and each database can have multiple collections.
When a document is created it must be stored in a specific database and collection.
Your application does not need to keep track of which nodes are up and which are down, including the primary node. MongoDB can automatically elect a new primary if the old primary goes down. You can also set priorities on nodes, allowing the nodes with the highest priority to take over as primary. This allows you to set your most beefy machines as primary.
When a failed primary node comes back online it automatically rejoins the cluster as a secondary, and any data it missed is then replicated to it.
When a new node is added to a cluster, data will start syncing to it straight away. Once the sync is complete, the node is ready to serve read requests. This process does not require a restart of any nodes or result in any downtime.
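The priority-based failover described above can be sketched as a toy model. This is an assumption-laden simplification and not MongoDB's real election protocol, which takes more than priority into account; the `Node` class, the `elect_primary` function and the machine names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    priority: int
    is_up: bool = True


def elect_primary(nodes):
    """Pick the highest-priority node that is up, or None if none qualifies.

    Simplified model: a node with priority 0 is never eligible to be primary.
    """
    candidates = [n for n in nodes if n.is_up and n.priority > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda n: n.priority)


cluster = [Node("big-box", 10), Node("medium-box", 5), Node("small-box", 1)]
print(elect_primary(cluster).name)  # prints big-box

cluster[0].is_up = False            # the primary goes down
print(elect_primary(cluster).name)  # prints medium-box
```

The application code never names a primary itself; it just asks the cluster, which is the point being made in the tips above.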

Advantages

MongoDB conveniently provides a comprehensive solution to replication and read and write quorums.
Sharding of documents between nodes is supported by many document databases, including MongoDB, and is the number one option for scaling. You can think of sharding like partitioning in relational databases. If new shards are added over time, the data will automatically move between shards to rebalance the load. One popular approach is to have different shards for different geographical locations.
The write quorum takes a convention-over-configuration approach: you can set the quorum for a particular database and then override this value for a specific collection or for just one specific write. You can also use keywords like “majority” to make sure more than half of the nodes receive the write before it is considered successful. Inevitably, the more nodes that must acknowledge a write, the more performance will suffer. This is the trade-off between consistency and latency. MongoDB calls this setting the WriteConcern.
In MongoDB you can set whether it’s acceptable to read from a secondary node by setting the slaveOk parameter (newer drivers express the same choice as a read preference). This can be set on the MongoDB instance, database or collection, or on an individual operation. This allows you to read only from the primary where the data has to be fully up to date, and to read from secondaries where slightly stale data is acceptable.
Materialised views can be created for document queries that are expensive to run. These are similar to MySQL views, except that the results are precomputed and stored.
MongoDB queries are often far simpler to write than traditional relational database queries. This is because multiple entities are embedded inside a single document, so fewer joins are needed.
MongoDB clients are widely supported across all popular programming languages, including PHP, JS, Python, Ruby, Java, C
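As a rough illustration of the write quorum idea from the list above, the sketch below models the override chain (database, then collection, then single write) and the “majority” keyword in plain Python. The function names are hypothetical and this is not how a MongoDB driver is actually called; it only mirrors the arithmetic.

```python
def effective_write_concern(database_default, collection_override=None, write_override=None):
    """Most specific setting wins: single write, then collection, then database."""
    for setting in (write_override, collection_override):
        if setting is not None:
            return setting
    return database_default


def acks_required(write_concern, node_count):
    """Translate a write concern into the number of nodes that must confirm a write."""
    if write_concern == "majority":
        return node_count // 2 + 1  # strictly more than half of the nodes
    return int(write_concern)


def write_succeeded(write_concern, node_count, acks_received):
    """A write counts as successful once enough nodes have acknowledged it."""
    return acks_received >= acks_required(write_concern, node_count)


# Database default is 1 node, but this collection asks for a majority.
concern = effective_write_concern(1, collection_override="majority")
print(concern, acks_required(concern, 5))  # prints majority 3
print(write_succeeded(concern, 5, 2))      # prints False

# One unimportant write overrides the collection setting and needs a single ack.
relaxed = effective_write_concern(1, "majority", write_override=1)
print(write_succeeded(relaxed, 5, 1))      # prints True
```

With five nodes, a majority write waits for three acknowledgements while the relaxed write waits for one, which is the consistency versus latency trade-off in miniature.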

Disadvantages

Just like key-value stores, most document databases do not support ACID transactions that span multiple documents; in MongoDB, a write to a single document is atomic. One exception is RavenDB, a document database which supports ACID transactions out of the box. I’ve not tried it yet, but when I do I’ll post my findings.

Use Cases

Real-time stats – Document internals can be updated at any time and materialised views can also be created and updated at any time. This gives great flexibility in keeping real-time stats up to date.
Gamification – A gamification system will allow rewards to be created from multiple implicit actions, which fits well with the schema-less model.
Event logging – Throughout an application you may have millions of different events, each with their own special values. It would be crazy to create a table in a relational database with hundreds of columns to cater for all these scenarios, as most columns would end up as null. In the past this data might have been stored in a single metadata column as JSON or something similar. This does the job for writing, but it makes it impossible to execute read queries based on this data, at least without using extremely slow regex-based searching. Schema-less documents to the rescue.
Content Management – Data inside content management systems is usually easy to split into aggregates, because there is a limited number of relationships. This allows you to easily reap the benefits of a document database.
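The event logging and real-time stats cases can be sketched together in a few lines of plain Python. Again this is only an in-memory model under my own assumptions: the event fields and the `events_by_type` stats document are invented for illustration, and nothing here talks to a real database.

```python
from collections import Counter

# Hypothetical events: each carries only the fields that make sense for it.
events = [
    {"type": "login", "user": "sammi", "ip": "10.0.0.1"},
    {"type": "purchase", "user": "sammi", "item": "book", "amount": 12.5},
    {"type": "login", "user": "alex", "ip": "10.0.0.2"},
    {"type": "level_up", "user": "alex", "level": 3},
]

# Query on a field that only some events have; no regex over a JSON blob needed.
purchases_over_ten = [e for e in events if e.get("amount", 0) > 10]
print(purchases_over_ten)

# Real-time stats: a single counter document updated as each event arrives.
stats = {"_id": "events_by_type", "counts": Counter()}
for event in events:
    stats["counts"][event["type"]] += 1
print(dict(stats["counts"]))
```

Because each event is its own document, adding a brand new event type later needs no schema change and leaves no null columns behind.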


Simon Jakowicz
