NoSQL Essentials – Part 6 – Migrations, Polyglot persistence and more

This is the final article in my NoSQL concepts series before I start to focus on a specific database, probably MongoDB. I’ve covered all of the popular NoSQL database types (key-value, document, column-family and graph) and given an in-depth overview of NoSQL as a whole. To round everything off, I’m writing this article to give some final tips and tricks.

Throughout this series I have been writing about what I have learnt from reading the NoSQL Distilled book by Martin Fowler and Pramod J. Sadalage. Here I am going to cover the final few chapters: handling schema migration, using polyglot persistence, additional storage engines and how to choose the right database.

Handling schema migration

There are tonnes of tools out there for handling database migrations on relational databases, but there isn’t much on the NoSQL front. Luckily, building a database migration system is pretty simple. All you need to do is create a baseline snapshot of the database, then write migration scripts that update the database to the latest schema. Migration scripts should be named incrementally or by timestamp, so it is easy to determine the order in which updates should happen.
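As a rough illustration of the idea above, here is a minimal sketch of a migration runner. It assumes the "database" is just a Python dict of documents, and the migration names and functions (add_country, add_newsletter_flag) are made up for the example; a real system would run against your actual data store and persist the list of applied migrations somewhere.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: the "database" is a plain dict of documents keyed by id.
Database = Dict[str, dict]
Migration = Tuple[str, Callable[[Database], None]]


def add_country(db: Database) -> None:
    """Hypothetical migration: give every customer a default country."""
    for doc in db.values():
        doc.setdefault("country", "unknown")


def add_newsletter_flag(db: Database) -> None:
    """Hypothetical migration: opt every customer out of the newsletter."""
    for doc in db.values():
        doc.setdefault("newsletter", False)


# Names carry a sortable numeric prefix, so the order is unambiguous
# even if the list itself is not kept in order.
MIGRATIONS: List[Migration] = [
    ("002_add_newsletter_flag", add_newsletter_flag),
    ("001_add_country", add_country),
]


def migrate(db: Database, applied: List[str]) -> List[str]:
    """Run every migration that has not been applied yet, in name order.

    Returns the updated list of applied migration names so the caller
    can store it alongside the data.
    """
    done = list(applied)
    for name, step in sorted(MIGRATIONS, key=lambda m: m[0]):
        if name not in done:
            step(db)
            done.append(name)
    return done


baseline = {"c1": {"name": "Ann"}}
history = migrate(baseline, applied=[])
```

Running migrate a second time with the returned history does nothing, which is what makes it safe to run on every deployment.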

By their nature, NoSQL databases do not have an explicit schema. However, they do have an implicit schema, which is implied by the application itself. This means that when making changes to a schema, there is far less resistance to change than you would normally expect from a relational database. This doesn’t mean we don’t need to think about the database design at all. We still need to think about how we store aggregates, what we store in column families, the relationships we need between entities (graph databases), the indexes to create, and more. The good thing is that all of these are relatively easy to change.

Migration Examples

Adding a new property to a key-value or document store – This is as simple as adding the new data to an aggregate in your application before saving, then editing your application to handle the new data. No database changes are required.

Renaming a property, e.g. from fname to fullname – In a relational database, this would involve renaming a column in a table, then editing the application to cater for the change. In a NoSQL database you have to add the new property to every record in the database and remove all the old properties. Depending on the size of your database, this can be very costly. Instead of doing this in one big hit, there is another approach called incremental migration. This involves saving the new property (fullname) whenever an aggregate is written to, which also means that you need to edit your application to temporarily handle both fname and fullname. Once a large percentage of aggregates have been updated, you could update the remainder as part of a batch process, before removing the need for fname in the application logic.
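The incremental migration described above could look something like the following sketch. It assumes documents are plain dicts held in an in-memory store, and the function names (read_fullname, save_customer, migrate_remaining) are my own for illustration rather than part of any database API.

```python
def read_fullname(doc: dict) -> str:
    """Return the name whether or not the document has been migrated."""
    if "fullname" in doc:
        return doc["fullname"]
    return doc["fname"]


def save_customer(store: dict, key: str, doc: dict) -> None:
    """Write the document in the new shape, migrating it on the way."""
    migrated = dict(doc)
    if "fname" in migrated:
        old_value = migrated.pop("fname")
        # Keep an existing fullname if one is already there.
        migrated.setdefault("fullname", old_value)
    store[key] = migrated


def migrate_remaining(store: dict) -> int:
    """Batch step: rewrite every document still using the old property."""
    moved = 0
    for key, doc in list(store.items()):
        if "fname" in doc:
            save_customer(store, key, doc)
            moved += 1
    return moved


store = {"c1": {"fname": "Ann Lee"}, "c2": {"fname": "Bob Ray"}}

# A normal application write migrates c1 as a side effect.
save_customer(store, "c1", store["c1"])

# c2 has not been written to yet, but reads still work.
name_before_batch = read_fullname(store["c2"])

# Later, the batch process picks up whatever is left.
moved = migrate_remaining(store)
```

Once migrate_remaining reports nothing left to move, the fname branch in read_fullname can be deleted.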

Migrating graph databases – Let’s say you want to change the type of an edge between entities. The most common approach is to replicate the edges in the database under a new edge type. Then, once you are comfortable with the change and the application has been updated, the old edges can be dropped. If you want to update the properties on an entity, you would have to loop through each entity and manually update the property.
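To make the edge migration concrete, here is a small sketch that models a graph as a list of edges in memory. The Edge type and the edge types KNOWS and FRIEND_OF are invented for the example; in a real graph database you would do the equivalent with its own query language or client.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    source: str
    kind: str
    target: str


def copy_edges(edges: List[Edge], old_kind: str, new_kind: str) -> List[Edge]:
    """Return the edges plus a copy of every old_kind edge under new_kind.

    Copies that already exist are skipped, so the step can be re-run.
    """
    existing = set(edges)
    copies = [
        Edge(e.source, new_kind, e.target)
        for e in edges
        if e.kind == old_kind
    ]
    return list(edges) + [c for c in copies if c not in existing]


def drop_edges(edges: List[Edge], kind: str) -> List[Edge]:
    """Remove every edge of the given type."""
    return [e for e in edges if e.kind != kind]


graph = [Edge("ann", "KNOWS", "bob"), Edge("ann", "WORKS_AT", "acme")]

# Step 1: both edge types exist while the application is being updated.
both = copy_edges(graph, "KNOWS", "FRIEND_OF")

# Step 2: once nothing reads KNOWS any more, drop the old edges.
after = drop_edges(both, "KNOWS")
```

Keeping the two steps separate is the point: the application can be switched over and tested while both edge types are still present.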

Changing aggregate structure – Let’s say you want to split a customer aggregate into a customer aggregate and multiple order aggregates. First, if you have a need for this, you may want to consider using a relational database instead. If you want to continue using a NoSQL database, you will have to manually read each aggregate you want to split and then save the parts as separate aggregates. To prevent downtime, it’s a good idea to first edit your application to handle either form of aggregate and, if the split aggregates are found, always write to those to prevent conflicts later on.
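Here is one way the aggregate split might be sketched, again using a plain dict as a stand-in for a key-value store. The key format ("customer:1", "order:1:a") and the order_ids property are assumptions I have made for the example, not a convention from any particular database.

```python
from typing import Dict, List

Store = Dict[str, dict]


def order_key(customer_id: str, order_id: str) -> str:
    return f"order:{customer_id}:{order_id}"


def split_customer(store: Store, customer_id: str) -> None:
    """Move embedded orders out of a customer into their own aggregates."""
    key = f"customer:{customer_id}"
    customer = dict(store[key])
    orders = customer.pop("orders", None)
    if orders is None:
        return  # already split, nothing to do
    for order in orders:
        store[order_key(customer_id, order["id"])] = dict(order)
    # The customer now only references its orders by id.
    customer["order_ids"] = [order["id"] for order in orders]
    store[key] = customer


def load_orders(store: Store, customer_id: str) -> List[dict]:
    """Read orders from either shape, preferring the split aggregates."""
    customer = store[f"customer:{customer_id}"]
    if "order_ids" in customer:
        return [store[order_key(customer_id, oid)] for oid in customer["order_ids"]]
    return customer.get("orders", [])


store = {
    "customer:1": {
        "name": "Ann",
        "orders": [{"id": "a", "total": 10}, {"id": "b", "total": 5}],
    }
}

before = load_orders(store, "1")
split_customer(store, "1")
after = load_orders(store, "1")
```

Because load_orders copes with both shapes, the application can be deployed first and the aggregates split gradually afterwards.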

Polyglot persistence

Polyglot means being able to understand many languages; in this case it refers to understanding and using multiple databases in a single application. Up until now, you have possibly only ever used relational databases. Maybe you have used key-value stores too, but the rest are all still quite new technologies. Learning when to use each database is a core skill of any DBA.

Sticking to relational databases alone no longer needs to be the case. Although I haven’t explicitly covered any specific NoSQL products, hopefully this series has given you a much better idea of when to use the different types of NoSQL database, and when to combine several of them. For further details on each database type, you can read the articles I wrote on key-value stores, document databases, column-family databases and graph databases.

Using services as adapter applications for each of your data stores is a great way of creating a friendly, universal interface for accessing each of your data stores across multiple applications. Many data stores, including Riak and Neo4j, provide REST APIs out of the box, so in these cases the responsibility of these services would simply be to make RESTful API calls.
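The adapter idea could be sketched roughly like this. Everything here is hypothetical: CustomerStore is an interface I have invented for the example, and InMemoryCustomerStore stands in for an adapter that would really make REST calls to something like Riak or Neo4j.

```python
from typing import Dict, Optional, Protocol


class CustomerStore(Protocol):
    """The interface the rest of the application depends on."""

    def get(self, customer_id: str) -> Optional[dict]: ...

    def put(self, customer_id: str, customer: dict) -> None: ...


class InMemoryCustomerStore:
    """Stand-in adapter; a real one would call the data store's API."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def get(self, customer_id: str) -> Optional[dict]:
        return self._data.get(customer_id)

    def put(self, customer_id: str, customer: dict) -> None:
        self._data[customer_id] = dict(customer)


class CustomerService:
    """Application code only ever talks to the CustomerStore interface."""

    def __init__(self, store: CustomerStore) -> None:
        self._store = store

    def rename(self, customer_id: str, fullname: str) -> bool:
        customer = self._store.get(customer_id)
        if customer is None:
            return False
        self._store.put(customer_id, {**customer, "fullname": fullname})
        return True


store = InMemoryCustomerStore()
store.put("c1", {"fullname": "Ann Lee"})
service = CustomerService(store)
renamed = service.rename("c1", "Ann Ray")
missing = service.rename("c2", "Nobody")
```

Swapping the database then means writing another class with the same two methods, while CustomerService stays untouched.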

When it’s not possible to change a data storage engine because too many things depend on it, we can make use of an indexing engine like Solr.

Although selecting the correct tool for the job is always advised, you still want to limit the number of databases used in an application. Making sure all of your environments are correctly configured for each database type can become problematic; this is known as deployment complexity. Fortunately, there are tools like Vagrant and Docker for creating virtual machines and containers to host your environments on, making deployment complexity less of an issue.

Additional data stores

File system – Chances are you already use the file system as a data store. If your application allows users to upload images, audio, video content etc., using the file system for storage is your best choice. There are multiple distributed file systems available, including the Hadoop Distributed File System (HDFS), GlusterFS and Google’s proprietary Google File System (GFS).
Event sourcing – This is the process of persisting events rather than persisting the current state of the application. Events can then be processed to update the application state. The benefit of this is that at any time we could rebuild the application state using the event log. In time, it can become increasingly slow to rebuild a database entirely from events; this can be optimised by storing snapshots of the database and just replaying the latest events. Event sourcing is also useful as a kind of version control if we need to roll back to a previous state.
Memory – This is essentially what many key-value stores do. The obvious limitation here is the amount of data you can store in memory. Creating rollback systems for in-memory storage can also be tricky.
Object databases – These have been around for a long time, but have never been very successful due to the tight coupling they have with a specific application. They are also difficult to run migrations on.
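To illustrate the event sourcing entry in the list above, here is a minimal sketch of replaying an event log, with and without a snapshot. The event format (a kind, an account and an amount) and the bank-balance example are my own assumptions, chosen just to keep the code short.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, str, int]  # (kind, account, amount)
State = Dict[str, int]        # account -> balance


def apply_event(state: State, event: Event) -> State:
    """Return a new state with one event applied."""
    kind, account, amount = event
    new_state = dict(state)
    balance = new_state.get(account, 0)
    if kind == "deposit":
        new_state[account] = balance + amount
    elif kind == "withdraw":
        new_state[account] = balance - amount
    else:
        raise ValueError(f"unknown event kind: {kind}")
    return new_state


def replay(
    events: List[Event],
    snapshot: Optional[State] = None,
    snapshot_index: int = 0,
) -> State:
    """Rebuild state from the log, optionally starting from a snapshot.

    snapshot_index is the number of events already folded into the
    snapshot, so only the events after it are replayed.
    """
    state = dict(snapshot or {})
    for event in events[snapshot_index:]:
        state = apply_event(state, event)
    return state


log = [
    ("deposit", "ann", 100),
    ("withdraw", "ann", 30),
    ("deposit", "bob", 50),
]

full = replay(log)                  # rebuild from the very first event
snapshot = replay(log[:2])          # state saved after two events
fast = replay(log, snapshot, 2)     # replay only what came after
earlier = replay(log[:1])           # "roll back" by replaying fewer events
```

Both routes end in the same state; the snapshot just means fewer events have to be processed to get there.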

Final Considerations

Relational databases have been around for a long time, so there is a huge community behind them and a huge number of tools to help out. NoSQL technologies are still fairly new, so they lack this community and tooling. This means you could spend a lot of time on trial and error. If you’re under tight deadlines, are you going to have time for the research?

In relational databases we generally create users which have permissions to perform certain actions. In many NoSQL databases, security has traditionally not been handled by the database itself; instead it is the responsibility of the client accessing the database.

At the moment, it’s still very difficult to say which database will be best when starting an application. Your best bet is to short-list the most suitable databases for the application, then create the best prototypes you can in the time you have been allocated. Try to perform complicated actions, so you get a feel for the databases beyond the happy path. Then, together with the team you are working with, make a decision using your best judgement.

Load test your database choices. They may be able to cope with 10 concurrent users, but what about a million?

Use database services as much as possible. If you make a bad choice when choosing a database, you’ll find it significantly easier to swap to another one. Also, by using services you’ll find it easier to introduce NoSQL into an existing application.

Don’t feel like you have to use NoSQL. Relational databases are still, more often than not, the better choice.

Simon Jakowicz

Just another blogger