MongoDB – Part 4 – Replication

CAP, the theorem behind MongoDB and most NoSQL databases, states that no distributed system can provide consistency, availability and partition tolerance. In this article I’m going to be writing about what MongoDB offers in the replication sector and how you can utilise replication to customise all aspects of the CAP theorem.

I’ll be covering: how to create replication sets, how replication works in MongoDB, how to configure individual nodes in a set, how to retrieve the status of nodes, how to configure write concern when executing queries and common CAP configurations.

Creating replica sets

Creating replica sets in MongoDB is a pretty simple process, if you already have a running mongod instance, you’ll have to stop it first. Otherwise you can follow these instructions:

# Start each mongo daemon with the --replSet flag
# this flag provides the replica set name, each node must use the same name
mongod --replSet "rs0"

# Connect to the node you want to be the primary
mongo 

# Initialise the replica set
# This will make the current node a primary
rs.initiate()

# Add all secondary nodes to the set, it can take a few seconds for a node to connect to a cluster
# The port is optional and will default to 27017
rs.add("hostname:port")
rs.add("hostname:port")

This is all well and good for development environments, but in production you’re going to want to use some kind of authentication, to prevent rouge nodes from accessing your cluster. MongoDB caters for this using the challenge-response authentication mechanism. It makes the process slightly more complicated, but MongoDB have some fab docs, telling you just what to do.

If you want to remove a node from a replica set, you can use the rs.remove() method.

rs.remove("hostname:port")

How replication work

Replication is the process of making sure data is mirrored across multiple nodes. These nodes could be running on the same machine or completely different ends of the earth. Ideally they’ll be somewhere in the middle of that e.g multiple warehouses and geographical locations. Replication works in MongoDB by linking multiple nodes together using the commands above. Once a collection of nodes are connected, one will be assigned as the primary and the others will be assigned as secondaries.

Each node is set a sync source, this essentially defines which node it should copy changes from. It’s usually a sensible option to have all nodes sync from the primary, however there are some benefits to having secondaries sync from other secondaries. Just be careful not to create large replication chains or even worse a replication loop. Every two seconds, every node in a replica set will ping every other member of the set to make sure they’re all still accessible and the cluster is still fully functional. If a node does not get a response from a ping within 10 seconds, the cluster will mark the node as inaccessible and the remaining nodes will attempt to continue without it. Once the node is accessible again it will automatically rejoin the replica set and start syncing missing changes.

You can update the sync source of a node by connecting to it then executing:

rs.syncFrom(“host:port”)

The magic behind replication is in what MongoDB calls the oplog. The oplog is a capped collection stored inside the local database (local.oplog.rs). Every node will have its own oplog containing it’s most recent changes. When a node pings its sync source every 2 seconds, it checks if there are any new changes to sync and if there are, it will automatically start syncing them. Once a node has synced a bunch of changes, it will add the changes to its own oplog. Then if that node is the sync source for any nodes, they can start syncing too. As I mentioned above, the oplog is a capped collection, what this means is the oplog will only store a certain amount of changes. The amount of changes is completely configurable and it’s recommended to store at least a weeks worth. Disk space is literally cheaper than chips, so you don’t have much to lose from having a huge oplog.

By default, MongoDB will create an oplog with a size equal to 5% of free storage. This can be overridden by starting mongod with the –oplogSize flag and the size in megabytes:

# Create a 1GB oplog
mongod --replSet "rs0" --oplogSize 1024

if you want to adjust the size of your oplog, the oplog collection must be dropped and recreated. Be very careful when doing this, not to lose changes which have not been synced elsewhere.

It’s important to note that in MongoDB only the primary node can receive write requests, however all nodes can be used for reading, regardless it’s almost always best to read from the primary. The sole purpose of replication, should be to prevent data loss in your application, remember only the primary node receives writes, so your secondaries may not necessarily be 100% up to date. If can afford to serve up slightly outdated data to your users, then you can potentially distribute read requests across your secondaries. As much as that probably sounds like a great idea, it’s rarely acceptable to serve up outdated data and there’s a much better way of dealing with scaling in the form of sharding, which will be discussed in a later article.

If you want to allow your secondary nodes to serve read requests, you can execute this command:

rs.slaveOk()

Node configuration

As much as it might look like MongoDB randomly picks a node to make the primary. It’s actually being intelligent and making sure it only allows one of the most up to date nodes to become the primary. This can be configured even further by setting a priority on each member of the set, MongoDB will then pick the node with the highest priority value and that is completely up to date. A member with a priority of 0 can never become the primary.

For a node to become the primary node, it must first request that it become the primary to the other nodes, this happens automatically when the node thinks it’s better suited as the primary than the current primary. All nodes will then vote on whether the node in contention should be elected as the primary or not. If any node vetoes the election, the primary will not be changed. A node will only veto when it thinks it should actually be the primary itself, in this case, it will then make a request itself to become the primary. Rinse, repeat.

The current node config can be retrieved by executing:

rs.config()

This will return some JSON containing a list of replica set members. You can edit the JSON (e.g. to set priority) however you like and reconfigure the nodes using the below options.

# The config variable is essentially the edited version of rs.config()
rs.reconfig(config)

Other useful config values you might want to set, include:

buildIndexes – If the node should create indexes. Nodes with buildIndexes set to false should also have a priority of 0 is it can never become the primary. Disabling buildIndexes can be useful on backup servers.
tags – Tags assigned to the node, these tags can be used with WriteConcern, e.g. write to all node with the tag SSD. More on this soon.
slaveDelay – The number of seconds a secondary should always remain behind its sync source. This is very useful as a contingency against data corruption or destruction.
hidden – If hidden, the node cannot be used for reads. Hidden nodes are useful for backing up databases and for nodes running with a slaveDelay.
arbiterOnly – If true the node is an arbiter. More on this soon.
vote – Should the node be able to vote when a primary node is being elected. This should be boolean. A maximum of 7 nodes in a set are allowed to vote.

Arbiters

Arbiters in MongoDB, as you can probably guess are just like arbiters in the real world, they are used to settle disputes. But what kind of disputes you might ask? When there is a partition between nodes, arbiters help to resolve this by giving the final vote as to who should be the primary, this does mean it’s only worth having an arbiter, if you have an even number of data nodes. The reason for this is that if there is an odd number of data nodes, there will already be a natural winner and adding an arbiter would actually go against you. Arbiters are not data nodes, so cannot accept read requests, it’s common practice to have arbiters running on very low performance boxes or even an application server. The most common use of an arbiter is when there isn’t enough budget, to run 3 data nodes in replication.

Using arbiters

Arbiters nodes are started just like data nodes and are be added to a replica set similarly too. Lets take a look:

# Start a new mongo daemon
mongod --replSet "rs0"

# Connect to the primary and add the new arbiter to the set
rs.addArb("hostname:port")

Tagging replica sets & write concern

Write concern is a concept which says how many nodes and sometimes which nodes in particular a write should be made too. Write concern can be set to any of these values:

1 – Write to the primary only.
0 – No acknowledgement of a successful write is required. So just attempt to write to the primary.
Any other number (/d+) – Make sure the write is successful on the primary and synced to n-1 secondary nodes.
majority – Make sure that at least 50% of voting members have been written too.
<tag set> – Makes sure the specified tagged nodes are written to. Described below.

Creating and using tags

# Add tags to replica set members
conf = rs.config()
conf.members[0].tags = { "datacenter.east": "rack1"  }
conf.members[1].tags = { "datacenter.east": "rack2"  }
conf.members[2].tags = { "datacenter.west": "rack1"  }

# Create a write concern tag called east, which writes data too the 2 nodes tagged with datacenter.east
# Why this key is called getLastErrorModes, is beyond me.
conf.settings = { getLastErrorModes: { east : { "datacenter.east": 2 }}

# Save the new config
rs.reconfig(conf)

# Use the new "east" write concern tag to decide where to write too
db.users.insert( { name: "Simon" }, { writeConcern: { w: "east" } } )

Statuses

There are a bunch of methods you can use to get information on how your replica set is setup and how it is currently performing. Some of the most useful include:

rs.printReplicationInfo() – Get oplog size information on the primary.
rs.printSlaveReplicationInfo() – Get oplog size information on a secondary. This tells you how far the node is behind the primary.
db.isMaster() – Used to see if this is the primary node or a secondary node.

Maintaining nodes in a replica set

Maintaining your nodes in MongoDB can be a tricky task. You have to make sure that when adding, removing or editing nodes, you don’t lose any data. Generally the best way to get around this, is by taking secondary nodes offline, one at a time and making changes. Once all of your secondaries have be adjusted, you can take the primary offline and make the change there. Some useful commands when performing maintenance include:

rs.stepDown(seconds=60) – Make a primary become a secondary for the amount of seconds provided. 60 seconds is the default.
rs.freeze(seconds) – Prevent a secondary from becoming a primary for the amount of seconds provided. Alternatively the secondaries priority could be changed to 0.

Another common approach in MongoDB, is to create a cronjob to put any slow secondary nodes into maintenance mode. Putting a node into maintenance mode, prevents reads being send to the node, allowing it to catch up on syncing changes. To put a node into maintenance mode, you must first connect to it and then execute:

db.adminCommand({ "replSetMaintenance": true });

You can disable maintenance mode again by executing:

db.adminCommand({ "replSetMaintenance" : false });

Common CAP configurations

The common notion with the CAP theorem is, you can only support 2 aspects of CAP at any time. While this isn’t strictly true, there are trade offs you should be aware of.

Consistency Vs latency

This is a very simple trade off, which will help you to understand the consistency vs availability trade off, which we’ll discuss next. The theory is that every time your database performs a write, you have to say which nodes you would like the initial save to replicate too. This could be 1 node or 50 nodes or anything in between, it’s up to you. The trade off is that if you save to just 1 node, you get a very quick response and so latency is low, however consistency is also low. If you have a failure on the single node with changes, you will lose them. Alternatively you could opt to make sure data is replicated to all 50 nodes, this would be very slow, but it would also mean even if 49 nodes failed, you wouldn’t lose any data. The write concern you choose, should largely depend on the requirements of the application.

Consistency Vs Availability

So we have already covered what consistency means, availability on the other hand means, how available the system is, for reads and writes. Think back to the example above, if you have 50 nodes and a write concern of just 1 node. You wont have consistency, but even if 49 nodes went down, your 1 surviving node could be made the primary and continue making writes. On the flip side, if you had a write concern of 50, all of the nodes would be consistent, however if 1 node goes down, the application will no longer be able to perform writes.

Resolving conflicts after a network partition

If there is a network partition and as a result, an out of date secondary becomes the new primary, there will unfortunately be conflicts between the new and old primary, when the network partition is resolved. In this situation the old primaries changes are rolled back. These can then be reapplied manually. Rollbacks can fail if there are over 300MB of data to rollback or over 30 mins of operation time. In this situation, the old primary will need to go through an initial sync again.

Initial Sync

Initial sync is a process which must be run on new and out of date nodes e.g. after a network partition is resolved – as above. Initial sync involves copying all of the data from one node onto a new node. While the sync is happening new changes could also be applied, data is synced again to pull the latest changes from the oplog. Indexes are then added, a final sync of changes from the oplog are then applied. It’s possible for their to be so many changes, that when the 2nd or 3rd sync happens from the oplog, there may have been changes which are not synced but have been removed from the oplog – as it’s a capped collection. This will make the sync fail. This can be overcome by making the oplog bigger or by using a more durable data import solution.

Conclusion

Phew, you exclaim. That was a lot to take in right? and I’m guessing you can’t remember half of it. Replication and MongoDB in general are not easy subjects to digest, for you to really learn this stuff you’re going to have to read this all again and probably again after that. If you didn’t the first time around, I suggest you try to follow along with what I describe. Start by spinning up a couple of mongod instances on you local machine, hook them up together, maybe chuck in an arbiter too. Setup the nodes to sync from and edit the config of your nodes. e.g. priority and tags. Once you have done all of this you can start sending writes to your MongoDB and see how they propagate throughout your nodes. Take nodes offline and monitor how your other nodes will automatically take on the primary role. Then you can start toiling with write concern, see how increasing consistency mean taking longer to perform writes or even completely preventing writes, when you have not enough nodes online. Once you have a firm grasp on how all of these concepts link together to form replication in MongoDB, you’ll be halfway to becoming the rockstar MongoDB user, you’ve always wanted to be. In the next article, I’ll be covering sharding, it’s going to be another big one, so don’t take off those glorious new golden DBA boots just yet.

Thanks for reading!

1 Loves This

Simon Jakowicz

Just another blogger