Column-family databases aren’t a million miles different from document databases. They hold most of the same pros and cons. There is however one significant difference, column-family stores use peer-to-peer replication, where document databases, use master-slave or primary-secondary replication.
Just like all NoSQL database types, you have a lot of databases choices. Common column-family data stores include HBase, Cassandra, Amazon SimpleDB and Hypertable. HBase was built by Apache and is debatably the most popular column-family store. However recently Cassandra has been getting a lot of attention, so that is what I will be covering.
Differences from relational databases.
- In column-family databases, each row consist of a collection of columns=>value pairs. A collection of similar rows then makes up a column family. In a relational databases, this would be equivalent to a collection of rows making up a table. The main difference is that in a column-family database, rows do not have to contain the same columns.
- RDBMS impose high cost of schema change for low cost of query change. Column families impose little to no cost in schema change, for slightly more cost in query change.
Differences from relational databases.
|Database Instance or MySQL instance
|Column (same for each row)
|Column (can be different per row)
- Column-family databases use peer-to-peer replication, meaning they do not have a master node.
- It is possible to set a columns value to a map of other columns, these are called super columns. The main column family would then typically be called a super column family. This is similar to an embedded document in MongoDB.
- When column-families are created, they need to be assigned to a keyspace, which is similar to a database in an RDBMS or document database.
- Just like key-value stores and document databases, it is possible to configure the read and write quorums to your application requirements. Read and write quorums can be set on a keyspace, column family, row or operation level.
- If you have a column which is read far more than any other column, this column can be used as the key (like a primary key) to optimise read performance. In Cassandra it is also possible to create bitmap indexes for other columns to improve read performance.
- Data is first written to a commit log, then written to an in-memory table called memtable (each column-family has its own memtable). A write is only considered successful once the memtable write has finished.
- Once a memtable is full, the data will be written to disk in the form of SSTables (Sorted String Tables), where they remain. SSTables are all sorted by key.
- If a node goes down before data is written to an SSTable, the commit log can be used to apply changes.
- If you ever need to fix all nodes to be up to date with the latest changes, this is possible by using the repair command. The repair command can work on an entire keyspace or a specific column-family. It’s worth noting this process can be very resource intensive.
- Cassandra provides a SQL like query language called CQL, which can be used as an alternative to Cassandra commands.
- Cassandra implements custom a column type called CounterColumnType, this can be used to increment the value of columns in a row. This is commonly used for creating counters.
- Cassandra implements expiring columns, meaning when the column is created, it is created with a TTL (Time to live) value. one the TTL expires, the column will be removed.
- In Cassandra each column has a time-stamp attached to it, this is useful for resolving write conflicts and dealing with stale data.
- When a node goes down, it will not affect reading or writing, the responsibilities will be take over by another node in the cluster. When a node comes back online, it’s responsibilities will be handed back to it. This is known as hinted handoff.
- Column-family databases are by design highly available. This is due to them having no master node and any node being available to accept reads and writes. This level of availability does however have the added trade-off against consistency.
- Cassandra has a consistency repair feature known as read repair. How this works is, when a read is done, there are two types of reads, direct reads and background reads. Direct reads represent the amount of nodes you have set to read as part of your consistency check. Background reads are the rest of the nodes. Cassandra will make sure all the background nodes reads have up to date data, any data which is out of date will automatically be issued updates.
- Cassandra has a tool for maintaining consistency on data which is rarely read, called Anti-Entropy Node Repair.
- Like key-value stores and document databases, column-family database do not support ACID transactions, but instead support the BASE properties.
- Aggregation of data in a query is not possible. Sums and averages etc have to be worked out by the client.
- Content Management – Column families could store page or post title and contents along with tags, categories, image paths, sidebars, etc. Separate column families could be used for comments or they could be stored as part of a super column family.
- Gaming – Cassandra’s excellent expiring column feature is fantastic for building games. A player casts a spell to increase their power by 5% for half an hour… easy peasy.