MongoDB – Part 7 – MapReduce

MapReduce is a multi-step programming paradigm that has been around for roughly two decades. Its goal is to break the processing of a large collection of data into smaller jobs, with the data sometimes spread across different machines on a network. Running those jobs in parallel over multiple machines can significantly improve processing times.

The concept of MapReduce is inspired by the map and reduce functions in functional programming. MapReduce was originally defined as a three-step process (Map, Shuffle, Reduce). MongoDB has taken this one step further and added a finaliser step. So let’s see what’s involved in each step.

MapReduce Steps

Map – Maps every document in a collection to a key/value pair. Each document in the collection is run through the map function individually, and a new form of the document is emitted. This could be as simple as passing a document into a map function and multiplying a score value by 10. Think of map as mapping one document to another document, although a map function can also call emit several times for one document. Each emitted value must have a key, which is used in the shuffle and reduce steps.

Shuffle – Takes the values emitted by the map step and groups them by key, which ensures that each key has a single record mapped to all of its emitted values. This is done entirely in the background.

Reduce – This step boils down the results of the shuffle step. The reduce function is run for every key that came out of the shuffle step, and it is passed the key together with an array of the emitted values to aggregate.

Finaliser – This optional final step receives the output of the reduce step and can perform one further transformation on it before the response is returned to the client.
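The four steps can be sketched in plain JavaScript, outside MongoDB. This is only an in-memory illustration of the data flow, not how the server implements it. The sample documents and the names mapFn, reduceFn, finalizeFn and runMapReduce are made up for the example.

```javascript
// In-memory illustration of Map -> Shuffle -> Reduce -> Finalise.
// All names and sample data here are hypothetical.
var docs = [
    { user: "simon", score: 3 },
    { user: "anna",  score: 5 },
    { user: "simon", score: 4 }
];

// Map: turn one document into a (key, value) pair.
function mapFn(doc, emit) {
    emit(doc.user, { total: doc.score * 10 });
}

// Reduce: combine all values emitted for one key into a single value
// that has the same shape as an emitted value.
function reduceFn(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].total;
    }
    return { total: total };
}

// Finalise: optional last touch on each reduced value.
function finalizeFn(key, reduced) {
    return { user: key, total: reduced.total };
}

function runMapReduce(docs, mapFn, reduceFn, finalizeFn) {
    // Shuffle: group the emitted values by key.
    var groups = {};
    docs.forEach(function (doc) {
        mapFn(doc, function (key, value) {
            (groups[key] = groups[key] || []).push(value);
        });
    });

    // Reduce each group, then finalise it if a finaliser was given.
    var results = {};
    Object.keys(groups).forEach(function (key) {
        var reduced = reduceFn(key, groups[key]);
        results[key] = finalizeFn ? finalizeFn(key, reduced) : reduced;
    });
    return results;
}

var results = runMapReduce(docs, mapFn, reduceFn, finalizeFn);
console.log(results);
```

Here the two "simon" documents end up under one key, so the reduce step sees both of their values at once.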

Using MapReduce

Using MapReduce is actually pretty simple: define your map, reduce and (optionally) finalise functions, then execute them using the runCommand method.

db.runCommand({"mapreduce" : "collectionName", "map" : mapFunction, "reduce" : reduceFunction})

Example MapReduce

Search all documents in a collection and find every field name being used.

// This map is called for each document in the collection
map = function() {
    // Loop over each field in the current document (available as this)
    for (var key in this) {
        // For each field, emit is called with the field name as the key and {count : 1} as the value.
        emit(key, {count : 1});
    }
};

// This is run for every field name that the map function found.
// key = field name, emits = an array of {count : n} objects
reduce = function(key, emits) {
    // Declare total locally so it does not leak into the global scope
    var total = 0;
    for (var i = 0; i < emits.length; i++) {
        total += emits[i].count;
    }
    return {"count" : total};
};

// The response reports whether the MapReduce succeeded and the name of the collection where the result can be found, among other things.
db.runCommand({"mapreduce" : "collectionName", "map" : map, "reduce" : reduce})
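Before sending functions like these to the server, it can help to dry-run them locally. The sketch below assumes nothing about MongoDB itself: it defines a stand-in emit that just records what the map function produces, and the sample documents are invented for illustration.

```javascript
// Local dry run of the field-name example; no server involved.
// `emitted` and the stand-in emit() below are hypothetical helpers.
var emitted = {};
function emit(key, value) {
    (emitted[key] = emitted[key] || []).push(value);
}

var map = function () {
    for (var key in this) {
        emit(key, { count: 1 });
    }
};

var reduce = function (key, emits) {
    var total = 0;
    for (var i = 0; i < emits.length; i++) {
        total += emits[i].count;
    }
    return { count: total };
};

// Made-up sample documents.
var sampleDocs = [
    { _id: 1, name: "simon", age: 30 },
    { _id: 2, name: "anna" }
];

sampleDocs.forEach(function (doc) {
    map.call(doc); // bind the document to `this`, as the map function expects
});

// Reduce the values collected for each field name.
var fieldCounts = {};
Object.keys(emitted).forEach(function (key) {
    fieldCounts[key] = reduce(key, emitted[key]);
});
console.log(fieldCounts);
```

With these two documents, _id and name are each counted twice and age once.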

MapReduce Keys

When running MapReduce jobs it can be useful to pass along additional keys besides map and reduce. Below are some of the most useful keys:

finalize – Takes a function with the signature finalize(key, value). It is called once for each value returned by the reduce function and can be used to further process the response.

keeptemp – Takes a boolean value that says whether the collection containing the response should be kept. By default it is removed when the connection is closed.

out – The name of the collection where the response should be saved. Setting this key automatically sets keeptemp to true.

query – Takes an object literal of criteria for data selection, e.g. { "name" : "simon" }

limit – The maximum number of documents you want to select.
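Putting several of these keys together might look like the sketch below. The collection name users, the output name fieldCounts and the query are placeholders, and the finalize function assumes it receives the key and the reduced value. The typeof db guard is only there so the snippet can also be loaded outside the mongo shell. Which keys are accepted depends on the server version, so treat this as an assumption to check against your release.

```javascript
var map = function () {
    for (var key in this) {
        emit(key, { count: 1 });
    }
};

var reduce = function (key, emits) {
    var total = 0;
    for (var i = 0; i < emits.length; i++) {
        total += emits[i].count;
    }
    return { count: total };
};

// Assumption: finalize is called with the key and the reduced value.
var finalize = function (key, reduced) {
    return reduced.count; // unwrap {count : n} into a plain number
};

var command = {
    mapreduce: "users",        // placeholder collection name
    map: map,
    reduce: reduce,
    finalize: finalize,
    query: { name: "simon" },  // only documents matching this are mapped
    limit: 100,                // at most 100 documents are selected
    out: "fieldCounts"         // placeholder output collection name
};

// Only talk to the server when running inside the mongo shell.
if (typeof db !== "undefined") {
    printjson(db.runCommand(command));
}
```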

MapReduce Tips

The value emitted by map must have the same shape as the value returned by reduce. That way the output from one reduce step can be fed into another reduce step.
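One way to check this property is to reduce in two stages and compare the result with a single pass. The sketch below uses a counting reduce like the one in the example earlier; the input arrays are made up.

```javascript
var reduce = function (key, emits) {
    var total = 0;
    for (var i = 0; i < emits.length; i++) {
        total += emits[i].count;
    }
    return { count: total };
};

// Two made-up batches of values emitted for the same key.
var firstBatch = [{ count: 1 }, { count: 1 }];
var secondBatch = [{ count: 1 }, { count: 1 }, { count: 1 }];

// Reduce each batch on its own, then reduce the partial results.
var partialA = reduce("name", firstBatch);
var partialB = reduce("name", secondBatch);
var twoStage = reduce("name", [partialA, partialB]);

// Reduce everything in one go.
var onePass = reduce("name", firstBatch.concat(secondBatch));

console.log(twoStage.count === onePass.count); // prints true
```

If reduce returned a bare number instead of {count : n}, the second stage would read count from a number and the totals would no longer match.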

MapReduce is slow and should not be used on real-time data.

Aggregation functions

// Total documents in a collection
db.collection.count()

// Get every distinct age in the users collection
db.runCommand({"distinct" : "users", "key" : "age"})

Conclusion

MapReduce is a very powerful concept within computer science. There are books devoted to just this subject, and the fact that it’s still around after 20 years says a lot for its credibility. Just be careful not to get stuck in the one-size-fits-all trap, aptly named the golden hammer antipattern. MongoDB is a great all-round database, but MapReduce isn’t its strongest feature. If crunching huge amounts of data is what you’re doing, you may be better off using Apache Hadoop or something similar.

Simon Jakowicz

Just another blogger