I’ve been hearing about MapReduce quite often lately, so I’ve decided to give it a closer look. MapReduce was originally developed and implemented by Google to perform tasks on massive amounts of data (e.g., indexing the web). It is a programming model that aims to make distributed computing on large clusters (thousands of machines) easily accessible to the average programmer.
The MapReduce programming model is inspired by functional programming (specifically Lisp) and is quite simple to understand. To run an algorithm as a distributed MapReduce task, the programmer implements it in the form of two functions, “map” and “reduce”; the platform takes care of all the rest. Here’s an explanation from the original MapReduce paper by Dean and Ghemawat:
The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.
The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
This may sound a bit strange at first, so let’s look at a very simple example. The example is not realistic, but it serves the purpose of explaining MapReduce.
Example: Counting Character Occurrence in a String
Imagine we want to count the number of times each character occurs in a string. The sequential algorithm to perform this task is pretty trivial:
for each character c in text
character_count[c] += 1
The resulting character counts for “hello world” would be something like:
'd' => 1
'e' => 1
'h' => 1
'l' => 3
'o' => 2
'r' => 1
'w' => 1
' ' => 1
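The sequential algorithm above can be sketched in a few lines of Python (the function name is just for illustration):

```python
from collections import defaultdict

def count_characters(text):
    """Sequentially count how many times each character occurs in a string."""
    character_count = defaultdict(int)
    for c in text:
        character_count[c] += 1
    return dict(character_count)

print(count_characters("hello world"))
# {'h': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'w': 1, 'r': 1, 'd': 1}
```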
Counting Character Occurrence with MapReduce
To implement this algorithm using the MapReduce programming model, we need to express it as map and reduce functions:

map(string key, string text)
  for each character c in text do
    emit_intermediate(c, "1")

reduce(string char, list<string> values)
  count = 0
  for each value v in values do
    count += integer(v)
  emit(char, string(count))
The map function of our algorithm goes through each character of the input string and stores a count of “1” in an intermediate key/value pair (through the emit_intermediate function).
The reduce function receives a character with an associated list of values (i.e., a list of “1”s) and sums up the values for that character. Once all the values are added, the reduce function outputs the total count for the current character (through the emit function).
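The two pseudocode functions translate almost directly into Python. This is a hedged sketch: emit_intermediate and emit are modeled as simply collecting and returning pairs, and the function names are my own:

```python
def map_fn(key, text):
    """Map: emit an intermediate (character, "1") pair for every character."""
    intermediate = []
    for c in text:
        intermediate.append((c, "1"))  # stands in for emit_intermediate(c, "1")
    return intermediate

def reduce_fn(char, values):
    """Reduce: sum the list of "1" strings collected for one character."""
    count = 0
    for v in values:
        count += int(v)
    return (char, str(count))  # stands in for emit(char, string(count))

print(map_fn(None, "hel"))              # [('h', '1'), ('e', '1'), ('l', '1')]
print(reduce_fn('l', ['1', '1', '1']))  # ('l', '3')
```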
Figure 1 illustrates counting the number of occurrences of each character in the string “hello world” with MapReduce. The map and reduce functions run in different processes and can be executed by multiple processes at the same time. In the figure, M1 and M2 are worker processes executing the map function on the input data, while R1 and R2 are processes executing the reduce function on the intermediate data generated by M1 and M2. In a real-world scenario, there could be a large number of worker processes; furthermore, the number of processes in the map phase need not be the same as in the reduce phase. The steps are labeled from 1 to 6; let’s go through each of them.
1) The input data is partitioned into “splits”. In our example, we split “hello world” into two pieces (ignoring the white space for simplicity) and assign one piece to each worker process (i.e., M1 and M2). Typically, at this step the data is broken into chunks of 16MB to 128MB.
2) Each worker passes its assigned substring to the map function implemented by the programmer.
3) The map function calls emit_intermediate, which writes the data to a local bucket; there is one bucket for each reduce process. In this example, keys are assigned to buckets by splitting the alphabet in two: a–m go to R1 and n–z go to R2. In a real-world scenario, however, we would typically assign a key to a bucket by hashing it and taking the result modulo the number of reduce processes: hash(key) mod R.
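The hash-partitioning rule from step 3 could look like the sketch below. Summing character codes is just a stable stand-in for a real hash function (Python’s built-in hash() is salted per process, so it would not give stable buckets across runs):

```python
def bucket_for(key, num_reducers):
    """Assign an intermediate key to a reduce process: hash(key) mod R."""
    h = sum(ord(ch) for ch in key)  # toy deterministic hash of the key
    return h % num_reducers

# Every occurrence of the same key lands in the same reducer's bucket.
print(bucket_for('l', 2) == bucket_for('l', 2))  # True
```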
4) Each reduce process reads its bucket of data from each map process. This data is aggregated into a list for each key.
5) The reduce function implemented by the programmer is called on each “key/list of values” pair.
6) The reduce function emits the data, which is written to the output.
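The six steps above can be tied together in a single-process simulation. This is only a sketch of the data flow, not a real distributed implementation: the splits, buckets, and reducers all live in one Python process, and the stand-in hash and helper names are my own:

```python
from collections import defaultdict

def map_fn(text):
    """Map phase: emit a (character, "1") pair for every character."""
    return [(c, "1") for c in text]

def reduce_fn(char, values):
    """Reduce phase: sum the "1"s collected for one character."""
    return (char, sum(int(v) for v in values))

def map_reduce(text, num_splits=2, num_reducers=2):
    """Simulate steps 1-6 of the character-count job in one process."""
    # 1) Partition the input into splits, roughly one per map worker.
    size = max(1, len(text) // num_splits)
    splits = [text[i:i + size] for i in range(0, len(text), size)]
    # 2-3) Run map on each split and shuffle pairs into reducer buckets.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            r = sum(ord(ch) for ch in key) % num_reducers  # toy stable hash
            buckets[r][key].append(value)
    # 4-6) Each reducer aggregates its keys' value lists and emits output.
    output = {}
    for bucket in buckets:
        for key, values in bucket.items():
            char, count = reduce_fn(key, values)
            output[char] = count
    return output

print(map_reduce("hello world"))
```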
This example aims to illustrate the “big picture” of MapReduce; a lot of subtleties that can vary from one implementation to another were left out on purpose. If you are interested in learning more about MapReduce, I encourage you to look at the resources listed below.
The MapReduce model may seem simplistic, but it turns out that in practice a lot of useful algorithms can be implemented in this framework, such as web indexing, machine learning algorithms, and statistical analysis.
MapReduce References and Resources
- Hadoop - an open-source Java implementation of MapReduce by the Apache Foundation.
- Simple MapReduce in Ruby - a Ruby implementation just to play around with.
- Skynet - another Ruby MapReduce implementation.