Home
FileMap is a file-based map-reduce system for data-parallel computation.
The map-reduce method of parallel computing was introduced by Google and further popularized by open-source implementations such as Hadoop and Disco. Map-reduce is remarkable in its simplicity and scalability. Traditional parallel environments have been based either on explicit message-passing APIs or on the appearance of a global shared-memory system. In contrast, map-reduce provides a rigid data-flow model in which the user need only write discrete kernel functions that fit within that data flow. This restrictive model does not support many communication patterns found in parallel codes, but it is sufficient for a large class of data-intensive computing tasks. In return for accepting these restrictions, programmers can write simple, serial functions that a map-reduce runtime can parallelize very effectively.
FileMap is developed around several characteristic themes:
- File-based, rather than tuple-based processing
- Reuse existing domain-specific and POSIX file processing tools
- Thin layer on top of a POSIX (Linux, Unix, macOS, etc.) environment. Don’t reinvent the wheel: use OpenSSH for network communication and authentication, and use existing file systems and file access controls
- Intermediate result caching
- Data replication
- Streaming (Still largely a work in progress.)
- No privileged user access or software installation required
- Support multiple coalitions of participating systems
The classic map-reduce example is to compute word frequency in a text corpus:
fm map -i "/checkov/*" "sed -f words.sed | fm split -n 9 |> sort | uniq -c"
Let’s walk through this example step by step. The FileMap command is “fm”, and you must always specify a sub-command; in this case it is “map”, which specifies and executes a computation. The “-i” option gives a wildcard expression that selects the input files to process. The remainder of the command is an execution pipeline that is applied to each input file. Although the Unix pipe (|) symbol is used, each element of the pipeline is actually passed an input filename (the original file or an intermediate result) and has its stdout and stderr saved in files.
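As a rough mental model (illustrative only, not how FileMap itself is implemented, and with hypothetical directory and file names), running the first stage over the inputs behaves much like the following shell loop, with each stage’s stdout and stderr captured into intermediate files that feed the next stage:
for f in /checkov/*; do
    # each pipeline element is given an input filename; its stdout becomes
    # an intermediate result file consumed by the next pipeline stage
    sed -f words.sed "$f" > "results/$(basename "$f").out" 2> "results/$(basename "$f").err"
done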
The first pipeline element uses sed to extract a list of words, one per line, from a text document. The words.sed script is as follows:
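# strip leading quote/parenthesis/apostrophe/hyphen/slash characters, then emit each run of letters on its own line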
s/[\"\(\'\-\/]*\([a-zA-Z]*\) */\1\
/g
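You can sanity-check the script by itself before running the full map, for example:
echo "It was a dark and stormy night" | sed -f words.sed
which should emit roughly one word per line (exact whitespace handling may vary slightly across sed implementations).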
The “fm split” command partitions an input file into a specified number of files (9 in this case). A hash is computed on each input line to determine which output partition the line belongs to; importantly, the same input word always lands in the same partition. Commands that know how to partition data use the FMOUTPUT environment variable to obtain the output filename base, to which they add an extension number for the partition (here, .1 through .9).
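Any program that follows this convention can act as a partitioner. As a rough sketch of the convention (illustrative only, not the actual fm split implementation, with a crude hash and the input filename assumed to arrive as input-file), an awk script could append each line of its input to one of $FMOUTPUT.1 through $FMOUTPUT.9:
awk -v base="$FMOUTPUT" -v n=9 '{
    # crude per-line hash; the real fm split hash will differ
    h = 0
    for (i = 1; i <= length($0); i++)
        h = (h * 31 + index("abcdefghijklmnopqrstuvwxyz", tolower(substr($0, i, 1)))) % n
    # write the line to the partition file for this hash value
    print > (base "." (h + 1))
}' input-file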
The “|>” operator is a reduce operation. First, all of the input files prior to this point in the pipeline are redistributed among the nodes. All files with the same suffix will be stored to the same node(s). The command on the right-hand side of the “|>” operator is responsible for reducing all of the input files for a given partition. This means that it will be executed with a list of input files constituting a partition. Each partition is processed separately (in parallel). In the case of the Unix “sort” command, it produces a single sorted output from multiple inputs.
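For example, for partition 3, the reduce stage is conceptually run over every .3 file produced by the split, something like (file names are illustrative):
sort intermediate-a.3 intermediate-b.3 intermediate-c.3
Because sort produces a single sorted stream from the concatenation of its arguments, identical words from different input files end up on consecutive lines, which is exactly what the next stage needs.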
Finally, the sorted output of each partition is passed to a “uniq -c” instance, the Unix command that tallies the number of identical consecutive lines (words, in this case). The “fm map” command implicitly concatenates the outputs of the last pipeline stage and writes them to stdout, which here prints our final word frequency list.
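Up to the ordering of the concatenated partitions, the whole computation produces the same word counts as the classic serial pipeline run over the entire corpus:
cat /checkov/* | sed -f words.sed | sort | uniq -c
FileMap simply spreads the per-file map stages and the per-partition reduce stages across the participating nodes.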