Performance

In our CloudDP paper we compared performance across a variety of platforms:

Single laptop
48-core SMP
1000-node Hadoop cluster
A very heterogeneous cloud (63 computers with 184 total hardware threads, running 4 different Linux distributions across 5 different U.S. cities, and including 3 separate small clusters, 2 virtual machines at a commercial cloud hosting service, and a network of workstations in a computer lab)

Our data was a 22GB, 87-million-record climate dataset partitioned into 100 shards. The workload was a histogram computed using cut and awk.
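As a rough illustration of what such a workload looks like, here is a minimal Python sketch of a per-shard histogram followed by a merge. It is not the cut and awk pipeline used in the paper: the tab delimiter, the field index, the sample records, and the function names `shard_histogram` and `merge_histograms` are all assumptions made for the example.

```python
from collections import Counter

def shard_histogram(lines, field=0, delimiter="\t"):
    """Count occurrences of one delimited field in a single shard.

    Roughly what extracting a column with cut and tallying it with awk does.
    The field index and delimiter are assumptions, not the paper's settings.
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if field < len(parts):  # skip records that lack the requested field
            counts[parts[field]] += 1
    return counts

def merge_histograms(histograms):
    """Combine per-shard counts into one overall histogram."""
    total = Counter()
    for h in histograms:
        total.update(h)  # Counter.update adds counts rather than replacing them
    return total

# Two tiny made-up shards standing in for the 100 real ones.
shards = [
    ["2001\t12.5\n", "2001\t13.1\n", "2002\t11.0\n"],
    ["2002\t10.4\n", "2003\t9.8\n"],
]
result = merge_histograms(shard_histogram(s, field=0) for s in shards)
print(sorted(result.items()))  # [('2001', 2), ('2002', 2), ('2003', 1)]
```

Because each shard is counted independently and the counts are only combined at the end, the per-shard step can run wherever the shard is stored.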

We showed that on the Hadoop cluster, FileMap was twice as fast as Hadoop Streaming.

Platform	Baseline Description	Baseline Time	FileMap Time	Speedup	Comments
Laptop	Serial	315.7	410.6	0.8x	No opportunity for parallelism; FileMap scheduler adds some inefficiency
Laptop	Serial (gzip)	227.4	378.5	0.6x	FileMap scheduler attempts to run second process; results in disk thrashing
SMP	Serial	246.3	236.4	1x	On this SMP, disk I/O is the bottleneck so parallelism doesn't help
SMP	Serial (gzip)	236.4	24.3	10x	When the inputs are compressed, disk I/O is reduced and we can get 10x parallelism
Cluster	Hadoop Streaming	104.0	52.6	2x	Twice as fast as Hadoop Streaming
Cloud	Get + Serial	403.2	29.3	14x	Nearly the same as co-located SMP
Cloud	Get + Serial (gzip)	348.9	35.4	10x	Ratio of CPU/disk favors uncompressed
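The Speedup column appears to be the baseline time divided by the FileMap time, rounded. The short sketch below recomputes it from the figures in the table; the `speedup` helper is a hypothetical name for this example, and because the result is a ratio it does not depend on the time unit.

```python
# (platform, baseline description, baseline time, FileMap time), copied from the table
rows = [
    ("Laptop", "Serial", 315.7, 410.6),
    ("Laptop", "Serial (gzip)", 227.4, 378.5),
    ("SMP", "Serial", 246.3, 236.4),
    ("SMP", "Serial (gzip)", 236.4, 24.3),
    ("Cluster", "Hadoop Streaming", 104.0, 52.6),
    ("Cloud", "Get + Serial", 403.2, 29.3),
    ("Cloud", "Get + Serial (gzip)", 348.9, 35.4),
]

def speedup(baseline_time, filemap_time):
    """Ratio of baseline time to FileMap time; values above 1 mean FileMap was faster."""
    return baseline_time / filemap_time

for platform, description, baseline, filemap in rows:
    print(f"{platform:8} {description:20} {speedup(baseline, filemap):5.2f}x")
```

A value below 1, as in the two laptop rows, means FileMap was slower than the serial baseline.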
