Mike Fisk edited this page Feb 16, 2018 · 8 revisions

In our CloudDP paper we compared performance across a variety of platforms:

  • Single laptop
  • 48-core SMP
  • 1000-node Hadoop cluster
  • A very heterogeneous cloud: 63 computers with 184 total hardware threads, running 4 different Linux distributions across 5 different U.S. cities, and including 3 separate small clusters, 2 virtual machines at a commercial cloud hosting service, and a network of workstations in a computer lab

Our data was a 22 GB, 87-million-record climate dataset partitioned into 100 shards. The workload was a histogram computed using cut and awk.
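
To make the workload concrete, here is a minimal Python sketch of a per-shard histogram followed by a merge, which is roughly the shape of a cut-then-awk counting pipeline. It is not the command used in the paper: the delimiter, the column number, the sample records, and the function names `histogram` and `merge` are all assumptions chosen for illustration.

```python
from collections import Counter


def histogram(lines, field=3, delimiter=","):
    """Count the values found in one column of a shard.

    `field` is 1-based, mirroring `cut -f`. The column number and the
    delimiter are assumptions, not the paper's actual settings.
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) >= field:
            counts[parts[field - 1]] += 1
    return counts


def merge(partials):
    """Sum the per-shard counts into one overall histogram."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical records standing in for two shards of the dataset.
shards = [
    ["1990,NM,cold", "1990,CA,warm"],
    ["1991,NM,cold"],
]

result = merge(histogram(shard) for shard in shards)
print(dict(result))
```

Because each shard is counted independently and only the small per-shard counts are combined, this kind of workload parallelizes naturally across shards.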

We showed that on the Hadoop cluster, FileMap was roughly twice as fast as Hadoop Streaming (52.6 versus 104.0 in the table below).

| Platform | Baseline Description | Baseline Time | FileMap Time | Speedup | Comments |
|----------|----------------------|---------------|--------------|---------|----------|
| Laptop | Serial | 315.7 | 410.6 | 0.8x | No opportunity for parallelism; FileMap scheduler adds some inefficiency |
| Laptop | Serial (gzip) | 227.4 | 378.5 | 0.6x | FileMap scheduler attempts to run second process; results in disk thrashing |
| SMP | Serial | 246.3 | 236.4 | 1x | On this SMP, disk I/O is the bottleneck so parallelism doesn't help |
| SMP | Serial (gzip) | 236.4 | 24.3 | 10x | When the inputs are compressed, disk I/O is reduced and we can get 10x parallelism |
| Cluster | Hadoop Streaming | 104.0 | 52.6 | 2x | Twice as fast as Hadoop Streaming |
| Cloud | Get + Serial | 403.2 | 29.3 | 14x | Nearly the same as co-located SMP |
| Cloud | Get + Serial (gzip) | 348.9 | 35.4 | 10x | Ratio of CPU/disk favors uncompressed |
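
The Speedup column appears to be the baseline time divided by the FileMap time, rounded. The short Python sketch below recomputes it from the times listed above; the `rows` list and the `speedup` helper are illustrative names, not part of FileMap.

```python
# (platform, baseline description, baseline time, FileMap time),
# copied from the results listed above.
rows = [
    ("Laptop", "Serial", 315.7, 410.6),
    ("Laptop", "Serial (gzip)", 227.4, 378.5),
    ("SMP", "Serial", 246.3, 236.4),
    ("SMP", "Serial (gzip)", 236.4, 24.3),
    ("Cluster", "Hadoop Streaming", 104.0, 52.6),
    ("Cloud", "Get + Serial", 403.2, 29.3),
    ("Cloud", "Get + Serial (gzip)", 348.9, 35.4),
]


def speedup(baseline_time, filemap_time):
    """Ratio of baseline time to FileMap time; above 1 means FileMap is faster."""
    return baseline_time / filemap_time


for platform, baseline, base_t, fm_t in rows:
    print(f"{platform:8} {baseline:20} {speedup(base_t, fm_t):5.1f}x")
```

A ratio below 1, as in the two laptop rows, means FileMap was slower than the serial baseline.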