Performance
Mike Fisk edited this page Feb 16, 2018 · 8 revisions
In our CloudDP paper we compared performance across a variety of platforms:
- Single laptop
- 48-core SMP
- 1000-node Hadoop cluster
- A very heterogeneous cloud: 63 computers with 184 total hardware threads, running 4 different Linux distributions across 5 different U.S. cities. It included 3 separate small clusters, 2 virtual machines at a commercial cloud hosting service, and a network of workstations in a computer lab.
Our data was a 22 GB climate dataset of 87 million records, partitioned into 100 shards. The workload was a histogram computed using `cut` and `awk`.
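
To make the shape of this workload concrete, here is a minimal Python sketch of a per-shard histogram followed by a merge. It is not the benchmark's actual `cut` and `awk` command, which this page does not give. The function names, the comma delimiter, the field index, and the sample records are all assumptions made for illustration.

```python
from collections import Counter


def shard_histogram(lines, field=2, delimiter=","):
    """Count values of one delimited field in a shard.

    Roughly what a `cut` (select field) plus `awk` (count) pipeline does.
    The field index and delimiter are hypothetical defaults.
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if field < len(parts):  # skip short or malformed records
            counts[parts[field]] += 1
    return counts


def merge_histograms(partials):
    """Sum per-shard counts into one overall histogram."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two tiny made-up shards standing in for the 100 real ones.
shards = [
    ["a,2001,cold", "b,2001,warm"],
    ["c,2002,cold"],
]
histogram = merge_histograms(shard_histogram(shard) for shard in shards)
print(dict(histogram))
```

Because each shard is counted independently and the partial counts are simply summed, this kind of job can run in parallel on whichever nodes hold the shards.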
We showed that on the Hadoop cluster, FileMap was twice as fast as Hadoop Streaming.
Platform | Baseline Description | Baseline Time | FileMap Time | Speedup | Comments |
---|---|---|---|---|---|
Laptop | Serial | 315.7 | 410.6 | 0.8x | No opportunity for parallelism; FileMap scheduler adds some inefficiency |
Laptop | Serial (gzip) | 227.4 | 378.5 | 0.6x | FileMap scheduler attempts to run second process; results in disk thrashing |
SMP | Serial | 246.3 | 236.4 | 1x | On this SMP, disk I/O is the bottleneck so parallelism doesn't help |
SMP | Serial (gzip) | 236.4 | 24.3 | 10x | When the inputs are compressed, disk I/O is reduced and we can get 10x parallelism |
Cluster | Hadoop Streaming | 104.0 | 52.6 | 2x | Twice as fast as Hadoop Streaming |
Cloud | Get + Serial | 403.2 | 29.3 | 14x | Nearly the same as co-located SMP |
Cloud | Get + Serial (gzip) | 348.9 | 35.4 | 10x | Ratio of CPU/disk favors uncompressed |
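
The Speedup column is consistent with dividing the baseline time by the FileMap time and rounding. The short Python sketch below recomputes it from the figures in the table. The `speedup` helper and the `rows` layout are illustrative names, not part of FileMap, and the page does not state the time unit, so the values are treated as unitless.

```python
# (platform, baseline description, baseline time, FileMap time), from the table above
rows = [
    ("Laptop", "Serial", 315.7, 410.6),
    ("Laptop", "Serial (gzip)", 227.4, 378.5),
    ("SMP", "Serial", 246.3, 236.4),
    ("SMP", "Serial (gzip)", 236.4, 24.3),
    ("Cluster", "Hadoop Streaming", 104.0, 52.6),
    ("Cloud", "Get + Serial", 403.2, 29.3),
    ("Cloud", "Get + Serial (gzip)", 348.9, 35.4),
]


def speedup(baseline_time, filemap_time):
    """Ratio of baseline time to FileMap time; above 1 means FileMap is faster."""
    return baseline_time / filemap_time


for platform, baseline, base_t, fm_t in rows:
    print(f"{platform:8} {baseline:20} {speedup(base_t, fm_t):5.2f}x")
```

A ratio below 1, as in the two laptop rows, means FileMap was slower than the serial baseline.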