Tutorial
The following tutorial shows how to use hadoop-benchmark on a single machine (a local laptop) using VirtualBox on OSX and Linux. Options for other platforms are available by changing the driver used by docker-machine (cf. below).
All the steps have been performed on a MacBook Pro 3.1 GHz Intel Core i7, 16GB RAM, running OSX 10.12.2. A screencast of all these steps on this system is available from: asciicast.
We will do the following:
- Create a 4-node cluster
- one node for distributed service discovery (running the Consul kv-store) with 512MB of RAM
- one node as the Hadoop controller (running ResourceManager, NameNode) with 2048MB of RAM
- two nodes as Hadoop workers (running NodeManager, DataNode) with 2048MB of RAM each
- Start vanilla Hadoop
- Run benchmarks
- Hadoop canonical benchmark
- Intel HiBench benchmarks
- Start a customized Hadoop with a feedback control loop that self-balances job parallelism and throughput (cf. Zhang et al.)
- Run benchmarks
- Hadoop canonical benchmark
- Intel HiBench benchmarks
- Compare the results
- Destroy the cluster
Please note that we do not run the SWIM benchmarks as they are computationally expensive. Each run either takes a significant amount of time or simply won't run on a cluster created on a single laptop. The demonstration running hadoop-benchmark on a Grid5000 cluster shows, in addition to what is presented here, how to run a SWIM benchmark, extract the results and make a basic comparison.
- Currently the hadoop-benchmark has been tested on Linux and OSX
- Install docker, docker-machine and VirtualBox
- Install docker
- Install docker-machine
- Install VirtualBox
- On OSX all can be installed using homebrew and homebrew cask
brew install docker docker-machine
brew cask install virtualbox
- Check versions
- docker version >= 1.12
$ docker --version
Docker version 1.12.6, build 78d1802
- docker-machine >= 0.8
$ docker-machine --version
docker-machine version 0.8.2, build e18a919
- VirtualBox >= 5.1
$ VBoxManage --version
5.1.12r112440
- Install the latest version of hadoop-benchmark from GitHub
$ git clone https://github.com/Spirals-Team/hadoop-benchmark.git
Cloning into 'hadoop-benchmark'...
remote: Counting objects: 1042, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 1042 (delta 13), reused 0 (delta 0), pack-reused 1012
Receiving objects: 100% (1042/1042), 1.60 MiB | 0 bytes/s, done.
Resolving deltas: 100% (554/554), done.
- (optional) R >= 3.3.2
$ R --version
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin16.1.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
- (optional) R packages:
- tidyverse
Rscript -e 'library(tidyverse, quiet=T)'
- Hmisc
Rscript -e 'library(Hmisc, quiet=T)'
Both packages can be installed with the following R command:
install.packages(c("tidyverse", "Hmisc"))
First, create a cluster on top of which we can then provision Hadoop.
The hadoop-benchmark uses simple configuration files to specify its settings. We start with a new configuration created from the configuration template used for vanilla Hadoop in a local cluster.
cd hadoop-benchmark
cp scenarios/vanilla-hadoop/local_cluster mycluster
The configuration contains a number of options:
# the name of the cluster
CLUSTER_NAME_PREFIX='local-hadoop'
# the driver to be used to create the cluster
# for list of available drivers go to https://docs.docker.com/machine/drivers/
DRIVER='virtualbox'
# the extra driver options for controller and compute nodes
# for list of available options `docker-machine create -d virtualbox --help | grep "\-\-virtualbox"`
DRIVER_OPTS='--virtualbox-memory 2048'
# override the memory for consul
DRIVER_OPTS_CONSUL='--virtualbox-memory 512'
# number of compute nodes
NUM_COMPUTE_NODES=1
# docker images to be used
HADOOP_IMAGE='hadoop-benchmark/hadoop'
HADOOP_IMAGE_DIR='scenarios/vanilla-hadoop/images/hadoop'
We will leave all the defaults but change the number of compute nodes to 2:
# number of compute nodes
NUM_COMPUTE_NODES=2
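The change can be made with any text editor; as a convenience, a hedged one-liner (sed with a backup suffix, which works with both GNU and BSD sed) could look like:
# change the number of compute nodes in the config (keeps a mycluster.bak backup)
sed -i.bak 's/^NUM_COMPUTE_NODES=1/NUM_COMPUTE_NODES=2/' mycluster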
If you have less memory in your system, you can keep the default. If you create a cluster that is too big, you might run into problems with insufficient memory. Some processes might segfault because malloc cannot allocate the required memory. Please note that this usually only happens in local environments and not in real clusters.
The driver is set to virtualbox, but there are other drivers available. The docker-machine supports Microsoft Azure, G5K, Amazon EC2, GCE, VMware, etc. The complete list is available in the docker-machine documentation.
The DRIVER_OPTS specify additional options. In this tutorial we specify the memory for the VirtualBox VMs (concretely for the controller and compute nodes). We will use 2GB instead of the 1GB default. There are many other options (they can be listed using docker-machine create -d virtualbox --help | grep virtualbox) that allow further customization of the nodes.
We have also seen that the driver options can be overridden for individual machines by adding _<NODE_NAME_IN_CAPITAL> (e.g. DRIVER_OPTS_CONTROLLER).
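For illustration, a hedged configuration snippet overriding only the controller's memory (the values below are arbitrary examples) could look like this:
# default for controller and compute nodes
DRIVER_OPTS='--virtualbox-memory 2048'
# give only the controller more memory (hypothetical values)
DRIVER_OPTS_CONTROLLER='--virtualbox-memory 4096'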
In this step we create the cluster.
The nodes will be created using the specified DRIVER. In our case, all machines will be created as VirtualBox VMs.
Altogether, we need to create 4 nodes:
- one Hadoop controller node running YARN ResourceManager and HDFS NameNode services
- two Hadoop compute nodes (derived from NUM_COMPUTE_NODES) running YARN NodeManager and HDFS DataNode services
- one node that acts as the cluster guardian; it runs Consul, a distributed service discovery used by docker to know about the nodes in the cluster
First, we make sure there are no docker machines:
$ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
Create the cluster:
CONFIG=mycluster ./cluster.sh create-cluster
Instead of prefixing each command with CONFIG=mycluster, we can simply export it:
export CONFIG=mycluster
./cluster.sh create-cluster
After the command has run, we can check the docker machines status:
$ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
local-hadoop-compute-1 - virtualbox Running tcp://192.168.99.102:2376 local-hadoop-controller v1.12.6
local-hadoop-compute-2 - virtualbox Running tcp://192.168.99.103:2376 local-hadoop-controller v1.12.6
local-hadoop-consul - virtualbox Running tcp://192.168.99.100:2376 v1.12.6
local-hadoop-controller - virtualbox Running tcp://192.168.99.101:2376 local-hadoop-controller (master) v1.12.6
- the first two nodes local-hadoop-compute-{1,2} are for the Hadoop workers
- the local-hadoop-consul runs the distributed service discovery needed by docker
- the local-hadoop-controller is designated for the resource manager
The local-hadoop- prefix is derived from the CLUSTER_NAME_PREFIX option in the configuration file. The reason is that we can run multiple clusters at the same time and the prefix allows us to differentiate between them.
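For example, a second configuration file with a different prefix (a hypothetical mycluster-2) would create a second, independent set of VMs:
# in mycluster-2 (hypothetical second configuration)
CLUSTER_NAME_PREFIX='local-hadoop-2'

# creating it does not interfere with the cluster above
CONFIG=mycluster-2 ./cluster.sh create-cluster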
Graphically, excluding the local-hadoop-consul node (since it is really just an internal docker swarm thing), we have the following:
We can also check this in the VirtualBox UI:
If we want to get a node's IP address, we can simply run:
$ docker-machine ip local-hadoop-controller
192.168.99.101
or connect using SSH in case we need to debug something:
$ docker-machine ssh local-hadoop-controller
## .
## ## ## ==
## ## ## ## ## ===
/"""""""""""""""""\___/ ===
~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ / ===- ~~~
\______ o __/
\ \ __/
\____\_______/
_ _ ____ _ _
| |__ ___ ___ | |_|___ \ __| | ___ ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__| < __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
Boot2Docker version 1.12.6, build HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017
Docker version 1.12.6, build 78d1802
docker@local-hadoop-controller:~$ exit
The big advantage of using docker-machine is that all of this works transparently across different drivers, so the same commands work regardless of whether the node is deployed in VirtualBox or on Amazon EC2.
Now the cluster is ready and we can start deploying docker containers.
The created cluster is ready to be provisioned with any Hadoop distribution. We start by deploying an Apache vanilla Hadoop distribution (version 2.7.1) and running some benchmarks. Then we will deploy a new Hadoop based on the same vanilla distribution but with some self-adaptive capabilities and compare the results.
The following command starts Hadoop on the cluster.
First, it builds the specified docker image (HADOOP_IMAGE from HADOOP_IMAGE_DIR) and then it starts the corresponding containers on the cluster nodes.
$ ./cluster.sh start-hadoop
... ... ...
... ... ...
Hadoop should be ready
To connect docker run: 'eval $(docker-machine env --swarm local-hadoop-controller)'
To connect a bash console to the cluster use: 'console' option
To connect to Graphite (WEB console visualizing collectd data), visit http://192.168.99.101
To connect to YARN ResourceManager WEB UI, visit http://192.168.99.101:8088
To connect to HDFS NameNode WEB UI, visit http://192.168.99.101:50070
To connect to YARN NodeManager WEB UI, visit:
http://192.168.99.102:8042 for compute-1
http://192.168.99.103:8042 for compute-2
To connect to HDFS DataNode WEB UI, visit:
http://192.168.99.102:50075 for compute-1
http://192.168.99.103:50075 for compute-2
If you plan to use YARN WEB UI more extensively, consider to add the following records to your /etc/hosts:
192.168.99.101 controller
192.168.99.102 compute-1
192.168.99.103 compute-2
To see what has happened, we can explore the running containers.
First, we need to tell the docker command-line client which docker node it should communicate with. The easiest way is to use the shell-init command.
$ ./cluster.sh shell-init
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.101:3376"
export DOCKER_CERT_PATH="/Users/krikava/.docker/machine/machines/local-hadoop-controller"
export DOCKER_MACHINE_NAME="local-hadoop-controller"
# Run this command to configure your shell:
# eval $(docker-machine env --swarm local-hadoop-controller)
For this to take effect, it has to be run inside an eval:
eval $(./cluster.sh shell-init)
Now we can get an overview of what is running in the docker cluster:
$ docker info
Containers: 8
Running: 8
Paused: 0
Stopped: 0
Images: 10
Server Version: swarm/1.2.5
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 3
local-hadoop-compute-1: 192.168.99.102:2376
└ ID: PBW3:TL6Z:COEV:NRMG:JQCH:7VES:GVHA:6BG6:NJ3U:AELV:MTLU:SHHV
└ Status: Healthy
└ Containers: 2 (2 Running, 0 Paused, 0 Stopped)
└ Reserved CPUs: 0 / 1
└ Reserved Memory: 0 B / 2.053 GiB
└ Labels: kernelversion=4.4.41-boot2docker, operatingsystem=Boot2Docker 1.12.6 (TCL 7.2); HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017, provider=virtualbox, storagedriver=aufs, type=compute
└ UpdatedAt: 2017-01-17T17:17:46Z
└ ServerVersion: 1.12.6
local-hadoop-compute-2: 192.168.99.103:2376
└ ID: WLCB:IH2R:ZQT6:VXWQ:RPHE:IUJL:2HEG:3EIX:Q7XZ:XUYO:5TWN:RVJT
└ Status: Healthy
└ Containers: 2 (2 Running, 0 Paused, 0 Stopped)
└ Reserved CPUs: 0 / 1
└ Reserved Memory: 0 B / 2.053 GiB
└ Labels: kernelversion=4.4.41-boot2docker, operatingsystem=Boot2Docker 1.12.6 (TCL 7.2); HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017, provider=virtualbox, storagedriver=aufs, type=compute
└ UpdatedAt: 2017-01-17T17:17:53Z
└ ServerVersion: 1.12.6
local-hadoop-controller: 192.168.99.101:2376
└ ID: SW6Y:AJ2G:LNMN:XYWN:ML4U:YDEX:FZ4L:6A6Q:RTYJ:HNJQ:QVKY:7DGY
└ Status: Healthy
└ Containers: 4 (4 Running, 0 Paused, 0 Stopped)
└ Reserved CPUs: 0 / 1
└ Reserved Memory: 0 B / 2.053 GiB
└ Labels: kernelversion=4.4.41-boot2docker, operatingsystem=Boot2Docker 1.12.6 (TCL 7.2); HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017, provider=virtualbox, storagedriver=aufs, type=controller
└ UpdatedAt: 2017-01-17T17:18:01Z
└ ServerVersion: 1.12.6
Plugins:
Volume:
Network:
Swarm:
NodeID:
Is Manager: false
Node Address:
Security Options:
Kernel Version: 4.4.41-boot2docker
Operating System: linux
Architecture: amd64
CPUs: 3
Total Memory: 6.159 GiB
Name: a214f4308980
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
WARNING: No kernel memory limit support
We can see the three nodes running 8 containers. To see which containers are running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f4bfb7e6a33 hadoop-benchmark/hadoop "/entrypoint.sh compu" 19 minutes ago Up 19 minutes 2122/tcp, 8020/tcp, 8030-8033/tcp, 8040/tcp, 8088/tcp, 9000/tcp, 19888/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 192.168.99.103:8042->8042/tcp, 50070/tcp, 50090/tcp, 192.168.99.103:50075->50075/tcp local-hadoop-compute-2/compute-2
ec42e2942a39 hadoop-benchmark/hadoop "/entrypoint.sh compu" 19 minutes ago Up 19 minutes 2122/tcp, 8020/tcp, 8030-8033/tcp, 8040/tcp, 8088/tcp, 9000/tcp, 19888/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 192.168.99.102:8042->8042/tcp, 50070/tcp, 50090/tcp, 192.168.99.102:50075->50075/tcp local-hadoop-compute-1/compute-1
061dbb4e2f0e hadoop-benchmark/hadoop "/entrypoint.sh contr" 20 minutes ago Up 20 minutes 2122/tcp, 8020/tcp, 8030-8033/tcp, 8040/tcp, 8042/tcp, 9000/tcp, 19888/tcp, 49707/tcp, 50010/tcp, 192.168.99.101:8088->8088/tcp, 50020/tcp, 50075/tcp, 50090/tcp, 192.168.99.101:50070->50070/tcp local-hadoop-controller/controller
3e5345d10b11 hopsoft/graphite-statsd "/sbin/my_init" 20 minutes ago Up 20 minutes 192.168.99.101:80->80/tcp, 2004/tcp, 192.168.99.101:2003->2003/tcp, 192.168.99.101:8126->8126/tcp, 2023-2024/tcp, 192.168.99.101:8125->8125/udp local-hadoop-controller/graphite
This only shows 4 containers as the listing omits the ones that participate in forming the docker swarm.
The following schema presents a graphical overview of what has been deployed:
Once the ./cluster.sh start-hadoop command finishes, it prints information on how to access the cluster:
Hadoop should be ready
To connect docker run: 'eval $(docker-machine env --swarm local-hadoop-controller)'
To connect a bash console to the cluster use: 'console' option
To connect to Graphite (WEB console visualizing collectd data), visit http://192.168.99.101
To connect to YARN ResourceManager WEB UI, visit http://192.168.99.101:8088
To connect to HDFS NameNode WEB UI, visit http://192.168.99.101:50070
To connect to YARN NodeManager WEB UI, visit:
http://192.168.99.102:8042 for compute-1
http://192.168.99.103:8042 for compute-2
To connect to HDFS DataNode WEB UI, visit:
http://192.168.99.102:50075 for compute-1
http://192.168.99.103:50075 for compute-2
If you plan to use YARN WEB UI more extensively, consider to add the following records to your /etc/hosts:
192.168.99.101 controller
192.168.99.102 compute-1
192.168.99.103 compute-2
If you plan to use these WEB UIs, it is recommended to temporarily alter your /etc/hosts file, since the UIs use hostnames instead of IPs and without the changes in the hosts file some of the links might not be reachable.
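One way to add these records (a sketch; it requires sudo and assumes the default machine names used in this tutorial) is to derive them from docker-machine itself:
# append the hostnames used by the WEB UIs to /etc/hosts
for node in controller compute-1 compute-2; do
  echo "$(docker-machine ip local-hadoop-$node) $node" | sudo tee -a /etc/hosts
done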
There are a number of available user interfaces:
- YARN ResourceManager UI
- Graphite - monitoring console dashboard (WEB console visualizing collectd data)
- HDFS NameNode UI
- compute-1 YARN NodeManager UI
- compute-2 YARN NodeManager UI
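These UIs can also be opened from the command line using the IP addresses reported by docker-machine; for example, on OSX (a sketch; use xdg-open on Linux):
# open the YARN ResourceManager UI of the controller node
open "http://$(docker-machine ip local-hadoop-controller):8088"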
The Hadoop cluster is fully functional and ready to be used. To test it, we can connect to it:
./cluster.sh console
This in turn starts a new container in the cluster with the same base image as all the other nodes.
We can access the other nodes in the cluster using their hostnames (i.e. controller, compute-1, compute-2):
root@hadoop-console:/# ping controller
PING controller (10.0.0.3) 56(84) bytes of data.
64 bytes from controller.hadoop-net (10.0.0.3): icmp_seq=1 ttl=64 time=0.114 ms
64 bytes from controller.hadoop-net (10.0.0.3): icmp_seq=2 ttl=64 time=0.055 ms
^C
--- controller ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.055/0.084/0.114/0.030 ms
root@hadoop-console:/# ping compute-1
PING compute-1 (10.0.0.4) 56(84) bytes of data.
64 bytes from compute-1.hadoop-net (10.0.0.4): icmp_seq=1 ttl=64 time=0.490 ms
64 bytes from compute-1.hadoop-net (10.0.0.4): icmp_seq=2 ttl=64 time=0.420 ms
^C
--- compute-1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.420/0.455/0.490/0.035 ms
root@hadoop-console:/# ping compute-2
PING compute-2 (10.0.0.5) 56(84) bytes of data.
64 bytes from compute-2.hadoop-net (10.0.0.5): icmp_seq=1 ttl=64 time=0.536 ms
64 bytes from compute-2.hadoop-net (10.0.0.5): icmp_seq=2 ttl=64 time=0.389 ms
^C
--- compute-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.389/0.462/0.536/0.076 ms
and run Hadoop commands:
- list the compute nodes in the cluster:
root@hadoop-console:/# yarn node -list
17/01/17 17:20:59 INFO client.RMProxy: Connecting to ResourceManager at controller/10.0.0.3:8032
Total Nodes:2
        Node-Id    Node-State  Node-Http-Address  Number-of-Running-Containers
compute-2:45454       RUNNING     compute-2:8042                             0
compute-1:45454       RUNNING     compute-1:8042                             0
- list the applications in the cluster:
root@hadoop-console:/# yarn application -list
17/01/17 17:21:09 INFO client.RMProxy: Connecting to ResourceManager at controller/10.0.0.3:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
Application-Id  Application-Name  Application-Type  User  Queue  State  Final-State  Progress  Tracking-URL
- get the report from HDFS:
root@hadoop-console:/# hdfs dfsadmin -report
Configured Capacity: 38390448128 (35.75 GB)
Present Capacity: 33287757824 (31.00 GB)
DFS Remaining: 33287708672 (31.00 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (2):

Name: 10.0.0.4:50010 (compute-1.hadoop-net)
Hostname: compute-1
Decommission Status : Normal
Configured Capacity: 19195224064 (17.88 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 2551345152 (2.38 GB)
DFS Remaining: 16643854336 (15.50 GB)
DFS Used%: 0.00%
DFS Remaining%: 86.71%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Jan 17 02:01:48 UTC 2017

Name: 10.0.0.5:50010 (compute-2.hadoop-net)
Hostname: compute-2
Decommission Status : Normal
Configured Capacity: 19195224064 (17.88 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 2551345152 (2.38 GB)
DFS Remaining: 16643854336 (15.50 GB)
DFS Used%: 0.00%
DFS Remaining%: 86.71%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Jan 17 02:01:50 UTC 2017
To quit the console, simply exit:
root@hadoop-console:/# exit
exit
The console is useful for checking the status of the cluster and for debugging.
You can think about it as logging into the cluster with all the Hadoop commands available.
There is also the possibility to execute commands directly on the controller using ./cluster.sh run-controller <command>. This will run the given command in the controller docker container.
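For example, the HDFS report shown above could also be obtained directly, without opening a console (a usage sketch of run-controller):
./cluster.sh run-controller hdfs dfsadmin -report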
Now we are ready to run some benchmarks.
We start with a simple one to check the cluster works.
A benchmark is simply a MapReduce job (or any other job running on YARN).
So far we have collected three main benchmarks in the benchmarks directory:
- Hadoop MapReduce Examples that contain the canonical Hadoop benchmarks such as wordcount or terasort (the complete list can be obtained from the console by running ./benchmarks/hadoop-mapreduce-examples/run.sh).
- Intel HiBench benchmarks.
- SWIM benchmarks (not shown in this tutorial).
A new benchmark can be added by creating a new docker image that contains all the corresponding files and a shell script that runs it. For details, please have a look at the documentation.
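As a very rough sketch only (the real benchmarks ship a docker image plus a run.sh; the names and paths below are hypothetical, see the documentation for the actual structure), the driver script of a new benchmark might boil down to submitting a job to the cluster:
#!/bin/bash
# hypothetical benchmarks/my-benchmark/run.sh
# submits a pre-packaged MapReduce job via the controller container;
# a real benchmark would also build and distribute its own docker image
./cluster.sh run-controller yarn jar /path/to/my-benchmark.jar "$@"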
We start with the Hadoop examples to make sure the cluster is operational. The following will run a map/reduce program that estimates Pi using a quasi-Monte Carlo method. We parameterize the job with 20 map tasks, each computing 1000 samples:
$ ./benchmarks/hadoop-mapreduce-examples/run.sh pi 20 1000
... ... ...
Number of Maps = 20
Samples per Map = 1000
... ... ...
At the end, the benchmark outputs the results:
... ... ...
Job Finished in 73.413 seconds
Estimated value of Pi is 3.14280000000000000000
The cluster seems to be operational so we can proceed to run a full benchmark. For this we will use Intel HiBench benchmarks. This includes:
- wordcount,
- sort,
- terasort,
- and sleep benchmarks.
By default, these benchmarks are very small. This allows one to quickly execute them to check that everything works and then to scale up. For more information on how to use HiBench in hadoop-benchmark, please check the guide.
The following command will first build the benchmark image and then start a new container on the local-hadoop-controller node that submits the jobs. At the end, it will print out the statistics and upload them to HDFS.
./benchmarks/hibench/run.sh
At the end the benchmark outputs the statistics:
Benchmarks finished
Type Date Time Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
WORDCOUNT 2017-01-17 17:37:35 2746 37.770 72 36
SORT 2017-01-17 17:38:56 2582 38.263 67 33
TERASORT 2017-01-17 17:40:05 30000 25.937 1156 578
SLEEP 2017-01-17 17:40:35 0 25.182 0 0
The report has been uploaded to HDFS: /hibench-20170114-2200.report
To download, run ./cluster.sh hdfs-download '/hibench-20170114-2200.report'
Running the benchmark only once will not give us a lot of confidence about the performance. Instead, we want to run it multiple times and use the measurements to build a confidence interval. The idea is that we run the benchmark n times, then download the reports from HDFS and run some analysis. All this can be done easily using hadoop-benchmark. This will take about 40 minutes to complete. The screencast is available on asciicast.
- Remove the existing report(s)
./cluster.sh hdfs dfs -rm "/hibench-*.report"
- Make sure no other jobs are running in the cluster
$ ./cluster.sh run-controller yarn application -list
... ... ...
... ... ...
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
- Run the benchmark 10 times
for i in $(seq 1 10); do ./benchmarks/hibench/run.sh; done
- Download all the files
./cluster.sh hdfs-download "/hibench-*.report"
- Move the files to a specific directory
$ mkdir -p results/hibench/vanilla
$ mv hibench* results/hibench/vanilla
$ ls -1 results/hibench/vanilla
hibench-20170114-1923.report
hibench-20170114-1927.report
hibench-20170114-1932.report
hibench-20170114-1936.report
hibench-20170114-1940.report
hibench-20170114-1944.report
hibench-20170114-1948.report
hibench-20170114-1952.report
hibench-20170114-1957.report
hibench-20170114-1901.report
- We have provided a sample R script to visualize the results of HiBench benchmarks.
$ ./benchmarks/hibench/analysis/hibench-report.R results/hibench
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
pdf
2
$ ls -1 *.pdf
hibench-duration.pdf
hibench-throughput.pdf
Benchmark durations (hibench-duration.pdf)
Benchmark throughputs (hibench-throughput.pdf)
We have run some benchmarks on a vanilla Hadoop distribution so we have a baseline to compare our self-adaptation against. Please note that this is by no means a rigorous way to run benchmarks and evaluate the performance of a computer system. Here we simplify the steps for the sake of brevity so that the complete tutorial can be completed within 30 minutes.
Essentially, we are going to redo the steps we have done earlier, this time with a different Hadoop distribution. We will use the one we published in:
Zhang et al., Self-Balancing Job Parallelism and Throughput in Hadoop, 16th International Conference on Distributed Applications and Interoperable Systems (DAIS), Jun 2016. PDF
It is based on the Apache vanilla Hadoop version 2.7.1 (the same as above) with a feedback control loop that dynamically reconfigures the Hadoop capacity scheduler to better balance job parallelism and throughput.
Since we have only one cluster, we need to stop the current Hadoop services so we can provision the new one that we want to test. The following command gracefully stops the currently running Hadoop services:
./cluster.sh stop-hadoop
No containers should be running after this point:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Now, we essentially repeat the steps we have issued earlier, only with a different CONFIG.
cp ./scenarios/self-balancing-example/local_cluster mycluster-sa
The files are almost the same except for the Hadoop image configuration:
HADOOP_IMAGE='hadoop-benchmark/self-balancing-example'
HADOOP_IMAGE_DIR='scenarios/self-balancing-example/image'
This custom image is based on the vanilla Hadoop 2.7.1 and additionally includes our Java-based feedback control loop that dynamically adjusts the runtime settings of the capacity scheduler. To see how to create such an image, please see the guide.
Update the number of compute nodes to 2
NUM_COMPUTE_NODES=2
export CONFIG=mycluster-sa
./cluster.sh start-hadoop
Again, at the end, the command should output connection settings:
Hadoop should be ready
To connect docker run: 'eval $(docker-machine env --swarm local-hadoop-controller)'
To connect a bash console to the cluster use: 'console' option
To connect to Graphite (WEB console visualizing collectd data), visit http://192.168.99.101
To connect to YARN ResourceManager WEB UI, visit http://192.168.99.101:8088
To connect to HDFS NameNode WEB UI, visit http://192.168.99.101:50070
To connect to YARN NodeManager WEB UI, visit:
http://192.168.99.102:8042 for compute-1
http://192.168.99.103:8042 for compute-2
To connect to HDFS DataNode WEB UI, visit:
http://192.168.99.102:50075 for compute-1
http://192.168.99.103:50075 for compute-2
If you plan to use YARN WEB UI more extensively, consider to add the following records to your /etc/hosts:
192.168.99.101 controller
192.168.99.102 compute-1
192.168.99.103 compute-2
Since the number of nodes is the same, the addresses should be the very same as before.
./benchmarks/hadoop-mapreduce-examples/run.sh pi 20 1000
... ... ...
... ... ...
Job Finished in 79.147 seconds
Estimated value of Pi is 3.14280000000000000000
./benchmarks/hibench/run.sh
At the end the benchmark outputs the statistics:
Benchmarks finished
Type Date Time Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
WORDCOUNT 2017-01-17 18:21:56 2028 37.171 48 24
SORT 2017-01-17 18:23:08 2543 37.044 71 35
TERASORT 2017-01-17 18:24:24 30000 25.952 1155 577
SLEEP 2017-01-17 18:24:55 0 26.140 0 0
The report has been uploaded to HDFS: /hibench-20170117-1824.report
To download, run ./cluster.sh hdfs-download "/hibench-20170117-1824.report"
Again, we should run the benchmark multiple times so we can get better confidence in the results. The steps are pretty much the same as in the first case. This will take about 40 minutes to complete. The screencast is available on asciicast.
- Remove the existing report(s)
./cluster.sh hdfs dfs -rm "/hibench-*.report"
- Make sure no other jobs are running in the cluster
$ ./cluster.sh run-controller yarn application -list
... ... ...
... ... ...
Total jobs:0
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
- Run the benchmark 10 times
for i in $(seq 1 10); do ./benchmarks/hibench/run.sh; done
- Download all the files
./cluster.sh hdfs-download "/hibench-*.report"
- Move the files to a specific directory
$ mkdir results/hibench/sa
$ mv hibench* results/hibench/sa
$ ls -1 results/hibench/sa
hibench-20170117-2023.report
hibench-20170117-2027.report
hibench-20170117-2030.report
hibench-20170117-2034.report
hibench-20170117-2038.report
hibench-20170117-2042.report
hibench-20170117-2046.report
hibench-20170117-2050.report
hibench-20170117-2054.report
hibench-20170117-2058.report
- The provided script now allows us to show the difference.
./benchmarks/hibench/analysis/hibench-report.R results/hibench
It will create a new dataset for each subdirectory of the directory given as argument. It will then compute bootstrap confidence intervals and visualize them for both the duration and the throughput.
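In this tutorial the argument is results/hibench, which at this point contains one subdirectory per Hadoop variant (as created in the steps above):
$ ls -1 results/hibench
sa
vanilla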
Benchmark durations (hibench-duration.pdf)
Benchmark throughputs (hibench-throughput.pdf)
We can see that with the feedback control loop the benchmarks run slightly faster. However, the difference is very small. The reason for this is that the nodes in the cluster are very small. They have only 2GB of RAM, which in YARN translates into the possibility of running only two containers. The feedback controller is designed to balance the number of containers between application masters (in the MapReduce case, the process responsible for coordinating the complete MapReduce job) and individual application tasks (in the MapReduce case, the individual map and reduce tasks). For more details about how this particular feedback control loop works and results at a larger scale, please refer to the paper.
First, we stop the Hadoop services:
./cluster.sh stop-hadoop
and then we can stop the virtual machines:
./cluster.sh stop-cluster
If you want to free all the space taken, you can also destroy the cluster, which will remove all the created VMs:
./cluster.sh destroy-cluster
Running a distributed system is never easy.
Even though hadoop-benchmark tries to simplify all the steps, one may encounter some problems.
The ./cluster.sh commands are meant to be as idempotent as possible, so running something twice should produce the same result.
If something does not work as expected, it is a good idea to retry the command.
If that does not work, one can try to undo the command if possible (stopping Hadoop and then starting it again, stopping the cluster and starting it again).
If that does not work either, one can always start fresh by running ./cluster.sh destroy-cluster.
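As a sketch, a typical escalation using only the commands introduced in this tutorial (assuming CONFIG is still exported) might be:
# 1. try to restart the Hadoop services
./cluster.sh stop-hadoop
./cluster.sh start-hadoop

# 2. as a last resort, rebuild the whole cluster
./cluster.sh destroy-cluster
./cluster.sh create-cluster
./cluster.sh start-hadoop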
Finally, if the problem persists, please submit an issue on GitHub.