
Deployment


Simple Docker container

The simple Docker image has already been presented in the getting started section. It allows storing data in Apache Parquet files, but makes no use of other storage engines or of distribution. To start the simple deployment, run

docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 -d vitrivr/adampro:latest

where port 5890 is the port applications use to connect to ADAMpro, port 9099 serves the ADAMpro web UI, and port 4040 serves the Spark web UI.
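
Once the container is running, a quick sanity check (using standard Docker commands; the URLs assume the port mappings above) is to follow the logs and open the web UIs:

# follow the startup logs of the container
docker logs -f adampro

# then open http://localhost:9099 (ADAMpro web UI) and http://localhost:4040 (Spark web UI)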

Our Docker containers come with an update script located at /adampro/update.sh, which allows you to check out the newest version of the code from the repository and re-build the jars without creating a new container (and therefore without losing existing data). To run the update routine, run the following on your host system:

docker exec adampro /adampro/update.sh

Note that the Docker container makes use of a number of environment variables which can be adjusted, e.g., for better performance. In particular, note ADAMPRO_MEMORY, which is set to 2g by default. A few other environment variables, e.g., ADAMPRO_START_WEBUI, denote whether certain parts of ADAMpro (in this case the web UI) should be started or not.
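
These variables can be overridden when starting the container using Docker's -e flag; a sketch (the variable names are taken from above, the values are merely illustrative):

# start ADAMpro with more memory and without the web UI
docker run --name adampro -p 5890:5890 -p 4040:4040 \
  -e "ADAMPRO_MEMORY=4g" \
  -e "ADAMPRO_START_WEBUI=false" \
  -d vitrivr/adampro:latest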

Self-contained Docker container

We have presented how to start a minimal Docker container. For self-contained containers which come with PostgreSQL and Solr within the container, you may use

docker run --name adampro -p 4040:4040 -p 5890:5890 -p 9099:9099 -p 5432:5432 -p 8983:8983 -d vitrivr/adampro:latest-selfcontained

where port 4040 serves the Spark web UI and port 9099 the ADAMpro web UI. Port 5890 allows connecting to ADAMpro via the gRPC interface. Port 5432 exposes the PostgreSQL instance and port 8983 the Solr instance.

For demo purposes, this container can be filled with data. We provide the OSVC data for download. Copy the archive into /adampro/data, untar it there, and then restart the container:

docker cp osvc.tar.gz adampro:/adampro/data
docker exec adampro tar -C /adampro/data/ -xzf /adampro/data/osvc.tar.gz 
docker restart adampro
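
After the restart, you can verify that the data has been extracted by listing the data folder (paths as above):

docker exec adampro ls /adampro/data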

Note that you may want to adjust the number of workers (set ADAMPRO_MASTER to something like local[X] where X denotes the number of workers) and the memory used by Apache Spark (set ADAMPRO_MEMORY).
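
As an illustrative sketch, a self-contained container with four local workers and more memory could be started as follows (the values are examples, not recommendations):

docker run --name adampro -p 4040:4040 -p 5890:5890 -p 9099:9099 -p 5432:5432 -p 8983:8983 \
  -e "ADAMPRO_MASTER=local[4]" \
  -e "ADAMPRO_MEMORY=8g" \
  -d vitrivr/adampro:latest-selfcontained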

Native deployment

Clone the repository from GitHub. Note that the folder grpc is a Git submodule. Hence, you will have to run the following to clone the repository:

git clone --recursive https://github.com/vitrivr/ADAMpro.git 
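
If you have already cloned the repository without --recursive, the submodule can be fetched afterwards with the standard Git command:

git submodule update --init --recursive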

ADAMpro can be built using sbt. We provide various sbt tasks to simplify deployment and development.

  • assembly creates a fat jar with ADAMpro to submit to Spark; run sbt proto first
  • proto generates a jar file from the grpc folder and includes it in the main project (this is necessary because the netty dependency has to be shaded)

Other helpful sbt tasks include

  • dependencyTree displays the dependencies on other packages
  • stats shows code statistics

Because of its project structure, to build ADAMpro you first have to run

sbt proto

in the main folder; this generates the proto files and creates a jar file containing the proto sources in the ./lib/ folder.

Running

sbt assembly

(and sbt web/assembly for the ADAMpro UI) creates a jar file which can then be submitted to Apache Spark using

./spark-submit --master "local[4]" --driver-memory 2g --executor-memory 2g --class org.vitrivr.adampro.main.Startup $ADAM_HOME/ADAMpro-assembly-0.1.0.jar
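
Here, $ADAM_HOME is assumed to point to the folder containing the assembly jar; a hypothetical example (the path is illustrative):

export ADAM_HOME=/opt/adampro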

ADAMpro can also be started locally, e.g., from an IDE. For this, remove the % "provided" statements and the marked line ExclusionRule("io.netty") from build.sbt, and run the main class org.vitrivr.adampro.main.Startup. You can use

sbt run

to run ADAMpro as well. Note that the storage engines specified in the configuration have to be running already, or you have to adjust the configuration file accordingly.

Deployment using docker-compose

We provide a number of docker-compose scripts which can be used to set up the full environment. Check out the ADAMpro repository, go to the folder scripts/docker, and run

docker-compose up

This will start up a master and a single worker node, together with separate containers for PostgreSQL, Apache Cassandra and Apache Solr, all able to communicate with each other. To add more workers (note that the number of masters is limited to 1), run the scale command and specify the number of workers you would like to deploy in total:

docker-compose scale worker=5

Note that this setup will not use Hadoop to create an HDFS, but will rather just mount a folder into all Docker containers (both the master and the worker containers). Therefore, this deployment will only work if all containers run on one single machine.
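
The usual docker-compose commands can be used to inspect the running setup (run from the scripts/docker folder):

# list all services and their state
docker-compose ps
# follow the logs of all containers
docker-compose logs -f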

Virtually distributed deployment using HDFS with docker-compose

ADAMpro can be started in a virtually distributed setup using docker-compose. The folder scripts/docker-hdfs contains a docker-compose.yml; move into the docker-hdfs folder and run:

docker-compose up

This will start up a master and a single worker node. Note that using the scale command of docker-compose you may create multiple workers; however, the number of master nodes (and Hadoop name nodes) is limited to 1.
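
For example, to run three workers in total (analogous to the setup above):

docker-compose scale worker=3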

Distributed deployment using HDFS with Docker

ADAMpro can be started in a truly distributed setup with independent nodes connected via network. On the master node run

export ENV_ADAMPRO_MASTER_HOSTNAME=$HOSTNAME 
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSMaster --network=host -p 2122:2122 -p 4040:4040 -p 5005:5005 -p 5432:5432 -p 5890:5890  -p 6066:6066 -p 7001:7001 -p 7002:7002 -p 7003:7003 -p 7004:7004 -p 7005:7005 -p 7006:7006 -p 7077:7077 -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8080:8080 -p 8088:8088 -p 8983:8983  -p 9000:9000 -p 9099:9099 -p 19888:19888 -p 38000:38000 -p 39000:39000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 -e "ADAMPRO_DRIVER_MEMORY=$ENV_ADAMPRO_MEMORY" -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077"  -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME"  -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY"  -e "SPARK_PUBLIC_DNS=localhost" --entrypoint="/adampro/bootstrap.sh" -d vitrivr/adampro:latest-hdfs  -d --masternode

where you should set the amount of memory to use. On the worker nodes run

export ENV_ADAMPRO_MASTER_HOSTNAME= #set host name of master host
export ENV_ADAMPRO_MASTER_IP= #set IP address of master host
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSWorker --network=host -p 2122:2122 -p 7012:7012 -p 7013:7013 -p 7014:7014 -p 7015:7015 -p 7016:7016 -p 8020:8020  -p 8030:8030  -p 8031:8031 -p 8032:8032 -p 8081:8081 -p 8881:8881 -p 9000:9000 -p 38000:38000 -p 39000:39000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077"  -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY"  -e "SPARK_WORKER_INSTANCES=1" --add-host $ENV_ADAMPRO_MASTER_HOSTNAME:$ENV_ADAMPRO_MASTER_IP --entrypoint="/adampro/bootstrap.sh" -d vitrivr/adampro:latest-hdfs -d --workernode

where you should set the hostname and the IP address of the master node, and the amount of memory to use.
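
For illustration, the exports on a worker node might look as follows (hostname, IP address and memory amount are hypothetical values to be replaced for your environment):

export ENV_ADAMPRO_MASTER_HOSTNAME=adampro-master
export ENV_ADAMPRO_MASTER_IP=192.168.1.10
export ENV_ADAMPRO_MEMORY=8g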

To open the web UI of ADAMpro, go to the master node; port 9099 will, as usual, serve the web UI. Similarly, to connect to ADAMpro via gRPC, connect to the master node.

Deployment using Docker swarm

Consider the official documentation and this unofficial documentation for more information on how to use the images with Docker swarm and how to set up a Docker swarm cluster.