1. Deployment
The simple Docker image has been presented in the getting started section already. It allows storing data in Apache Parquet files, but makes no use of other storage engines or of distribution. To start the simple deployment, run
docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 -d vitrivr/adampro:latest
where port 5890 is the port applications use to connect to ADAMpro, port 9099 serves the ADAMpro web UI, and port 4040 serves the Spark web UI.
Our Docker containers come with an update script located at /adampro/update.sh, which allows you to check out the newest version of the code from the repository and re-build the jars without creating a new container (and therefore losing existing data). To run the update routine, run on your host system:
docker exec adampro /adampro/update.sh
Note that the Docker container makes use of a number of environment variables which can be adjusted, e.g., for better performance. Note in particular ADAMPRO_MEMORY, which is set to 2g. A few other environment variables, e.g., ADAMPRO_START_WEBUI, control whether certain parts of ADAMpro (in this case the web UI) are started or not.
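For instance, assuming ADAMPRO_START_WEBUI accepts a boolean-like value, a container with more memory and without the web UI could be created as follows (the values are illustrative):
docker run --name adampro -p 5890:5890 -p 4040:4040 -e "ADAMPRO_MEMORY=4g" -e "ADAMPRO_START_WEBUI=false" -d vitrivr/adampro:latest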
We have presented how to start a minimal Docker container. For self-contained containers which come with PostgreSQL and Solr within the container, you may use
docker run --name adampro -p 4040:4040 -p 5890:5890 -p 9099:9099 -p 5432:5432 -p 8983:8983 -d vitrivr/adampro:latest-selfcontained
where port 4040 serves the Spark web UI and port 9099 the ADAMpro web UI. Port 5890 allows connecting to ADAMpro via the gRPC interface. Port 5432 exposes the PostgreSQL instance and port 8983 the Solr instance.
For demo purposes, this container can be filled with data. We provide the OSVC data for download. Copy the archive to /adampro/data inside the container, untar it there, and restart the container:
docker cp osvc.tar.gz adampro:/adampro/data
docker exec adampro tar -C /adampro/data/ -xzf /adampro/data/osvc.tar.gz
docker restart adampro
Note that you may want to adjust the number of workers (set ADAMPRO_MASTER to something like local[X], where X denotes the number of workers) and the memory used by Apache Spark (set ADAMPRO_MEMORY).
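For example, to run the self-contained container with four local workers and 4g of Spark memory (illustrative values):
docker run --name adampro -p 4040:4040 -p 5890:5890 -p 9099:9099 -p 5432:5432 -p 8983:8983 -e "ADAMPRO_MASTER=local[4]" -e "ADAMPRO_MEMORY=4g" -d vitrivr/adampro:latest-selfcontained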
Clone the repository from GitHub. Note that the folder grpc is a submodule; hence, you will have to run the following for cloning the repository:
git clone --recursive https://github.com/vitrivr/ADAMpro.git
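If the repository was already cloned without the --recursive flag, the submodule can be fetched afterwards with a standard git command:
git submodule update --init --recursive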
ADAMpro can be built using sbt. We provide various sbt tasks to simplify deployment and development:
- assembly creates a fat jar with ADAMpro to submit to Spark; run sbt proto first
- proto generates a jar file from the grpc folder and includes it in the main project (this is necessary, as the netty dependency must be shadowed)

Other helpful sbt tasks include:

- dependencyTree displays the dependencies on other packages
- stats shows code statistics
Because of its project structure, for building ADAMpro you have to first run sbt proto in the main folder, which generates the proto files and places a jar file containing the proto sources into the ./lib/ folder.
Running sbt assembly (and sbt web/assembly for the ADAMpro UI) creates a jar file which can then be submitted to Apache Spark using
./spark-submit --master "local[4]" --driver-memory 2g --executor-memory 2g --class org.vitrivr.adampro.main.Startup $ADAM_HOME/ADAMpro-assembly-0.1.0.jar
ADAMpro can also be started locally, e.g., from an IDE. For this, remove the % "provided" statements from build.sbt as well as the marked line ExclusionRule("io.netty"), and run the main class org.vitrivr.adampro.main.Startup. You can use sbt run for running ADAMpro as well. Note that the storage engines specified in the configuration have to be running already, or you have to adjust the config file accordingly.
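For instance, if the configuration expects a local PostgreSQL instance, one can be started with Docker before launching ADAMpro (a sketch using the official postgres image; port and credentials are illustrative and must match your configuration):
docker run --name adampro-postgres -p 5432:5432 -e "POSTGRES_PASSWORD=adampro" -d postgres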
We provide a number of docker-compose scripts which can be used to set up the full environment. Check out the ADAMpro repository, go to the folder scripts/docker, and run
docker-compose up
This will start up a master and a single worker node, plus separate containers for PostgreSQL, Apache Cassandra and Apache Solr, all able to communicate with each other. To add more workers (note that the number of masters is limited to 1), run the scale command and specify the number of workers you would like to deploy in total:
docker-compose scale worker=5
Note that this setup will not use Hadoop for creating an HDFS, but will rather just mount a folder to all Docker containers (both master and worker containers). Therefore, this deployment will only work if all containers run on a single machine.
ADAMpro can be started in a virtually distributed setup using docker-compose. The folder scripts/docker-hdfs contains a docker-compose.yml; move into the docker-hdfs folder and run:
docker-compose up
This will start up a master and a single worker node. Note that using the scale command of docker-compose you may create multiple workers; however, the number of master nodes (and Hadoop name nodes) is limited to 1.
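For example, assuming the worker service is named worker as in the scripts/docker setup, three workers in total can be started with:
docker-compose scale worker=3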
ADAMpro can be started in a truly distributed setup with independent nodes connected via a network. On the master node, run
export ENV_ADAMPRO_MASTER_HOSTNAME=$HOSTNAME
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSMaster --network=host -p 2122:2122 -p 4040:4040 -p 5005:5005 -p 5432:5432 -p 5890:5890 -p 6066:6066 -p 7001:7001 -p 7002:7002 -p 7003:7003 -p 7004:7004 -p 7005:7005 -p 7006:7006 -p 7077:7077 -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8080:8080 -p 8088:8088 -p 8983:8983 -p 9000:9000 -p 9099:9099 -p 19888:19888 -p 38000:38000 -p 39000:39000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 -e "ADAMPRO_DRIVER_MEMORY=$ENV_ADAMPRO_MEMORY" -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077" -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY" -e "SPARK_PUBLIC_DNS=localhost" --entrypoint="/adampro/bootstrap.sh" -d vitrivr/adampro:latest-hdfs -d --masternode
where you should set the amount of memory to use. On the worker nodes run
export ENV_ADAMPRO_MASTER_HOSTNAME= #set host name of master host
export ENV_ADAMPRO_MASTER_IP= #set IP address of master host
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSWorker --network=host -p 2122:2122 -p 7012:7012 -p 7013:7013 -p 7014:7014 -p 7015:7015 -p 7016:7016 -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8081:8081 -p 8881:8881 -p 9000:9000 -p 38000:38000 -p 39000:39000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077" -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY" -e "SPARK_WORKER_INSTANCES=1" --add-host $ENV_ADAMPRO_MASTER_HOSTNAME:$ENV_ADAMPRO_MASTER_IP --entrypoint="/adampro/bootstrap.sh" -d vitrivr/adampro:latest-hdfs -d --workernode
where you should set the hostname and the IP address of the master node, and the amount of memory to use.
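To verify that the workers have registered with the master, you can, for example, query the Spark master web UI, which the master container above publishes on port 8080 (the returned page lists all registered workers):
curl http://$ENV_ADAMPRO_MASTER_HOSTNAME:8080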
To open the web UI of ADAMpro, go to the master node; as usual, port 9099 serves the web UI. Similarly, for connecting to ADAMpro via gRPC, connect to the master node.
Consult the official documentation and this unofficial documentation for more information on how to use the images with Docker Swarm and how to set up a Docker Swarm cluster.