Home
ADAMpro is the persistent polystore (based on Apache Spark) for all data required for retrieval.
In the following, we present how to easily set up and run ADAMpro. For this, we make use of the Docker image released on Docker Hub.
Pull the image using
docker pull vitrivr/adampro:latest
or run it directly using (with the recommended ports opened):
docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 -d vitrivr/adampro:latest
where port 5890 is the port on which applications connect to ADAMpro, port 9099 serves the ADAMpro web UI, and port 4040 is for the Spark web UI.
After the container has been created, you can navigate to
http://localhost:9099
to open the ADAMpro UI (and to http://localhost:4040 for the Spark UI). Furthermore, you can connect on port 5890 to make use of the database.
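As a sketch of what such a connection might look like programmatically: the following only opens a grpc channel to the container; the concrete service stubs are generated from the proto file in the grpc sub-repository and are not shown here.

```scala
import io.grpc.ManagedChannelBuilder

// Minimal sketch: open a grpc channel to a locally running ADAMpro container.
// Creating the generated service stubs is omitted.
object AdamProConnectExample {
  def main(args: Array[String]): Unit = {
    val channel = ManagedChannelBuilder
      .forAddress("localhost", 5890)
      .usePlaintext() // older grpc-java versions use usePlaintext(true)
      .build()

    // ... create the service stubs generated from the proto definitions here ...

    channel.shutdown()
  }
}
```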
A container created with the aforementioned options stores its data in Apache Parquet files. Other Docker tags are available for containers which, e.g., come with PostgreSQL and Apache Solr installed, or which use HDFS.
ADAMpro can be configured using a configuration file. This repository contains a ./conf/ folder with configuration files.
- application.conf is used when running ADAMpro from an IDE
- assembly.conf is the configuration file included in the assembly jar (when running sbt assembly)
When starting ADAMpro, you can provide an adampro.conf file in the same path as the jar, which is then used instead of the default configuration. (Note the file adampro.conf.template, which is used as a template for the Docker container.)
The configuration file specifies the settings for running ADAMpro. The file ADAMConfig.scala reads the configuration file and provides the settings to the application.
The file contains information on
- the path to all the internal files (catalog, etc.), e.g., internalsPath = "/adampro/internals"
- the grpc port, e.g., grpc {port = "5890"}
- the storage engines to use, e.g., engines = ["parquet", "index", "postgres", "postgis", "cassandra", "solr"]
For every storage engine specified, more details have to be provided in the storage section (note that the name specified in engines must match the name in the storage section):
parquet {
engine = "ParquetEngine"
hadoop = true
basepath = "hdfs://spark:9000/"
datapath = "/adampro/data/"
}
or
parquet {
engine = "ParquetEngine"
hadoop = false
path = "~/adampro-tmp/data/"
}
The parameters specified here are passed directly to the storage engines; it may make sense to consult the code of the respective storage engine to see which parameters are necessary (or to consider the example configuration files in the configuration folder). The name of the class is specified in the field engine.
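Putting these pieces together, a minimal adampro.conf might look as follows. This is a sketch only; the exact nesting of the sections is best verified against ADAMConfig.scala and the example files in the ./conf/ folder.

```
# minimal sketch of an adampro.conf; verify the nesting against ADAMConfig.scala
internalsPath = "/adampro/internals"
grpc {port = "5890"}
engines = ["parquet"]

storage {
  parquet {
    engine = "ParquetEngine"
    hadoop = false
    path = "~/adampro-tmp/data/"
  }
}
```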
ADAMpro builds on Apache Spark 2 and uses a large variety of libraries and packages, e.g., Google Protocol Buffers and grpc. The repository has the following structure:
- conf: folder for configuration files; note that the conf folder is automatically included in the resources
- grpc: the proto file (included from the proto sub-repository)
- grpcclient: general grpc client code for communicating with the grpc server
- scripts: useful scripts for deploying and running ADAMpro
- src: ADAMpro sources
- web: web UI of ADAMpro
ADAMpro comes with a set of unit tests which can be run from the test package. Note that a certain setup is necessary for all tests to pass. For instance, for the PostGIS test to pass, the database has to be set up and it must be configured in the configuration file. You may use the script setupLocalUnitTests.sh to set up all the necessary Docker containers before performing the unit tests.
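Assuming the script resides in the scripts folder of the repository (adjust the path to your checkout), a typical sequence might look as follows:

```
# set up the Docker containers required by the tests
./scripts/setupLocalUnitTests.sh
# then run the unit tests
sbt test
```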
We recommend the use of IntelliJ IDEA for developing ADAMpro. It can be run locally using the run commands in the IDE for debugging purposes.
Note that the behaviour of ADAMpro, when run locally, might differ from when it is submitted to Apache Spark (using ./spark-submit), in particular because of the inclusion of different package versions (e.g., Apache Spark will come with a certain version of netty, which is used even if build.sbt includes a newer version; we refrain from using the spark.driver.userClassPathFirst option as it is experimental).
ADAMpro can be debugged even when submitted to Spark. By setting the debugging option in the SPARK_SUBMIT_OPTS environment variable before submitting, a remote debugger can be attached:
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Here, we have opened port 5005 and, given the suspend option, have the application wait until a debugger attaches.
The Docker containers we provide have the SPARK_SUBMIT_OPTS options set and use port 5005 for debugging (note, however, that the suspend option, which makes the application wait until a debugger attaches, is turned off in the Docker containers).
In your IDE, bind to the application by setting up remote debugging on the specified port. For more information on how to use remote debugging, consider, e.g., this article: https://community.hortonworks.com/articles/15030/spark-remote-debugging.html
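A possible end-to-end sequence, reusing the spark-submit invocation shown in the deployment section below, might look like this (a sketch; paths and memory settings depend on your setup):

```
# make the JVM listen on port 5005 and wait until a debugger attaches
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
# submit ADAMpro as usual; the JVM now waits for the remote debugger
./spark-submit --master "local[4]" --driver-memory 2g --executor-memory 2g --class org.vitrivr.adampro.main.Startup $ADAM_HOME/ADAMpro-assembly-0.1.0.jar
# finally, attach a remote debugger to localhost:5005 from the IDE
```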
For checking the performance of ADAMpro, also consider creating flame graphs. For more information see here.
For introductory information see the getting started section in this documentation.
The simple Docker image has already been presented in the getting started section. It stores data in Apache Parquet files, but makes no use of other storage engines or of distribution. To start the simple deployment, run
docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 -d vitrivr/adampro:latest
where port 5890 is the port on which applications connect to ADAMpro, port 9099 serves the ADAMpro web UI, and port 4040 is for the Spark web UI.
Our Docker containers come with an update script located at /adampro/update.sh, which allows you to check out the newest version of the code from the repository and re-build the jars without creating a new container (and therefore without losing existing data). To run the update routine, run in your host system:
docker exec adampro /adampro/update.sh
Note that the Docker container makes use of a number of environment variables which can be adjusted, e.g., for better performance. Note in particular ADAMPRO_MEMORY, which is set to 2g. A few other environment variables, e.g., ADAMPRO_START_WEBUI, can be used to denote whether certain parts of ADAMpro (in this case the web UI) should be started or not.
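For instance, to give ADAMpro more memory and skip starting the web UI, the variables might be overridden when creating the container. This is a sketch only; the exact set of supported variables and their value formats is defined by the container scripts.

```
docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 \
  -e ADAMPRO_MEMORY=4g -e ADAMPRO_START_WEBUI=false \
  -d vitrivr/adampro:latest
```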
We have presented how to start a minimal Docker container. For self-contained containers which come with PostgreSQL and Solr within the container, you may use
docker run --name adampro -p 4040:4040 -p 5890:5890 -p 9099:9099 -d vitrivr/adampro:latest-selfcontained
where port 4040 serves the Spark web UI and port 9099 the ADAMpro UI. Port 5890 allows connecting to ADAMpro via the grpc interface.
For demo purposes, this container can be filled with data. We provide the OSVC data for download. Copy the archive to /adampro/data, untar it there, and restart the container:
docker cp osvc.tar.gz adampro:/adampro/data
docker exec adampro tar -C /adampro/data/ -xzf /adampro/data/osvc.tar.gz
docker restart adampro
ADAMpro can be built using sbt. We provide various sbt tasks to simplify deployment and development.
- assembly: creates a fat jar with ADAMpro to submit to Spark; runs sbt proto first
- proto: generates a jar file from the grpc folder and includes it in the main project (this is necessary, as the netty dependency needs to be shaded)
Other helpful sbt tasks include
- dependencyTree: displays the dependencies on other packages
- stats: shows code statistics
Because of its project structure, you first have to run
sbt proto
in the main folder to build ADAMpro; this generates the proto files and creates a jar file containing the proto sources in the ./lib/ folder.
Running
sbt assembly
(and sbt web/assembly for the ADAMpro UI) creates a jar file, which can then be submitted to Apache Spark using
./spark-submit --master "local[4]" --driver-memory 2g --executor-memory 2g --class org.vitrivr.adampro.main.Startup $ADAM_HOME/ADAMpro-assembly-0.1.0.jar
ADAMpro can also be started locally, e.g., from an IDE. For this, remove the % "provided" statements from build.sbt and the marked line ExclusionRule("io.netty"), and run the main class org.vitrivr.adampro.main.Startup. You can also use
sbt run
for running ADAMpro. Note that the storage engines specified in the configuration have to be running already, or you have to adjust the config file accordingly.
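For illustration, a dependency line in build.sbt would change roughly as follows. This is a sketch: the concrete dependency names and the version variable are assumptions here and should be taken from the project's actual build.sbt.

```scala
// before: Spark is provided by the cluster at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
// after: Spark is included, so ADAMpro can run directly from the IDE or via sbt run
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
```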
We provide a number of docker-compose scripts which can be used to set up the full environment. Check out the ADAMpro repository, go to the folder scripts/docker, and run
docker-compose up
This will start up a master and a single worker node, plus separate containers for PostgreSQL, Apache Cassandra and Apache Solr, all able to communicate with each other. To add more workers (note that the number of masters is limited to 1), run the scale command and specify the total number of workers you would like to deploy:
docker-compose scale worker=5
Note that this setup will not use Hadoop for creating an HDFS, but will rather just mount a folder into all Docker containers (both master and worker containers). Therefore, this deployment will only work if all containers run on a single machine.
ADAMpro can be started in a virtually distributed setup using docker-compose. The folder scripts/docker-hdfs contains a docker-compose.yml; move into the docker-hdfs folder and run:
docker-compose up
This will start up a master and a single worker node. Note that using the scale command of docker-compose you may create multiple workers; however, the number of master nodes (and Hadoop name nodes) is limited to 1.
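For example, to run three workers in total (mirroring the scale command shown above):

```
docker-compose scale worker=3
```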
ADAMpro can be started in a truly distributed setup with independent nodes connected via network. On the master node run
export ENV_ADAMPRO_MASTER_HOSTNAME=$HOSTNAME
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSMaster --network=host -p 2122:2122 -p 4040:4040 -p 5005:5005 -p 5432:5432 -p 5890:5890 -p 6066:6066 -p 7001:7001 -p 7002:7002 -p 7003:7003 -p 7004:7004 -p 7005:7005 -p 7006:7006 -p 7077:7077 -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8080:8080 -p 8088:8088 -p 8983:8983 -p 9000:9000 -p 9099:9099 -p 19888:19888 -p 38000:38000 -p 39000:39000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 -e "ADAMPRO_DRIVER_MEMORY=$ENV_ADAMPRO_MEMORY" -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077" -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY" -e "SPARK_PUBLIC_DNS=localhost" --entrypoint="/adampro/bootstrap.sh" -d vitrivr/adampro:latest-hdfs -d --masternode
where you should set the amount of memory to use. On the worker nodes run
export ENV_ADAMPRO_MASTER_HOSTNAME= #set host name of master host
export ENV_ADAMPRO_MASTER_IP= #set IP address of master host
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSWorker --network=host -p 2122:2122 -p 7012:7012 -p 7013:7013 -p 7014:7014 -p 7015:7015 -p 7016:7016 -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8081:8081 -p 8881:8881 -p 9000:9000 -p 38000:38000 -p 39000:39000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077" -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY" -e "SPARK_WORKER_INSTANCES=1" --add-host $ENV_ADAMPRO_MASTER_HOSTNAME:$ENV_ADAMPRO_MASTER_IP --entrypoint="/adampro/bootstrap.sh" -d vitrivr/adampro:latest-hdfs -d --workernode
where you should set the hostname and the IP address of the master node, and the amount of memory to use.
To open the web UI of ADAMpro, go to the master node; as usual, port 9099 serves the web UI. Similarly, to connect to ADAMpro via grpc, connect to the master node.
Consider the official documentation and this unofficial documentation for more information on how to use the images with Docker swarm and how to set up a Docker swarm cluster.