Home
ADAMpro is the persistent polystore (based on Apache Spark) for all data required for retrieval.
In the following, we present how to easily set up and run ADAMpro. For this, we make use of the Docker image released on Docker Hub.
Pull the image using
docker pull vitrivr/adampro:latest
or run it directly using (with the recommended ports opened):
docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 -d vitrivr/adampro:latest
where port 5890 is the port on which applications connect to ADAMpro, port 9099 serves the ADAMpro web UI, and port 4040 is for the Spark web UI.
After the container has been created, you can navigate to
http://localhost:9099
to open the ADAMpro UI (and to http://localhost:4040 for the Spark UI). Furthermore, you can connect on port 5890 to make use of the database.
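As a sketch of what such a connection might look like programmatically: the following only opens a grpc channel to the container; the concrete service stubs are generated from the proto file in the grpc sub-repository and are not shown here.

```scala
import io.grpc.ManagedChannelBuilder

// Minimal sketch: open a grpc channel to a locally running ADAMpro container.
// Creating the generated service stubs is omitted.
object AdamProConnectExample {
  def main(args: Array[String]): Unit = {
    val channel = ManagedChannelBuilder
      .forAddress("localhost", 5890)
      .usePlaintext() // older grpc-java versions use usePlaintext(true)
      .build()

    // ... create the service stubs generated from the proto definitions here ...

    channel.shutdown()
  }
}
```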
A container created with the aforementioned options stores its data in Apache Parquet files. Other Docker tags are available for containers which, e.g., come with PostgreSQL and Apache Solr installed, or which use HDFS.
ADAMpro can be configured using a configuration file. This repository contains a ./conf/ folder with configuration files.
- application.conf is used when running ADAMpro from an IDE
- assembly.conf is the configuration file included in the assembly jar (when running sbt assembly)
When starting ADAMpro, you can provide an adampro.conf file in the same path as the jar, which is then used instead of the default configuration. (Note the file adampro.conf.template, which is used as a template for the Docker container.)
The configuration file specifies the settings for running ADAMpro. The file ADAMConfig.scala reads the configuration file and provides the settings to the application.
The file contains information on
- the path to all the internal files (catalog, etc.), e.g., internalsPath = "/adampro/internals"
- the grpc port, e.g., grpc {port = "5890"}
- the storage engines to use, e.g., engines = ["parquet", "index", "postgres", "postgis", "cassandra", "solr"]
For every storage engine specified, more details have to be provided in the storage section (note that the name specified in engines must match the name in the storage section):
parquet {
engine = "ParquetEngine"
hadoop = true
basepath = "hdfs://spark:9000/"
datapath = "/adampro/data/"
}
or
parquet {
engine = "ParquetEngine"
hadoop = false
path = "~/adampro-tmp/data/"
}
The parameters specified here are passed directly to the storage engines; it may make sense to consult the code of the respective storage engine to see which parameters are necessary (or to consider the example configuration files in the configuration folder). The name of the class is specified in the field engine.
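Putting these pieces together, a minimal adampro.conf might look as follows. This is a sketch only; the exact nesting of the sections is best verified against ADAMConfig.scala and the example files in the ./conf/ folder.

```
# minimal sketch of an adampro.conf; verify the nesting against ADAMConfig.scala
internalsPath = "/adampro/internals"
grpc {port = "5890"}
engines = ["parquet"]

storage {
  parquet {
    engine = "ParquetEngine"
    hadoop = false
    path = "~/adampro-tmp/data/"
  }
}
```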
ADAMpro builds on Apache Spark 2 and uses a large variety of libraries and packages, e.g., Google Protocol Buffers and grpc. The repository has the following structure:
- conf: folder for configuration files; note that the conf folder is automatically included in the resources
- grpc: the proto file (included from the proto sub-repository)
- grpcclient: general grpc client code for communicating with the grpc server
- scripts: useful scripts for deploying and running ADAMpro
- src: ADAMpro sources
- web: web UI of ADAMpro
ADAMpro comes with a set of unit tests which can be run from the test package. Note that a certain setup is necessary for all tests to pass. For instance, for the PostGIS test to pass, the database has to be set up and it must be configured in the configuration file. You may use the script setupLocalUnitTests.sh to set up all the necessary Docker containers before performing the unit tests.
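Assuming the script resides in the scripts folder of the repository (adjust the path to your checkout), a typical sequence might look as follows:

```
# set up the Docker containers required by the tests
./scripts/setupLocalUnitTests.sh
# then run the unit tests
sbt test
```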
We recommend the use of IntelliJ IDEA for developing ADAMpro. It can be run locally using the run commands in the IDE for debugging purposes.
Note that the behaviour of ADAMpro, when run locally, might differ from when it is submitted to Apache Spark (using ./spark-submit), in particular because of the inclusion of different package versions (e.g., Apache Spark will come with a certain version of netty, which is used even if build.sbt includes a newer version; we refrain from using the spark.driver.userClassPathFirst option as it is experimental).
ADAMpro can be debugged even when submitted to Spark. By setting the debugging option in the SPARK_SUBMIT_OPTS environment variable before submitting, a remote debugger can be attached:
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Here, we have opened port 5005 and, given the suspend option, have the application wait until a debugger attaches.
The Docker containers we provide have the SPARK_SUBMIT_OPTS options set and use port 5005 for debugging (note, however, that the suspend option, which makes the application wait until a debugger attaches, is turned off in the Docker containers).
In your IDE, bind to the application by setting up remote debugging on the specified port. For more information on how to use remote debugging, consider, e.g., this article: https://community.hortonworks.com/articles/15030/spark-remote-debugging.html
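A possible end-to-end sequence, reusing the spark-submit invocation shown in the deployment section below, might look like this (a sketch; paths and memory settings depend on your setup):

```
# make the JVM listen on port 5005 and wait until a debugger attaches
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
# submit ADAMpro as usual; the JVM now waits for the remote debugger
./spark-submit --master "local[4]" --driver-memory 2g --executor-memory 2g --class org.vitrivr.adampro.main.Startup $ADAM_HOME/ADAMpro-assembly-0.1.0.jar
# finally, attach a remote debugger to localhost:5005 from the IDE
```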
For checking the performance of ADAMpro, also consider creating flame graphs. For more information see here.
For introductory information see the getting started section in this documentation.
The simple Docker image has already been presented in the getting started section. It stores data in Apache Parquet files, but makes no use of other storage engines or of distribution. To start the simple deployment, run
docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 -d vitrivr/adampro:latest
where port 5890 is the port on which applications connect to ADAMpro, port 9099 serves the ADAMpro web UI, and port 4040 is for the Spark web UI.
Our Docker containers come with an update script located at /adampro/update.sh, which allows you to check out the newest version of the code from the repository and re-build the jars without creating a new container (and therefore without losing existing data). To run the update routine, run in your host system:
docker exec adampro /adampro/update.sh
Note that the Docker container makes use of a number of environment variables which can be adjusted, e.g., for better performance. Note in particular ADAMPRO_MEMORY, which is set to 2g. A few other environment variables, e.g., ADAMPRO_START_WEBUI, can be used to denote whether certain parts of ADAMpro (in this case the web UI) should be started or not.
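For instance, to give ADAMpro more memory and skip starting the web UI, the variables might be overridden when creating the container. This is a sketch only; the exact set of supported variables and their value formats is defined by the container scripts.

```
docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 \
  -e ADAMPRO_MEMORY=4g -e ADAMPRO_START_WEBUI=false \
  -d vitrivr/adampro:latest
```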
We have presented how to start a minimal Docker container. For self-contained containers which come with PostgreSQL and Solr within the container, you may use
docker run --name adampro -p 4040:4040 -p 5890:5890 -p 9099:9099 -d vitrivr/adampro:latest-selfcontained
where port 4040 serves the Spark web UI and port 9099 the ADAMpro UI. Port 5890 allows connecting to ADAMpro via the grpc interface.
For demo purposes, this container can be filled with data. We provide the OSVC data for download. Copy the archive to /adampro/data, untar it there, and restart the container:
docker cp osvc.tar.gz adampro:/adampro/data
docker exec adampro tar -C /adampro/data/ -xzf /adampro/data/osvc.tar.gz
docker restart adampro
ADAMpro can be built using sbt. We provide various sbt tasks to simplify deployment and development.
- assembly: creates a fat jar with ADAMpro to submit to Spark; runs sbt proto first
- proto: generates a jar file from the grpc folder and includes it in the main project (this is necessary, as the netty dependency needs to be shaded)
Other helpful sbt tasks include
- dependencyTree: displays the dependencies on other packages
- stats: shows code statistics
Because of its project structure, you first have to run
sbt proto
in the main folder to build ADAMpro; this generates the proto files and creates a jar file containing the proto sources in the ./lib/ folder.
Running
sbt assembly
(and sbt web/assembly for the ADAMpro UI) creates a jar file, which can then be submitted to Apache Spark using
./spark-submit --master "local[4]" --driver-memory 2g --executor-memory 2g --class org.vitrivr.adampro.main.Startup $ADAM_HOME/ADAMpro-assembly-0.1.0.jar
ADAMpro can also be started locally, e.g., from an IDE. For this, remove the % "provided" statements from build.sbt and the marked line ExclusionRule("io.netty"), and run the main class org.vitrivr.adampro.main.Startup. You can also use
sbt run
for running ADAMpro. Note that the storage engines specified in the configuration have to be running already, or you have to adjust the config file accordingly.
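For illustration, a dependency line in build.sbt would change roughly as follows. This is a sketch: the concrete dependency names and the version variable are assumptions here and should be taken from the project's actual build.sbt.

```scala
// before: Spark is provided by the cluster at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
// after: Spark is included, so ADAMpro can run directly from the IDE or via sbt run
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
```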
We provide a number of docker-compose scripts which can be used to set up the full environment. Check out the ADAMpro repository, go to the folder scripts/docker, and run
docker-compose up
This will start up a master and a single worker node, plus separate containers for PostgreSQL, Apache Cassandra and Apache Solr, all able to communicate with each other. To add more workers (note that the number of masters is limited to 1), run the scale command and specify the total number of workers you would like to deploy:
docker-compose scale worker=5
Note that this setup will not use Hadoop for creating an HDFS, but will rather just mount a folder into all Docker containers (both master and worker containers). Therefore, this deployment will only work if all containers run on a single machine.
ADAMpro can be started in a virtually distributed setup using docker-compose. The folder scripts/docker-hdfs contains a docker-compose.yml; move into the docker-hdfs folder and run:
docker-compose up
This will start up a master and a single worker node. Note that using the scale command of docker-compose you may create multiple workers; however, the number of master nodes (and Hadoop name nodes) is limited to 1.
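For example, to run three workers in total (mirroring the scale command shown above):

```
docker-compose scale worker=3
```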
ADAMpro can be started in a truly distributed setup with independent nodes connected via network. On the master node run
export ENV_ADAMPRO_MASTER_HOSTNAME=$HOSTNAME
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSMaster --network=host -p 2122:2122 -p 4040:4040 -p 5005:5005 -p 5432:5432 -p 5890:5890 -p 6066:6066 -p 7001:7001 -p 7002:7002 -p 7003:7003 -p 7004:7004 -p 7005:7005 -p 7006:7006 -p 7077:7077 -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8080:8080 -p 8088:8088 -p 8983:8983 -p 9000:9000 -p 9099:9099 -p 19888:19888 -p 38000:38000 -p 39000:39000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 -e "ADAMPRO_DRIVER_MEMORY=$ENV_ADAMPRO_MEMORY" -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077" -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY" -e "SPARK_PUBLIC_DNS=localhost" --entrypoint="/adampro/bootstrap.sh" -d vitrivr/adampro:latest-hdfs -d --masternode
where you should set the amount of memory to use. On the worker nodes run
export ENV_ADAMPRO_MASTER_HOSTNAME= #set host name of master host
export ENV_ADAMPRO_MASTER_IP= #set IP address of master host
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSWorker --network=host -p 2122:2122 -p 7012:7012 -p 7013:7013 -p 7014:7014 -p 7015:7015 -p 7016:7016 -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8081:8081 -p 8881:8881 -p 9000:9000 -p 38000:38000 -p 39000:39000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077" -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY" -e "SPARK_WORKER_INSTANCES=1" --add-host $ENV_ADAMPRO_MASTER_HOSTNAME:$ENV_ADAMPRO_MASTER_IP --entrypoint="/adampro/bootstrap.sh" -d vitrivr/adampro:latest-hdfs -d --workernode
where you should set the hostname and the IP address of the master node, and the amount of memory to use.
To open the web UI of ADAMpro, go to the master node; as usual, port 9099 serves the web UI. Similarly, to connect to ADAMpro via grpc, connect to the master node.
Consider the official documentation and this unofficial documentation for more information on how to use the images with Docker swarm and how to set up a Docker swarm cluster.