
ADAMpro documentation

ADAMpro is the persistent polystore (based on Apache Spark) for all data required for retrieval.

Getting started

In the following, we present how to easily set up and run ADAMpro. For this, we make use of the Docker image released on Docker Hub.

Pull the image using

docker pull vitrivr/adampro:latest

or run it directly (with the recommended ports opened):

docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 -d vitrivr/adampro:latest

where port 5890 is used by applications to connect to ADAMpro, port 9099 serves the ADAMpro web UI, and port 4040 serves the Spark web UI.

After the creation of the container, you can navigate to

http://localhost:9099

to open the ADAMpro UI (and http://localhost:4040 for the Spark UI). Furthermore, you can connect on port 5890 to make use of the database.
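
To quickly verify that the container is up and the web UI responds, standard tools suffice (a sanity check; adjust the host name if Docker does not run locally):

docker ps --filter "name=adampro"
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9099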

Our Docker containers come with an update script located at /adampro/update.sh, which allows you to check out the newest version of the code from the repository and re-build the jars without creating a new container (and therefore losing existing data). To run the update routine, run on your host system:

docker exec adampro /adampro/update.sh

Note that the Docker container makes use of a number of environment variables which can be adjusted, e.g., for better performance. In particular, note ADAMPRO_MEMORY, which is set to 2g by default. A few other environment variables, e.g., ADAMPRO_START_WEBUI, denote whether certain parts of ADAMpro should be started or not.
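
For example, to give ADAMpro more memory when creating the container, you can override ADAMPRO_MEMORY (a sketch; 4g is an arbitrary value chosen here):

docker run --name adampro -p 5890:5890 -p 9099:9099 -p 4040:4040 -e "ADAMPRO_MEMORY=4g" -d vitrivr/adampro:latest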

Creating the Docker container with the aforementioned options yields a container which stores its data in files. Other Docker tags are available for containers which, e.g., come with PostgreSQL and Solr installed, or which use HDFS.
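
For example, to pull one of these variants (both tags are used in the deployment section below):

docker pull vitrivr/adampro:latest-selfcontained
docker pull vitrivr/adampro:latest-hdfs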

Configuration

Configuration files

ADAMpro can be configured using a configuration file. This repository contains a ./conf/ folder with configuration files.

  • application.conf is used when running ADAMpro from an IDE
  • assembly.conf is the conf file included in the assembly jar (when running sbt assembly)

When starting ADAMpro, you can provide an adampro.conf file in the same path as the jar, which is then used instead of the default configuration. (Note the file adampro.conf.template, which is used as a template for the Docker container.)

Configuration parameters

The configuration file specifies the settings for running ADAMpro. The file ADAMConfig.scala reads the configuration file and provides the settings to the application.

The file contains information on

  • the path to all the internal files (catalog, etc.), e.g., internalsPath = "/adampro/internals"
  • the grpc port, e.g., grpc {port = "5890"}
  • the storage engines to use, e.g., engines = ["parquet", "index", "postgres", "postgis", "cassandra", "solr"]

For each storage engine specified, more details have to be provided in the storage section (note that the name specified in engines must match the name in the storage section):

  parquet {
    engine = "ParquetEngine"
    hadoop = true
    basepath = "hdfs://spark:9000/"
    datapath = "/adampro/data/"
  }

or

  parquet {
    engine = "ParquetEngine"
    hadoop = false
    path = "~/adampro-tmp/data/"
  }

The parameters specified here are passed directly to the storage engines; it may make sense to consult the code of the individual storage engine to see which parameters are necessary (or to consider the exemplary configuration files in the configuration folder). The name of the engine class is specified in the field engine.
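
Putting these pieces together, a minimal configuration might look as follows (a sketch based solely on the parameters shown above; compare against the exemplary files in ./conf/ for the exact structure before using it):

  internalsPath = "/adampro/internals"
  grpc {port = "5890"}
  engines = ["parquet"]

  storage {
    parquet {
      engine = "ParquetEngine"
      hadoop = false
      path = "~/adampro-tmp/data/"
    }
  }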

Code basis and Repository

ADAMpro builds on Apache Spark 2 and uses a large variety of libraries and packages, e.g., Google Protocol Buffers and gRPC. The repository has the following structure:

  • conf folder for configuration files; note that the conf folder is automatically included in the resources
  • grpc the proto file (included from the proto sub-repository)
  • grpcclient general grpc client code for communicating with the grpc server
  • scripts useful scripts for deploying and running ADAMpro
  • src ADAMpro sources
  • web web UI of ADAMpro

Development

Unit tests

ADAMpro comes with a set of unit tests which can be run from the test package. Note that for all tests to pass, a certain setup is necessary. For instance, for the PostGIS test to pass, the database has to be set up and configured in the configuration file. You may use the script setupLocalUnitTests.sh to set up all the necessary Docker containers before performing the unit tests.
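
For example (assuming the script resides in the scripts folder and that the tests are run via the standard sbt test task):

./scripts/setupLocalUnitTests.sh
sbt test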

Debugging

We recommend the use of IntelliJ IDEA for developing ADAMpro. It can be run locally using the run commands in the IDE for debugging purposes.

Note that the behaviour of ADAMpro, when run locally, might differ from its behaviour when submitted to Apache Spark (using ./spark-submit), in particular because of the inclusion of different package versions (e.g., Apache Spark comes with a certain version of netty, which is used even if build.sbt includes a newer version; we refrain from using the spark.driver.userClassPathFirst option, as it is experimental).

ADAMpro can be debugged even when submitted to Spark. By setting the debugging option in SPARK_SUBMIT_OPTS before submitting, a remote debugger can be attached:

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

Here, we have opened port 5005 and, given the suspend option, the application waits until a debugger attaches.

The Docker containers we provide have SPARK_SUBMIT_OPTS set and use port 5005 for debugging (note, however, that the suspend option, which makes the application wait until a debugger attaches, is turned off in the Docker container).

In your IDE, bind to the application by setting up remote debugging on the specified port. For more information on how to use remote debugging, consider, e.g., this article: https://community.hortonworks.com/articles/15030/spark-remote-debugging.html
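
Outside of an IDE, you can also use the JDK's command-line debugger to verify that the debug socket is reachable (a minimal check against the port opened above):

jdb -attach localhost:5005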

Flame graphs

For checking the performance of ADAMpro, also consider the creation of flame graphs. For more information see here.

Deployment

For introductory information see the getting started section in this documentation.

Self-contained Docker container

We have presented how to start a minimal Docker container. For self-contained containers which come with PostgreSQL and Solr within the container, you may use

docker run --name adampro -p 5005:5005 -p 5890:5890 -p 9099:9099 -p 5432:5432 -p 9000:9000 -p 4040:4040 -d vitrivr/adampro:latest-selfcontained

For demo purposes, this container can be filled with data. We provide the OSVC data for download. Copy the archive to /adampro/data, extract it, and restart the container:

docker cp osvc.tar.gz adampro:/adampro/data
docker exec adampro tar -C /adampro/data/ -xzf /adampro/data/osvc.tar.gz 
docker restart adampro

Native deployment

ADAMpro can be built using sbt. We provide various sbt tasks to simplify the deployment and development.

  • assembly creates a fat jar with ADAMpro to submit to Spark; run sbt proto first
  • proto generates a jar file from the grpc folder and includes it in the main project (this is necessary because the netty dependency has to be shaded)

Other helpful sbt tasks include

  • dependencyTree displays the dependencies to other packages
  • stats to show code statistics

Because of the project structure, to build ADAMpro you first have to run

sbt proto

in the main folder, which generates the proto files and places a jar file containing the proto sources in the ./lib/ folder.

Running

sbt assembly

(and sbt web/assembly for the ADAMpro UI) creates a jar file which can then be submitted to Apache Spark using

./spark-submit --master "local[4]" --driver-memory 2g --executor-memory 2g --class org.vitrivr.adampro.main.Startup $ADAM_HOME/ADAMpro-assembly-0.1.0.jar

ADAMpro can also be started locally, e.g., from an IDE. For this, remove the % "provided" statements from build.sbt and the marked ExclusionRule("io.netty") line (illustrated below), and run the main class org.vitrivr.adampro.main.Startup. You can use

sbt run

for running ADAMpro, as well. Note that the storage engines specified in the configuration have to be running already or you have to adjust the config file accordingly.
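
For illustration, the markers to remove typically look as follows in build.sbt (hypothetical coordinates and versions; the actual entries in the repository differ):

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
libraryDependencies += "io.grpc" % "grpc-netty" % "1.0.1" excludeAll ExclusionRule("io.netty")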

Deployment using docker-compose

We provide a number of docker-compose scripts which can be used to set up the full environment. Check out the ADAMpro repository, go to the folder scripts/docker, and run

docker-compose up

This will start up a master and a single worker node, plus separate containers for PostgreSQL, Apache Cassandra and Apache Solr, all able to communicate with each other. To add more workers (note that the number of masters is limited to 1), run the scale command and specify the total number of workers you would like to deploy:

docker-compose scale worker=5

Note that this setup will not use Hadoop to create an HDFS, but will rather just mount a folder into all Docker containers (both master and worker containers). Therefore, this deployment will only work if all containers run on one single machine.

Virtually distributed deployment using HDFS with docker-compose

ADAMpro can be started in a virtually distributed setup using docker-compose. The folder scripts/docker-hdfs contains a docker-compose.yml; move into the docker-hdfs folder and run:

docker-compose up

This will start up a master and a single worker node. Note that using the scale command of docker-compose you may create multiple workers; however, the number of master nodes (and Hadoop name nodes) is limited to 1.

Distributed deployment using HDFS with Docker

ADAMpro can be started in a truly distributed setup with independent nodes connected via network. On the master node run

export ENV_ADAMPRO_MASTER_HOSTNAME=$HOSTNAME 
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSMaster --network=host \
  -p 2122:2122 -p 4040:4040 -p 5005:5005 -p 5432:5432 -p 5890:5890 -p 6066:6066 \
  -p 7001:7001 -p 7002:7002 -p 7003:7003 -p 7004:7004 -p 7005:7005 -p 7006:7006 -p 7077:7077 \
  -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8080:8080 -p 8088:8088 -p 8983:8983 \
  -p 9000:9000 -p 9099:9099 -p 19888:19888 -p 38000:38000 -p 39000:39000 \
  -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 \
  -e "ADAMPRO_DRIVER_MEMORY=$ENV_ADAMPRO_MEMORY" \
  -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077" \
  -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" \
  -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY" \
  -e "SPARK_PUBLIC_DNS=localhost" \
  --entrypoint="/adampro/bootstrap.sh" \
  -d vitrivr/adampro:latest-hdfs -d --masternode

where you should set the amount of memory to use. On the worker nodes run

export ENV_ADAMPRO_MASTER_HOSTNAME= #set host name of master host
export ENV_ADAMPRO_MASTER_IP= #set IP address of master host
export ENV_ADAMPRO_MEMORY= #choose a certain amount of memory
docker run --name adamproHDFSWorker --network=host \
  -p 2122:2122 -p 7012:7012 -p 7013:7013 -p 7014:7014 -p 7015:7015 -p 7016:7016 \
  -p 8020:8020 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8081:8081 -p 8881:8881 \
  -p 9000:9000 -p 38000:38000 -p 39000:39000 \
  -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 \
  -e "ADAMPRO_MASTER=spark://$ENV_ADAMPRO_MASTER_HOSTNAME:7077" \
  -e "ADAMPRO_MASTER_HOSTNAME=$ENV_ADAMPRO_MASTER_HOSTNAME" \
  -e "ADAMPRO_EXECUTOR_MEMORY=$ENV_ADAMPRO_MEMORY" \
  -e "SPARK_WORKER_INSTANCES=1" \
  --add-host $ENV_ADAMPRO_MASTER_HOSTNAME:$ENV_ADAMPRO_MASTER_IP \
  --entrypoint="/adampro/bootstrap.sh" \
  -d vitrivr/adampro:latest-hdfs -d --workernode

where you should set the hostname and the IP address of the master node, and the amount of memory to use.

Deployment using Docker swarm

Consider the official documentation and this unofficial documentation for more information on how to use the images with Docker swarm and how to set up a Docker swarm cluster.
