Docker Images for the Virgo Spark Cluster. Distribution including HDFS, YARN, Hive, Spark 2.3+

AiurTech/virgo-spark-cluster

Virgo Spark Cluster

Simplifies building and testing applications using Spark 2.3+. This cluster setup focuses primarily on Spark with Hive integration.

Components

  • Spark with external Hive Metastore (Postgres)
  • YARN
  • HDFS
  • Hive (same version as required by Spark)
  • Spark History Server

The components are pre-integrated with mutually compatible versions, so the cluster is expected to work out of the box.

The main benefit of this small cluster is that it makes it easy to run integration tests with YARN cluster support on your own machine.

Versions

Virgo cluster   Hadoop   Spark   Hive    Postgres   Livy
0.8.2           2.7.7    2.3.0   1.2.2   11         Moved
0.7.5           2.7.7    2.3.0   1.2.2   9.5        Moved
0.7.0           2.7.7    2.3.0   1.2.2   9.5        0.4
0.6.2           2.7.7    2.2.3   1.2.2   9.5        0.4
0.5.7           2.7.7    2.2.3   1.2.2   9.5

Use ✨ ✳️ 💫

To use the cluster, clone this repo and start it in either of two ways, with Docker Compose:

docker-compose up -d

or just Docker:

./run-cluster.sh

To stop the cluster:

docker-compose down

or just Docker:

./stop-cluster.sh
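For integration tests it can help to wrap the start and stop commands above around the test run, so the cluster is always torn down afterwards. The following is a minimal sketch, not part of this repo: `run` and `with_cluster` are hypothetical helper names, and it assumes you call it from the repository root where the Compose file lives.

```shell
#!/usr/bin/env bash
# Sketch only: `run` and `with_cluster` are hypothetical helpers, not part of this repo.
# Call with_cluster from the repository root, where the Compose file lives.

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Start the cluster, run the given command, then always stop the cluster.
# Returns the exit status of the given command.
with_cluster() {
  run docker-compose up -d || return 1
  rc=0
  run "$@" || rc=$?
  run docker-compose down
  return "$rc"
}
```

For example, `with_cluster ./my-integration-tests.sh` (a placeholder script name) would start the cluster, run the tests, and stop the cluster even if the tests fail. Setting `DRY_RUN=1` prints the commands without executing them. Note that `docker-compose up -d` returns once the containers are started, which may be before the services inside them are ready to accept jobs.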

The folder virgo-client contains several useful clients to test the cluster:

  • Spark Submit with YARN cluster mode
  • Spark Submit with YARN client mode
  • Remote Spark Shell via YARN master
  • Remote Hive Beeline Shell
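As a rough illustration of what these clients do, the commands below use the standard `spark-submit`, `spark-shell`, and `beeline` command-line options. This is a sketch under assumptions, not the contents of virgo-client: the function names, the examples jar path, and the HiveServer2 host are placeholders, and the port defaults to 10000, the usual HiveServer2 port.

```shell
#!/usr/bin/env bash
# Sketch only: function names, the jar path, and the HiveServer2 host are
# assumptions for illustration; the scripts in virgo-client may differ.

# Submit the SparkPi example to YARN. $1 is the deploy mode: cluster or client.
submit_spark_pi() {
  "${SPARK_SUBMIT:-spark-submit}" \
    --master yarn \
    --deploy-mode "${1:-cluster}" \
    --class org.apache.spark.examples.SparkPi \
    "${SPARK_EXAMPLES_JAR:-spark-examples.jar}" 100
}

# Open an interactive Spark shell with YARN as the master.
yarn_spark_shell() {
  "${SPARK_SHELL:-spark-shell}" --master yarn
}

# Run one statement through Beeline against HiveServer2.
beeline_query() {
  "${BEELINE:-beeline}" \
    -u "jdbc:hive2://${HIVE_HOST:-localhost}:${HIVE_PORT:-10000}" \
    -e "$1"
}
```

Typical calls would be `submit_spark_pi cluster`, `submit_spark_pi client`, or `beeline_query 'SHOW DATABASES;'`. When running Spark against YARN from outside the cluster, `HADOOP_CONF_DIR` (or `YARN_CONF_DIR`) must point at the cluster's client configuration so Spark can locate the ResourceManager and HDFS.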

Comparison with other commercial distributions:

Advantages:

  • The Docker images are just over 1 GB, versus 21 GB for HDP. They reuse base images extensively.
  • Simple-to-use Docker images. No special Docker privileges required.
  • Focus on ease of use rather than on a large set of components.
  • Full micro-services stack: it offers 10 components in independent images, which makes debugging easier.
  • Requires a minimum of 2 GB of RAM to run all containers. This is significantly less than the 10 GB required by HDP.
  • Limits itself to a maximum of 8 GB of RAM (total across all containers).
  • "Fast" startup time: it can boot up fully in under 2 minutes, which is several times faster than full distros.
  • Biased towards Spark rather than Hadoop.
  • Aims to support Kubernetes deployment soon.
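The memory figures above can be checked on a running cluster with `docker stats`. A small sketch, where `cluster_memory` is a hypothetical helper name; it lists every running container on the host, not only the ones belonging to this cluster:

```shell
#!/usr/bin/env bash
# Sketch only: `cluster_memory` is a hypothetical helper, not part of this repo.

# Print the name and current memory usage of each running container, once.
cluster_memory() {
  "${DOCKER:-docker}" stats --no-stream --format '{{.Name}}\t{{.MemUsage}}'
}
```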

Disadvantages:

  • It aims to provide a realistic cluster setup for the development phase, not to replace a full production cluster distribution.
  • Whilst security can be added, it is not the main focus at this point.
  • No admin console

Full Distros

Several commercial full distributions offer a fully managed Hadoop cluster, including Spark, but they bundle at least another 30 components, several of which are out of date or not relevant in many workflows.

This project started as an attempt to use the images kindly provided by the Big Data Europe 2020 Project. However, we found those images unsuitable because they did not integrate Spark with Hive correctly. Furthermore, those images are no longer supported.

Commercial support available

Please contact Aiur Tech [cto @ aiur.co.uk]

FAQ

Why the name Virgo?

The Virgo Cluster is a "neighbouring" galaxy cluster. It has some beautiful members.

Interestingly enough, soon after this project was created, the first ever picture of a black hole emerged, and it was none other than the one in M87, a member of the Virgo Cluster 😃
