
pipeline

I dream of being a real-time streaming pipeline demo when I grow up...

GOALS

My goal is to use this project to create a real-time streaming application inside Docker containers, using the following technologies and workflows:

  1. Input data stream, starting with the output from something like vmstat, or similar
  2. Kafka to buffer the messages
  3. Spark Streaming to handle streaming data into Spark
  4. Spark for distributed computations
  5. Cassandra ( or other persistent storage ) to hold the output
  6. Node.js server, needed for Angular
  7. Angular as the web front-end framework
  8. D3.js to display real-time visualizations
  9. Socket.IO to push real-time data straight to d3.js

NOTES

Angular2

npm install needs to be run outside of the Dockerfile first when using persistent storage the way I am

I think I need to modify the file structure so that whatever npm install creates when it runs from the Dockerfile doesn't get overwritten by my bind mount to the host system

In any case, this is the history that seems to be working at this point:

npm cache clean
npm install
docker build -t kettlewell:angular2 .
docker run -it -p 3000:3000 -p 3001:3001 \
       -v /Users/mkettlewell/git/pipeline/angular2/:/angular2/ \
       --name ng2 kettlewell:angular2

Next steps:

  1. decouple npm install from the host system -- DONE?
  2. get this working with docker-compose.yml -- DONE
  3. Create a script that can run all the tests ( just a simple script ):
    • socket.io
    • real-time spark
    • regular spark
  4. Pull in images for the following, and understand how each works / connects:
    • kafka
    • spark
    • cassandra
    • socket.io
  5. Create dev/prod environments for how the code gets mounted, i.e. mount the host system in dev, don't mount it in prod. Prod would be for end-users, dev would be for testing/building this pipeline. EDIT: Just get a dev environment going for now...

To further refine the above:

  1. First get socket.io -> angular2 working

  2. After that get the kafka -> spark -> file -> angular2 working

  3. Then get kafka -> spark -> cassandra -> angular2 working

  4. Then get kafka -> streaming spark -> cassandra -> angular2 working

Also...

Might need to create a number pumper that emits random numbers to simulate RT data.

Alternatively, could use iostat, vmstat, sar, or something in /proc that creates new data every few seconds
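
Here's a minimal sketch of what that number pumper might look like as a tiny Node / TypeScript script -- the file name and output format are placeholders, nothing that exists in the repo yet:

// pump.ts -- hypothetical number pumper: one random value per second on stdout
// Pipe it into whatever the Input stage needs ( a file, a Kafka producer, a socket, ... )
const intervalMs = 1000;

setInterval(() => {
  const value = (Math.random() * 100).toFixed(2);
  process.stdout.write(`${new Date().toISOString()} ${value}\n`);
}, intervalMs);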

Appended Notes:

So after doing some research and thinking about things, it's clear to me that this project ( and most data pipelines, really ) follows this path:

  1. Input Source
  2. Message Queue / Buffer
  3. Processing
  4. Output / Results Storage
  5. Display

This suggests to me that I could have a container for each stage in the processing pipeline, with each stage capable of scaling horizontally to add additional compute / storage nodes, though that's generally not needed for a prototyping system like this.

So thinking out loud...

I think that I can create a directory for each stage, and inside each one address the specifics of which component(s) get built into that stage's container.

So rough draft outline:

  1. Input
    a. static files
    b. random number stream
    c. streaming system data
  2. Queue
    a. Kafka
  3. Processing
    a. Spark
    b. Spark Streaming
  4. Output
    a. files
    b. Cassandra
  5. Display
    a. node.js
    b. express server
    c. angular
    d. d3.js

To me, the way to build this up would be to start small with the static / hard-coded data first, roughly in this order:

  1. Display framework. Get an express server running that has the ability to work with routes, and create some stub routes:
    a. / ( list of all links available )
    b. /socket.io/monitoring
    c. /socket.io/random-numbers
    d. /spark/cassandra/census-data
    e. /spark/disk/census-data
    f. /spark-streaming/disk/random-numbers
    g. /spark-streaming/cassandra/random-numbers
    h. /spark-streaming/disk/monitoring
    i. /spark-streaming/cassandra/monitoring

    OK... I like this ... the 'file' keyword was changed to 'disk' and I changed the placeholder 'static-file' to 'census-data' so that it resembles a real data set, and .. voila!

    Those look like 8 good starting points to test enough scenarios to get a pretty good feel for how everything works together in a variety of fashions.

    Edit: One last thought on this ... might be easier to just have /demo1, /demo2, etc, and an index page that lists out what each URL is accomplishing... probably the easiest and least complicated way. ( There's a rough sketch of this stub-route server after this list. )

  2. So after the Display framework is started, work with the socket.io framework to get data flowing from Input to Display. The point of this exercise isn't really to process data, but to show that data can stream from one source to another and be displayed in real-time. I have some of the monitoring socket.io tested, so let's start with that and get it displaying through a d3.js framework. Don't worry about the fine points of the display yet. That will come very last. ( See the Socket.IO sketch after this list. )

  3. Next we want to work on Spark -> File -> Display. We will need to create Spark and Storage docker containers, and make sure that Spark can read data from the Input container and write data to the Output container.

    Need a way to submit jobs to spark. From the Input container? Or.. ???

    Need to select a data file ( census data would work ), and create a mock analysis ( mean, std dev would work to start ) just to get the concept running.

  4. Update the Angular app to read the Spark data from the Output directory and display it in a d3.js chart. ( See the file-reading sketch after this list. )

  5. Spark streaming will require a message buffer, so I need to build the Queue container with Kafka in it, and get the pipeline going from Input -> Queue -> Spark Streaming -> File -> Display. ( A Kafka producer sketch for the Input -> Queue leg follows this list. )
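
To make step 1 concrete, here's a rough sketch of what the stub-route express server could look like. This is only my guess at a starting point -- the file name, port, and demo list are placeholders, not anything that exists in this repo yet ( assumes express is installed and something like ts-node to run it ):

// server.ts -- hypothetical stub-route server for the Display stage
import express from 'express';

const app = express();

// Index page: list every demo link so nobody has to remember the URLs
const demos: { [path: string]: string } = {
  '/demo1': 'socket.io -> d3 ( monitoring data )',
  '/demo2': 'socket.io -> d3 ( random numbers )',
  '/demo3': 'spark -> disk -> d3 ( census data )',
  '/demo4': 'spark-streaming -> cassandra -> d3 ( random numbers )',
};

app.get('/', (_req, res) => {
  const links = Object.keys(demos)
    .map((p) => `<li><a href="${p}">${p}</a> -- ${demos[p]}</li>`)
    .join('');
  res.send(`<ul>${links}</ul>`);
});

// Stub out each demo route until the real pipeline behind it exists
Object.keys(demos).forEach((p) => {
  app.get(p, (_req, res) => res.send(`TODO: ${demos[p]}`));
});

app.listen(3000, () => console.log('stub routes listening on 3000'));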
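
For step 2, the Socket.IO side could be as simple as spawning vmstat and emitting each line to whoever is connected. Again just a sketch with names of my own choosing, written against the current socket.io API ( assumes socket.io is installed and vmstat exists in the container ):

// monitor.ts -- hypothetical Socket.IO emitter: vmstat lines -> browser in real time
import { spawn } from 'child_process';
import { createServer } from 'http';
import { Server } from 'socket.io';

const httpServer = createServer();
const io = new Server(httpServer, { cors: { origin: '*' } });

// vmstat prints a new sample every 2 seconds; forward each line as a 'monitoring' event
const vmstat = spawn('vmstat', ['2']);
vmstat.stdout.on('data', (chunk: Buffer) => {
  chunk
    .toString()
    .split('\n')
    .filter(Boolean)
    .forEach((line) => io.emit('monitoring', { ts: Date.now(), line }));
});

httpServer.listen(3001, () => console.log('socket.io emitter on 3001'));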
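
For step 4, the Display container mostly just needs a route that reads whatever Spark wrote into the shared Output volume and hands it to the Angular / d3.js side. Another sketch, with a made-up output path and file name:

// results.ts -- hypothetical route serving Spark results out of the Output volume
import express from 'express';
import { promises as fs } from 'fs';

const app = express();

// Wherever the Spark container writes its results ( this path is an assumption )
const OUTPUT_FILE = '/output/census-summary.json';

app.get('/spark/disk/census-data', async (_req, res) => {
  try {
    const raw = await fs.readFile(OUTPUT_FILE, 'utf8');
    res.json(JSON.parse(raw)); // e.g. { "mean": ..., "stddev": ... }
  } catch (err) {
    res.status(503).send(`no results yet: ${err}`);
  }
});

app.listen(3000, () => console.log('spark results route on 3000'));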
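
For step 5, I won't try to sketch the Spark Streaming job here, but the Input -> Queue leg could reuse the number pumper idea from above and push each value into a Kafka topic. Sketch using the kafkajs client; the broker address and topic name are made up:

// feed-queue.ts -- hypothetical Input -> Kafka producer ( assumes kafkajs is installed )
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'pipeline-input', brokers: ['kafka:9092'] });
const producer = kafka.producer();

async function main() {
  await producer.connect();
  // Push one random value per second into the 'random-numbers' topic
  // so Spark Streaming has something to consume
  setInterval(async () => {
    const value = (Math.random() * 100).toFixed(2);
    await producer.send({
      topic: 'random-numbers',
      messages: [{ value: `${Date.now()} ${value}` }],
    });
  }, 1000);
}

main().catch(console.error);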

Notes on minimal images

After spending a few days ( a week ) working with a minimal CentOS image, it's just not worth the effort for this project, because any size savings were getting negated by the dependencies of all the tools we're installing ( spark / kafka / R / cassandra / etc. )

So for this project, I'll just use the base centos:centos7 image
