Crusty-Banana/Streaming-Data-Pipeline

Deploy data pipeline system

First, bring up the stack with Docker Compose:

docker compose up -d

When you are done, shut everything down; the -v flag also removes the named volumes (and with them any stored data):

docker compose down -v
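For orientation, the compose file presumably wires up services along these lines. This is a hypothetical sketch only: the service names come from the docker exec commands in this README and the ports from the GUI section, while the images and everything else are assumptions, not the repository's actual docker-compose.yml.

```yaml
# Hypothetical sketch — not the repository's actual docker-compose.yml.
services:
  namenode:                          # HDFS NameNode, targeted by docker exec below
    image: bde2020/hadoop-namenode   # assumed image
    ports:
      - "9870:9870"                  # Hadoop web UI
    volumes:
      - ./Hadoop:/Hadoop             # where -copyToLocal lands on the host
  spark-master:                      # Spark master, where spark-submit runs
    image: bitnami/spark             # assumed image
    ports:
      - "9090:8080"                  # Spark web UI, remapped to 9090
  # kafka, zookeeper, nifi, and spark-worker services omitted for brevity
```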

Stream content to a topic

Then stream content to Kafka from your local machine:

python Kafka/kafka_producer.py
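Kafka/kafka_producer.py in the repository is the source of truth; as a rough idea of what such a producer looks like, here is a hedged sketch. The topic name, broker address, and record shape are assumptions, and kafka-python is just one common client.

```python
# Hypothetical sketch of a producer like Kafka/kafka_producer.py.
# Topic name and broker address are assumptions, not taken from the repo.
import json

TOPIC = "activity"          # assumed topic name
BROKER = "localhost:9092"   # assumed broker address

def serialize(record: dict) -> bytes:
    """Encode a record as UTF-8 JSON, a common wire format for Kafka values."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def main():
    # kafka-python imported lazily so serialize() works without a broker.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=BROKER, value_serializer=serialize)
    producer.send(TOPIC, {"user": "alice", "event": "click"})
    producer.flush()

if __name__ == "__main__":
    main()
```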

Process Data

When you want to process the data in HDFS with Spark, run

docker exec -it spark-master sh -c "spark-submit --master spark://spark-master:7077 /opt/bitnami/spark/process_data.py"
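The repository's process_data.py is the source of truth; a job of this shape might look like the following sketch. The HDFS paths mirror the ones used elsewhere in this README, but the NameNode RPC port, the record format, and the aggregation itself are assumptions.

```python
# Hypothetical sketch of a Spark job like process_data.py:
# read raw activity records from HDFS and write results to /output_zone.

def parse_activity(line: str) -> tuple:
    """Split one CSV line into (user, rest) — the record format is an assumption."""
    user, rest = line.strip().split(",", 1)
    return user, rest

def main():
    # pyspark imported lazily so parse_activity() is testable without Spark.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("process_data").getOrCreate()
    lines = spark.sparkContext.textFile("hdfs://namenode:9000/raw_zone/fact/activity/*")
    counts = (lines.map(parse_activity)
                   .map(lambda ua: (ua[0], 1))
                   .reduceByKey(lambda a, b: a + b))   # events per user
    counts.saveAsTextFile("hdfs://namenode:9000/output_zone")
    spark.stop()

if __name__ == "__main__":
    main()
```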

Save to the local machine

When you want to download the processed data from HDFS to the local machine at /Hadoop, run (the command first removes any earlier copy of /Hadoop/output_zone)

docker exec -it namenode sh -c "rm -rf /Hadoop/output_zone && hdfs dfs -copyToLocal /output_zone /Hadoop"

Some useful commands

python Kafka/kafka_consumer.py                                              # read messages back from the topic
docker exec -it namenode sh -c "hdfs dfs -rm -r /output/*"                  # clear processed output in HDFS
docker exec -it namenode sh -c "hdfs dfs -rm -r /raw_zone/fact/activity/*"  # clear raw activity data in HDFS

GUI locations

NiFi: http://localhost:8080/nifi
Hadoop: http://localhost:9870
Spark: http://localhost:9090

Keep the end-of-line sequence as LF so the code stays consistent with the Linux VM:

git config --global core.autocrlf input
git config --global core.eol lf
git add --renormalize .
git commit -m "Normalize line endings to LF"
git push origin master
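Alternatively, a .gitattributes file committed to the repository enforces LF for every clone, without relying on each developer's global git config. A minimal sketch:

```
# .gitattributes — normalize all text files to LF in the repo and working tree
* text=auto eol=lf
```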
