GitHub

SF Crime Project

Question 1

How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

It either increased or decreased processedRowsPerSecond

Question 2

What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

The options with the most effect was the following, set to the respective values. The meassure I used to guage the effect was processedRowsPerSecond. In Apache Spark, shuffle is one of costliest operation. Effective parallelising of this operation gives good performing for spark jobs. So I changed the following properties. Since we are dealing with a small dataset, 200 is a overkill.

spark.sql.shuffle.partitions 30
spark.streaming.kafka.maxRatePerPartition 30

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
screenshots		screenshots
README.md		README.md
data_stream.py		data_stream.py
kafka_server.py		kafka_server.py
producer_server.py		producer_server.py
radio_code.json		radio_code.json
requirements.txt		requirements.txt
server.properties		server.properties
start.sh		start.sh
zookeeper.properties		zookeeper.properties

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

happyorsaad/DataStreamingProject2

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages