
Spark On Kubernetes


Spark on Kubernetes, deployed via a Helm chart.

The control-plane and worker node addresses are:

192.168.56.115
192.168.56.116
192.168.56.117

(screenshot)

Kubernetes cluster nodes:

(screenshot)

You can install Helm via this link: helm


The Steps:

  1. Install Spark via the Bitnami Helm chart:

$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm search repo bitnami
$ helm install kayvan-release oci://registry-1.docker.io/bitnamicharts/spark
$ helm upgrade kayvan-release bitnami/spark --set worker.replicaCount=5

The 6 installed pods (1 master + 5 workers):

(screenshot)

and the Services (a headless Service, since the chart uses StatefulSets):

(screenshot)

and the Spark master UI:

(screenshot)


  1. Run the commands below against the cluster (kubectl sends them to the kube-apiserver):
kubectl exec -it  kayvan-release-spark-master-0 -- ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
  ./examples/jars/spark-examples_2.12-3.4.1.jar 1000

or

kubectl exec -it  kayvan-release-spark-master-0 -- /bin/bash

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
  ./examples/jars/spark-examples_2.12-3.4.1.jar 1000


For the Python examples no --class flag is needed (it applies only to JVM applications):

./bin/spark-submit \
  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
  ./examples/src/main/python/pi.py 1000

./bin/spark-submit \
  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
  ./examples/src/main/python/wordcount.py //filepath

(screenshots: spark-submit output)
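
As an alternative to spark-submit, any PySpark application can reach the same master through the headless-service DNS name used above. A minimal sketch, run from inside the master pod (the app name "pi-check" is made up):

from pyspark.sql import SparkSession

# the master URL is the pod's stable DNS name provided by the headless Service
spark = SparkSession.builder \
    .appName("pi-check") \
    .master("spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077") \
    .getOrCreate()

# trivial sanity check that work is distributed to the workers
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()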

The exact Scala and Python sources of spark-examples_2.12-3.4.1.jar, pi.py, and wordcount.py (a sketch of pi.py's core follows this list):

examples/src/main/scala/org/apache/spark/examples/SparkPi.scala

examples/src/main/python/pi.py

examples/src/main/python/wordcount.py
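
For orientation, the core of pi.py is a Monte Carlo estimate: sample random points in the unit square and count how many land inside the unit circle. Roughly (paraphrased from the upstream example, not a verbatim copy):

import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def inside(_):
    # draw a point in [-1, 1] x [-1, 1]; 1 if it falls inside the unit circle
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()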


  1. The final result is 🍹:

For Scala:

(screenshot)

For Python:

(screenshots)


Another Python program:

  1. Copy people.csv (a large file) into the Spark worker pods:

(screenshot)

kubectl cp people.csv kayvan-release-spark-worker-{x}:/opt/bitnami/spark

Notes:

  • You can download the file from this link.
  • Instead of copying the file into the pods, you can also read the large CSV from an NFS shared folder, as sketched below.
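
A minimal sketch of the NFS variant, assuming the share is mounted at /mnt/nfs inside every Spark pod (both the mount path and the app name are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nfs-read").getOrCreate()

# read straight from the shared mount instead of a per-pod local copy
df = spark.read.options(delimiter=",", header=True).csv("/mnt/nfs/people.csv")
df.show()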
  1. Write the following Python code in readcsv.py:
from pyspark.sql import SparkSession
#from pyspark.sql.functions import sum

# build (or reuse) the SparkSession that drives the job
spark = SparkSession\
            .builder\
            .appName("Mahla")\
            .getOrCreate()

sc = spark.sparkContext

path = "people.csv"

# read the CSV with a header row; without an explicit schema every column is a string
df = spark.read.options(delimiter=",", header=True).csv(path)

df.show()

#df.groupBy("Job Title").sum().show()

# expose the DataFrame to SQL as a temp view (view names are case-insensitive,
# so "peopletable" in the query resolves to "Peopletable")
df.createOrReplaceTempView("Peopletable")
df2 = spark.sql("select Sex, count(1) countsex, sum(Index) sex_sum " \
                "from peopletable group by Sex")
df2.show()

#df.select(sum(df.Index)).show()

(screenshot)
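
The same aggregation can also be written with the DataFrame API instead of SQL; an equivalent sketch of the query above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Mahla").getOrCreate()
df = spark.read.options(delimiter=",", header=True).csv("people.csv")

# group by Sex and compute the same two aggregates as the SQL version
df2 = df.groupBy("Sex").agg(
    F.count("*").alias("countsex"),
    F.sum("Index").alias("sex_sum"),  # the string column is implicitly cast by sum()
)
df2.show()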

  1. Copy the readcsv.py file into the Spark master pod:
kubectl cp readcsv.py kayvan-release-spark-master-0:/opt/bitnami/spark
  1. Run the code (again, no --class flag, since readcsv.py is a Python script):
kubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit \
  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
  readcsv.py
  1. Showing some data:

(screenshot)

  1. The aggregation result:

(screenshot)

  1. The processing time:

(screenshot)


Another Python program, on Docker Desktop:

docker-compose.yml:

version: '3.6'

services:

  spark:
    container_name: spark
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root   
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
    ports:
      - 127.0.0.1:8081:8080

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
Bring the stack up with two workers; the master web UI is then published at http://127.0.0.1:8081 (per the ports mapping above):

docker-compose up --scale spark-worker=2

(screenshot)

Copy the required files into the containers. For example:

docker cp file.csv spark-worker-1:/opt/bitnami/spark

Python code on the master (ctp.py, referenced by the spark-submit call below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Writingjson").getOrCreate()

# read the CSV and squeeze it into 2 partitions before writing
df = spark.read.option("header", True).csv("csv/file.csv").coalesce(2)

df.show()

# write JSON partitioned by "name": one name=<value> sub-directory per distinct value
df.write.partitionBy('name').mode('overwrite').format('json').save('file_name.json')

Run the code in the Spark master Docker container (4f28330ce077 here is the master container's ID/hostname; yours will differ):

./bin/spark-submit --master spark://4f28330ce077:7077 csv/ctp.py

Showing some data:

(screenshot)

and the separate JSON files produced by the name partitioning:

(screenshot)

Data for name=kayvan:

(screenshot)
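
To consume the partitioned output from another job, it is enough to point Spark at the root folder; the name column is restored from the name=<value> directory names. A minimal read-back sketch (the app name is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-back").getOrCreate()

# partition discovery turns file_name.json/name=kayvan/... back into a "name" column
df = spark.read.json("file_name.json")
df.show()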