First steps with Apache Spark on K8s (part #2): standalone Spark auto scaling with HPA

Everything you read here is my personal opinion and may reflect gaps in my knowledge :) Please feel free to contact me so I can fix any incorrect parts.

If you’re reading this blog post, you most probably know what Apache Spark and Kubernetes (K8s) are. One of the most exciting features available on Kubernetes is autoscaling, which lets you use cluster resources more efficiently. Spark can be scaled in several ways, but the best known is probably the Horizontal Pod Autoscaler (HPA). More details can be found at: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

picture taken from: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

How do you use HPA with Spark? The most efficient way is the Spark Operator, but in order to understand how it works, let’s begin with standalone Spark. In this blog post we will create an HPA for the worker deployment.

If you already have a working Kubernetes environment, great; otherwise you can follow my previous post: First steps with Apache Spark on K8s: preparation of environment

I will begin by cloning the repo:

#1.git clone https://github.com/domasjautakis/standalone-spark-on-k8s.git

This repo consists of:

  • Main Dockerfile for master/worker container
  • Deployments for master and worker
  • Services — network service for Spark Web UI
  • Spark UI Proxy, which lets us reach the Spark Web UI from outside the cluster. The module was written by Alexis Seigneurin, more: github

The next step is to build the Docker image:

#2.cd standalone-spark-on-k8s/
docker build -f Dockerfile -t spark3:v1 .

Verify that the image was created successfully:

#3.docker image ls

Create the spark-ui-proxy image and verify it:

#4.cd spark-ui-proxy/
docker build -f Dockerfile -t spark3proxy:v1 .
docker image ls

Let’s proceed with deployments:

#5.kubectl apply -f ../services/spark-master-service.yaml
kubectl apply -f ../deployments/spark-master-deployment.yaml
sleep 5
kubectl apply -f ../deployments/spark-worker-deployment.yaml
sleep 5
kubectl apply -f ../deployments/spark-ui-proxy/service.yaml
kubectl apply -f ../deployments/spark-ui-proxy/deployment.yaml
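
The fixed `sleep 5` pauses above work on a quiet machine, but a slower one may need longer. As a rough sketch (not part of the repo), a small helper around `kubectl rollout status` can wait until a deployment is actually ready. The deployment name `spark-master` is my assumption here; check `kubectl get deployments` for the real names.

```shell
#!/usr/bin/env bash
# Sketch: block until a Deployment has finished rolling out.
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_deployment() {
  local name="$1"             # Deployment name, e.g. spark-worker
  local timeout="${2:-120s}"  # give up after this long
  "$KUBECTL" rollout status "deployment/${name}" --timeout="${timeout}"
}

# Usage (names are assumptions, adjust to your cluster):
#   wait_for_deployment spark-master
#   wait_for_deployment spark-worker 180s
```

With the helper defined, each `sleep 5` could be swapped for a `wait_for_deployment` call on the deployment that was just applied.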

Verify that the pods and services are running:

#6.kubectl get po
kubectl get svc
minikube ip

Now you can connect to the Spark UI using the minikube IP and the spark-ui-proxy port. In my case the port is 31136; it might be different on other runs.
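
If you do not want to read the port off the `kubectl get svc` output by hand, the address can be put together in a couple of lines. This is only a sketch: the helper name `spark_ui_url` is made up for illustration, and the service name `spark-ui-proxy` in the usage comment is an assumption, so check `kubectl get svc` for the actual one.

```shell
#!/usr/bin/env bash
# Sketch: build the Spark UI address from a node IP and a NodePort.
spark_ui_url() {
  local ip="$1"    # e.g. the output of `minikube ip`
  local port="$2"  # NodePort of the UI proxy service
  printf 'http://%s:%s\n' "$ip" "$port"
}

# Usage on minikube (service name is an assumption):
#   port="$(kubectl get svc spark-ui-proxy -o jsonpath='{.spec.ports[0].nodePort}')"
#   spark_ui_url "$(minikube ip)" "$port"
```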

In the next step we will create an HPA for the spark-worker deployment. Before that, please verify that the spark-worker deployment exists. After creating the HPA, check that you can see it as well. In our case we set the maximum number of pods to 6, and the deployment scales on CPU: when the average CPU utilization across the worker pods (measured against each pod’s CPU request) goes above the 10% target, new pods are added. I have set a low target so that scaling kicks in quickly.

#7. kubectl get deployments
kubectl autoscale deployment spark-worker --max=6 --cpu-percent=10
kubectl get hpa
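
The same autoscaler can also be kept as a manifest next to the other YAML files. Below is a sketch of what I believe is the declarative equivalent of the `kubectl autoscale` command above. It assumes a cluster that serves the `autoscaling/v2` API, the file name `spark-worker-hpa.yaml` is just my choice, and `minReplicas: 1` spells out the default that applies when `--min` is not given.

```shell
#!/usr/bin/env bash
# Sketch: write a manifest equivalent to
#   kubectl autoscale deployment spark-worker --max=6 --cpu-percent=10
# Assumes the cluster serves autoscaling/v2; the file name is arbitrary.
cat > spark-worker-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spark-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spark-worker
  minReplicas: 1          # default when --min is omitted
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 10
EOF

# Apply it instead of running kubectl autoscale:
#   kubectl apply -f spark-worker-hpa.yaml
```

Keep in mind that a CPU utilization target only works if the worker containers declare a CPU request, because the percentage is calculated against that request.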

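To generate some load, let’s connect to a Spark worker pod and run the SparkPi app, which ships by default in the examples folder, with an input parameter of 100000. SparkPi treats this argument as the number of partitions (“slices”) to split the work into, and the number of random samples grows with it.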

#8. kubectl exec -it "$(kubectl get pods | grep "spark-worker" | cut -d ' ' -f 1 | head -n 1)" -- /bin/bash
spark-submit --master spark://spark-master:7077 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark-examples_2.12-3.0.2.jar 100000

And the magic starts. Be patient: HPA relies on metrics-server and works with average CPU usage values, so it might take some time to react. Open a second terminal session and check the HPA status and the pods:

#9.kubectl get hpa
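
To make sense of the TARGETS and REPLICAS columns, it helps to know the rule from the Kubernetes docs: desired replicas = ceil(current replicas × current utilization / target utilization). Here is a simplified sketch of that arithmetic. The function name is made up, and it ignores the tolerance band and the stabilization behaviour of the real controller.

```shell
#!/usr/bin/env bash
# Sketch of the HPA scaling rule:
#   desired = ceil(current_replicas * current_cpu_pct / target_cpu_pct)
# clamped to [min, max]. Simplified: no tolerance, no stabilization window.
hpa_desired_replicas() {
  local current="$1"   # current number of pods
  local cpu="$2"       # observed average CPU utilization, in percent
  local target="$3"    # target utilization, in percent
  local min="${4:-1}"
  local max="${5:-6}"
  # Integer ceiling division.
  local desired=$(( (current * cpu + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

hpa_desired_replicas 1 40 10   # prints 4
hpa_desired_replicas 2 80 10   # prints 6 (16 capped at max)
hpa_desired_replicas 3 2 10    # prints 1
```

With a 10% target, even a moderately busy single worker is enough to push the deployment towards the maximum of 6 pods.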

Verify in Spark UI:

Wait until the application finishes, or stop it manually. It will take some time to scale back down, since by default HPA waits through a five-minute stabilization window before removing pods.

Conclusion: running Spark in standalone mode on Kubernetes is not efficient, but it’s a pretty awesome way to understand how scaling works. In the next parts I will try to go through “vertical” and horizontal scaling with the available Spark operators.
