First steps with Apache Spark on K8s (part #3): GCP Spark Operator and auto scaling
Everything you read here is personal opinion or a gap in my knowledge :) Please feel free to contact me to fix the incorrect parts.
In previous blog posts I described how to set up the environment for Kubernetes and how to auto scale Standalone Spark. In this post let’s have a look at the GCP Spark Operator. Join my journey with K8s and Spark! Operator versions are shown below:
Installation is pretty simple, just use Helm. I’m not providing a namespace, as I’m using the default one:
#1.helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install gcp-spark-operator spark-operator/spark-operator
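If you’d rather not run the operator in the default namespace, Helm can create a dedicated one during installation. This is just an optional sketch; the namespace name below is my own choice and not part of the original setup:
helm install gcp-spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace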
Double check that the deployment was successful:
#2.kubectl get deployments
kubectl get po
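If the operator pod doesn’t come up, its logs usually explain why. Assuming the Helm release created a deployment named gcp-spark-operator (the exact name may differ between chart versions), the logs can be checked like this:
kubectl logs deploy/gcp-spark-operator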
To understand how horizontal auto scaling works in the GCP Spark Operator, I will use several scripts:
#3.git clone https://github.com/domasjautakis/spark-operator-gcp
As the first step, the service account and cluster role binding should be created (a rough sketch of such an RBAC manifest follows the commands below), and then the Spark application for the operator:
#4.kubectl apply -f spark-operator-gcp/1.rbac.yaml
kubectl apply -f spark-operator-gcp/2.spark-app-gcpoperator.yaml
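I’m not copying the repository files here, but a typical RBAC manifest for running Spark applications looks roughly like the sketch below; the service account name spark, the default namespace and the edit cluster role are my assumptions, so the actual 1.rbac.yaml may differ:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default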
Horizontal auto scaling
Check the example application:
#5.more spark-operator-gcp/3.spark-app-gcpoperator-scaled.yaml
What is the difference between the provided example and this small dummy app?
A dynamicAllocation block is added to the specification, which describes the scaling preferences. This block actually defines the horizontal auto scaling parameters; in this case anywhere from 2 to 10 executors can be requested. An arguments block is also added, which is used for input parameters. The value 10 is a relatively small random sample input parameter for this app, so it will scale, but not to the maximum number of replicas.
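For reference, the relevant part of such a SparkApplication spec looks roughly like this; the executor range and the argument match my description above, the rest is a sketch rather than a copy of the repository file:
spec:
  dynamicAllocation:
    enabled: true
    minExecutors: 2
    maxExecutors: 10
  arguments:
    - "10"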
Probably the best tool to graphically represent POD scaling is the Minikube dashboard, but in my case I will use the plain console:
#6.kubectl apply -f spark-operator-gcp/3.spark-app-gcpoperator-scaled.yaml
kubectl get po
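To follow the scaling live instead of re-running the command, kubectl’s watch flag can be used (just a convenience, not part of the original scripts):
kubectl get po -w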
The next app is exactly the same, just with the input argument set to 100000, which gives a bit more load :)
#7.kubectl apply -f spark-operator-gcp/4.spark-app-gcpoperator-scaled.yaml
kubectl get po
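The operator also tracks the application state in its custom resource, so besides the pods you can look at the SparkApplication objects themselves; the exact status columns depend on the operator version:
kubectl get sparkapplications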
Vertical auto scaling
There is no such term as vertical auto scaling in the K8s world, but my understanding and personal opinion is that a POD can scale inside itself within the provided resource requests and resource limits. Probably that might be possible only with operators:
I haven’t found an option to set “vertical” scaling on the executor POD, something like cores/coresLimit and memory/memoryLimit. Some similar configuration items exist, but they actually represent different functionality.
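As an illustration of the kind of static resource fields the executor spec does offer (a sketch with arbitrary values, based on the operator’s documented examples), these fix requests and limits up front rather than scaling them automatically:
executor:
  instances: 2
  cores: 1
  coreLimit: "1200m"
  memory: "512m"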
Conclusion: horizontal auto scaling of Spark workers with the GCP Spark Operator is pretty obvious and simple. No HPA object was created during scaling, which is a bit strange. Overall it’s a great user experience, as it has decent documentation and a wide community. Personally, I love it!