First steps with Apache Spark on K8s (part #3): GCP Spark Operator and auto scaling

Everything you read here is personal opinion and may reflect gaps in my knowledge :) Please feel free to contact me to fix any incorrect parts.

In previous blog posts I described how to set up an environment for Kubernetes and how to auto scale standalone Spark. In this post, let's have a look at the GCP Spark Operator. Join my journey with K8s and Spark!

Installation is pretty simple: just use Helm. I'm not specifying a namespace, so the default one is used:

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install gcp-spark-operator spark-operator/spark-operator

Double-check that the deployment was successful:

kubectl get deployments
kubectl get po

To understand how horizontal auto scaling works with the GCP Spark Operator, I will use several scripts:

git clone https://github.com/domasjautakis/spark-operator-gcp

As the first step, the service account and cluster role binding should be created, and then the Spark application:

kubectl apply -f spark-operator-gcp/1.rbac.yaml
kubectl apply -f spark-operator-gcp/2.spark-app-gcpoperator.yaml
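For reference, a SparkApplication manifest for this operator typically looks something like the sketch below. This is not the contents of 2.spark-app-gcpoperator.yaml from the repo, just a minimal illustration; the image, class, and jar path are assumptions based on the operator's stock examples:

```yaml
# Hypothetical minimal SparkApplication for the GCP Spark Operator.
# Image, mainClass, jar path and service account name are assumptions.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v3.1.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 2          # fixed executor count; no auto scaling yet
    cores: 1
    memory: 512m
```

The operator watches for objects of this CRD type and translates them into driver and executor pods.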

Horizontal auto scaling

Check the example application:

more spark-operator-gcp/3.spark-app-gcpoperator-scaled.yaml

What is the difference between the provided example and this small dummy app?

A dynamicAllocation block is added to the specification, which describes the scaling preferences; this block holds the horizontal auto scaling parameters. In this case it allows between 2 and 10 executors, as required. An arguments block is also added, which is used for input parameters. 10 is a relatively small random-sample input parameter for this app, so it will scale, but not up to the maximum number of executors.
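The relevant part of the spec looks roughly like this. This is a sketch of the two blocks described above, not the exact contents of the repo's manifest; the initialExecutors value is an assumption:

```yaml
# Hypothetical fragment of the scaled SparkApplication spec;
# only the arguments and dynamicAllocation blocks are shown.
spec:
  arguments:
    - "10"                # small sample size, so scaling stays moderate
  dynamicAllocation:
    enabled: true
    initialExecutors: 2
    minExecutors: 2       # lower bound for executor pods
    maxExecutors: 10      # upper bound for executor pods
```

Note that this maps onto Spark's own dynamic allocation mechanism rather than a Kubernetes HorizontalPodAutoscaler, which explains the observation later in this post that no HPA object appears.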

Probably the best tool to graphically represent pod scaling is the Minikube dashboard, but in my case I will use a plain console:

kubectl apply -f spark-operator-gcp/3.spark-app-gcpoperator-scaled.yaml
kubectl get po

The next app is exactly the same, just with the input argument set to 100000, which generates a bit more load :)

kubectl apply -f spark-operator-gcp/4.spark-app-gcpoperator-scaled.yaml
kubectl get po

Vertical auto scaling

There is no such term as vertical auto scaling in the K8s world, but my understanding and personal opinion is that a pod could scale within itself, between its resource requests and resource limits. That would probably only be possible with operators:

I haven't found an option to set "vertical" scaling on the executor pod, something like cores/coresLimit and memory/memoryLimit pairs that the executor could move between. Some similar configuration items exist, but they actually represent different functionality.
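For example, the executor spec does accept fields along these lines, but they set fixed requests and limits for the pod rather than a range the operator would scale within. A sketch, with the specific values as assumptions:

```yaml
# Hypothetical fragment of an executor spec. These fields pin
# the pod's resources; they do not enable any vertical scaling.
spec:
  executor:
    cores: 1            # CPU request for each executor pod
    coreLimit: "1200m"  # CPU limit for each executor pod
    memory: 512m        # executor container memory
```

In other words, the knobs exist for static sizing, but nothing adjusts them at runtime.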

Conclusion: horizontal auto scaling of Spark executors with the GCP Spark Operator is pretty straightforward and simple. Curiously, no HPA object was created during scaling, which is a bit strange at first sight. Overall it's a great user experience, with decent documentation and a wide community. Personally, I love it!

Domas Jautakis, Data Engineer