Architecture: What happens when you submit a Spark app to Kubernetes
You submit a Spark application by talking directly to Kubernetes (precisely, to the Kubernetes API server on the master node), which will then schedule a pod (simply put, a container) for the Spark driver. Once the Spark driver is up, it will communicate directly with Kubernetes to request Spark executors, which will also be scheduled on pods (one pod per executor). If dynamic allocation is enabled, the number of Spark executors dynamically evolves based on load; otherwise it's a static number.
Apache Spark on Kubernetes Reference Architecture. Original image by author.
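For concreteness, here is a minimal sketch of submitting an application this way with spark-submit; the API server address, container image, service account, and example jar are placeholders you would replace with your own values:
$ spark-submit \
    --master k8s://https://<api-server-host>:443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=<your-registry>/spark:<tag> \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    local:///opt/spark/examples/jars/spark-examples_2.12-<spark-version>.jar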
Spark Submit vs. Spark on Kubernetes Operator App Management. Original image by author.
We recommend working with the spark-operator as it's much easier to use!
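As a rough installation sketch, the operator can typically be set up with Helm; note that the chart repository URL, chart name, and namespace below are assumptions that may have changed since this was written:
$ helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
$ helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace
# Once the operator is running, your Spark apps are managed as SparkApplication resources:
$ kubectl get sparkapplications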
Use SSDs or large disks whenever possible to get the best shuffle performance for Spark-on-Kubernetes
Shuffles are the expensive all-to-all data exchange steps that often occur with Spark. They can take up a large portion of your entire Spark job, so optimizing Spark shuffle performance matters. We've already covered this topic in our article (read "How to optimize shuffle with Spark on Kubernetes"), so we'll just give the high-level tip here: back your shuffle data with local SSDs or large disks whenever you can.
Optimize your pod sizes to avoid wasting capacity
Suppose your Kubernetes nodes have 4 CPUs each. Then you would submit your Spark apps with the configuration spark.executor.cores=4, right? Wrong. Your Spark app will get stuck because executors cannot fit on your nodes. You should account for the overheads described in the graph below.
Overheads from Kubernetes and Daemonsets for Apache Spark Nodes. Original image by author.
Typically, node allocatable represents 95% of the node capacity. The resources reserved for DaemonSets depend on your setup, but note that DaemonSets are popular for log and metrics collection, networking, and security. Let's assume that this leaves you with 90% of node capacity available to your Spark executors, so 3.6 CPUs.

This means you could submit a Spark application with the configuration spark.executor.cores=3. But this will reserve only 3 CPUs and some capacity will be wasted.
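If you want to check the actual numbers on your own cluster, you can inspect a node directly (the node name is a placeholder):
$ kubectl describe node <node-name>
# Look at the "Capacity" and "Allocatable" sections of the output: Allocatable is what remains
# after Kubernetes system reservations, before DaemonSet pods take their own share.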
Therefore in this case we recommend the following configuration:
spark.executor.cores=4
spark.kubernetes.executor.request.cores=3600m
This means your Spark executors will request exactly the 3.6 CPUs available, and Spark will schedule up to 4 tasks in parallel on this executor.
Advanced tip:
Setting spark.executor.cores greater (typically 2x or 3x greater) than spark.kubernetes.executor.request.cores is called oversubscription, and it can yield a significant performance boost for workloads where CPU usage is low.

In this example, we've shown you how to size your Spark executor pods so they fit tightly into your nodes (1 pod per node). Companies also commonly choose to use larger nodes and fit multiple pods per node. In this case, you should still pay attention to your Spark CPU and memory requests to make sure the bin-packing of executors on nodes is efficient. This is one of the dynamic optimizations provided by our platform.
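As an illustration, the sizing discussed above could be passed to spark-submit roughly as follows; the oversubscribed variant in the comment is only worth trying if you have measured low CPU usage:
# Request 3.6 CPUs per executor pod while letting Spark run 4 tasks in parallel on it.
$ spark-submit \
    --conf spark.executor.cores=4 \
    --conf spark.kubernetes.executor.request.cores=3600m \
    ...
# Oversubscribed variant (e.g. 2x): keep the pod request at 3600m but raise the task slots.
#   --conf spark.executor.cores=8 --conf spark.kubernetes.executor.request.cores=3600m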
Enable app-level dynamic allocation and cluster-level autoscaling
This is an absolute must-have if you're running in the cloud and want to make your data infrastructure reactive and cost-efficient. There are two levels of dynamic scaling (a configuration sketch follows this list):

1. App-level dynamic allocation. This is the ability for each Spark application to request Spark executors at runtime (when there are pending tasks) and delete them (when they're idle). Dynamic allocation is available on Kubernetes since Spark 3.0 by setting the following configurations:
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
2. Cluster-level autoscaling. This means the Kubernetes cluster can request more nodes from the cloud provider when it needs more capacity to schedule pods, and vice-versa, delete nodes when they become unused.
Kubernetes Cluster Dynamic Allocation and Autoscaling for Apache Spark. Original image by author.
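Here is the configuration sketch referenced above, combining the app-level settings from this article; the executor bounds and idle timeout are illustrative values, not recommendations:
$ spark-submit \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --conf spark.dynamicAllocation.executorIdleTimeout=60s \
    ...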
Together, these two settings will make your entire data infrastructure dynamically scale up when Spark apps can benefit from new resources and scale back down when these resources are unused. In practice, starting a Spark pod takes just a few seconds when there is capacity in the cluster. If a new node must first be acquired from the cloud provider, you typically have to wait 1–2 minutes (depending on the cloud provider, region, and type of instance).

If you want to guarantee that your applications always start in seconds, you can oversize your Kubernetes cluster by scheduling what are called "pause pods" on it. These are low-priority pods that basically do nothing. When a Spark app requires space to run, Kubernetes will delete these lower-priority pods and then reschedule them (causing the cluster to scale up in the background).
Illustration of app-level dynamic allocation and cluster-level autoscaling. Original image by author.
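A minimal sketch of the pause-pod trick, assuming a recent kubectl and the standard Kubernetes pause image; the names, priority value, and replica count are placeholders you would tune for your cluster:
# Create a very low priority class so pause pods are always evicted first.
$ kubectl create priorityclass overprovisioning --value=-10 --description="Pause pods used to overprovision the cluster"
# Run a few pause pods and attach the low priority class to them.
$ kubectl create deployment pause-pods --image=registry.k8s.io/pause:3.9 --replicas=3
$ kubectl patch deployment pause-pods --type=strategic \
    -p '{"spec":{"template":{"spec":{"priorityClassName":"overprovisioning"}}}}'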
Use Spot nodes to reduce cloud costs
Spot (also known as preemptible) nodes typically cost around 75% less than on-demand machines, in exchange for lower availability (when you ask for spot nodes there is no guarantee that you will get them) and unpredictable interruptions (these nodes can go away at any time).

Spark workloads work really well on spot nodes as long as you make sure that only Spark executors get placed on spot while the Spark driver runs on an on-demand machine. Indeed, Spark can recover from losing an executor (a new executor will be placed on an on-demand node and rerun the lost computations) but not from losing its driver.

To enable spot nodes in Kubernetes you should create multiple node pools (some on-demand and some spot) and then use node selectors and node affinities to put the driver on an on-demand node and the executors preferably on spot nodes.
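A hedged sketch of this placement, assuming your node pools carry a label such as lifecycle=on-demand / lifecycle=spot (the label key and values are assumptions) and a Spark version recent enough (3.3+) to support per-role node selectors; on older versions you can achieve the same thing with pod template files:
$ spark-submit \
    --conf spark.kubernetes.driver.node.selector.lifecycle=on-demand \
    --conf spark.kubernetes.executor.node.selector.lifecycle=spot \
    ...
# For a soft "preferably on spot" placement, use a preferredDuringScheduling node affinity
# in a pod template referenced via spark.kubernetes.executor.podTemplateFile.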
Monitor pod resource usage using the Kubernetes Dashboard
The Kubernetes Dashboard is an open-source, general-purpose, web-based monitoring UI for Kubernetes. It will give you visibility over the apps running on your clusters, with essential metrics to troubleshoot their performance like memory usage, CPU utilization, I/O, disks, etc.
Pod Resource Usage Monitoring On The Kubernetes Dashboard. Source:
The main issues with this project are that it's cumbersome to reconcile these metrics with actual Spark jobs/stages, and that most of these metrics are lost when a Spark application finishes. Persisting these metrics is a bit challenging but possible, for example using Prometheus (Spark 3.0 ships a built-in Prometheus servlet) or another external metrics sink.
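As a sketch of the Prometheus option, assuming Spark 3.0+; the executor-metrics endpoint is still marked experimental in Spark, so treat this as a starting point rather than a definitive setup:
$ spark-submit \
    --conf spark.ui.prometheus.enabled=true \
    --conf "spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet" \
    --conf "spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus" \
    ...
# Driver metrics are then served at <driver>:4040/metrics/prometheus, and executor metrics
# at <driver>:4040/metrics/executors/prometheus, ready to be scraped by a Prometheus server.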
How to access the Spark UI
The Spark UI is the essential monitoring tool built into Spark. The way you access it differs depending on whether the app is still running or has completed:

1. When the app is running, the Spark UI is served by the Spark driver directly on port 4040. To access it, you can port-forward to the driver pod by running the following command:
$ kubectl port-forward <driver-pod-name> 4040:4040
2. When the app is completed, you can replay the Spark UI by running the Spark History Server and configuring it to read the Spark event logs from persistent storage. You should first use the configuration spark.eventLog.dir to write these event logs to the storage backend of your choice. You can then install the Spark History Server from a Helm chart and point it to your storage backend (a configuration sketch follows).
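For illustration, the event-log side of this setup could look roughly like the following; the bucket name is a placeholder, and the same path would be given to the History Server as its log directory:
# Written by the application at submit time (bucket name is a placeholder).
$ spark-submit \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=s3a://<your-bucket>/spark-events/ \
    ...
# The History Server then reads the same location, e.g. spark.history.fs.logDirectory=s3a://<your-bucket>/spark-events/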
The main issue with the Spark UI is that it's hard to find the information you're looking for, and it lacks the system metrics (CPU, memory, I/O usage) from the previous tools.
Overview of the Spark UI. Original image by author.
For this reason, we’re working on a Spark UI replacement at Data Mechanics which will include both system metrics and Spark information and hopefully give a much better user experience to Spark developers. This product will be free and cross-platform, we’re actively working on it and hope to release it soon so stay tuned! .[SPARK-25299] Using remote storage to store shuffle files without impacting performance. Original image by author.
At Data Mechanics, we firmly believe that the future of Spark on Kubernetes is simply the future of Apache Spark. As one of the first commercial Spark platforms deployed on Kubernetes (alongside Google Dataproc, which has beta support for Kubernetes), we are certainly biased, but the adoption trends in the community speak for themselves.