Since initial support was added in Apache Spark 2.3, running Spark on Kubernetes has been growing in popularity. Reasons include the improved isolation and resource sharing of concurrent Spark applications on Kubernetes, as well as the benefit to use an homogeneous and cloud native infrastructure for the entire tech stack of a company. But running Spark on Kubernetes in a stable, performant, cost-efficient and secure manner also presents specific challenges. In this talk, JY and Julien will go over lessons learned while building Data Mechanics, a serverless Spark platform powered by Kubernetes.
– Welcome everyone, we’re very happy to be one of the first session of the Spark Summit. So in this talk we’re gonna give you our best practices and pitfalls that we learn while running Spark on top of Kubernetes.
So first a few words about who we are, we’re JY and Julien, the co-founders of Data Mechanics, the Y Combinator back startup, building a very easy to use Spark platform. I’m JY, I previously worked as a software engineer and Spark infrastructure lead at Databricks and Julien, our CTO, previously worked as a data scientist and data engineer at ContentSquare and BlaBlaCar. So I have more expertise with Spark on the infrastructure side and Julien has expertise with Spark as a user on the application side. So now that we told who we are, I’d like to do a quick, live poll, so as you know, the session is prerecorded, so, I won’t know the results here, if I had to take a guess, I’d say that the majority of you, never use Spark on Kubernetes, but we look forward to seeing the results. Maybe we’ll be surprised and by the way, even though the talk is recorded, I will be here on the chats with you in live to answer your questions. So feel free to make this as interactive as possible.
Great! So, here’s how we’d like to spend the next 25 minutes with you, I’ll first say a quick few words on Data Mechanics, to give a bit of background on our usage of Spark on Kubernetes in particular and then we’ll get to the meat of the presentation. First the core concepts of Spark on Kubernetes, then we’ll go over a wide range of configuration and performance tips, we’ll go over best practices for monitoring and security. And we’ll talk a bit about the future works that are being worked on in open source.
So first data mechanics, we are serverless Spark platform. What does that mean? We provide a serverless experience, to data scientists and data engineers, so they can focus on their application rather than the infrastructure. So how do we do this? We’re containerized, meaning applications start and scale quickly to speed up your iteration cycle, then we integrate with the tools that you already use, Jupyter Notebooks, your IDE, Airflow, and we make the transition from your local development, your existing workflow to running on data mechanics seamless and finally, our platform tools, the infrastructure parameters and Spark configurations automatically for each pipeline. For example, we tune the amount of memory and CPU to allocate to each Spark worker. We tune the default parallelism, the number of partitions, the shuffle configs, the instance types and so on. So this spodumene tuning has a massive impact on stability and performance, on the stability side, we can avoid or automatically remediate many errors, say out memory for example, on the performance side, we can fix fertilization issues over provisioning, bad shuffles, optimize DIO, on average we’ve being able to get the two X performance boost, meaning faster pipelines and lower costs. So if you have any questions about how this works, check out the presentation from last year at Spark Summit Europe and now I’ll move onto a topic that’s more relevant for today’s talk or architecture.
So our platform is deployed on a Kubernetes cluster in our customers’ cloud accounts. This is very important for security and data privacy reasons, you know, our customer’s sensitive data, never leaves their account. the Kubernetes cluster both skills up and down, at the minimum, there’s this one service running our gateway, which is the entry point for launching Spark applications, so you can point a Jupyter Notebook to it, or you can submit apps, programmatically through Airflow and other scheduler, or the API. The gateway will take your app, add configurations, it will scale it and give you access to a monitoring dashboard. So we built our entire product on top of Spark on Kubernetes, which is at Hawaii, we had the pleasure to learn some of its tricks, and that we’d like to share with you now. So first, the core concepts of Kubernetes, where does Kubernetes fit within Spark?
Spark, sorry. Kubernetes is a new cluster manager scheduler for Spark, previously you had the standalone scheduler, which is built in, but has limited features. You have the Apache Mesos, which was popular but has been, some decline over the last few years and then there is Yarn, which is by far the most popular resource manager, which is the Hadoop resource manager And basically many platforms run on Yarn like EMR, Data Proc, Qubole, HD insight, et cetera. Kubernetes, is definitely the new cool kid on the block and Spark can natively run on Kubernetes since version 2.3.
So here’s a slide that shows the architecture of this native supports. When you submit a Spark application, you talk directly to Kubernetes, the API server, which will schedule the driver pod, so the Spark driver container and then the Spark driver and the Kubernetes Cluster will talk to each other to request and launch Spark executors. This can happen statically or this can happen, dynamically if you enable dynamic application. So like driver pod asked for executors communities starts then Delta driver pod and the driver is gonna start using these executors, to run the Spark tasks. There are two ways to submit Spark application on Kubernetes.
You can use Spark submit, which is you know, the Vanilla way, part of the Spark open source projects. However, app management is pretty manual. We’ll see this on the next slide and the are some configuration limitations, for example, you can’t specify notes selectors or affinities. Now the second way is the Spark on Kubernetes operator, it’s been open sourced by Google, and it works on top of any platform, on the negative side, it needs a long running system pod, so it needs to run a small container on the Kubernetes’ cluster, but app management is simpler and you get a lot of tooling for free, to read logs, kill application, restarted them, so this would actually be, our recommendation.
As you can see, on the left side, this is how you do a Sparks submits and after the Sparks submit, the only way to interact with your app is by doing Kubernetes’ operation on the pod. It’s not very powerful now on the right side, you’ll see that with the operator, a Spark application becomes a Kubernetes object that you can configure in Yamble, describe, delete, restart, et cetera. So, the last core concept I wanna explain is around dependency management and containerization.
On Yarn, you often have the shared Spark clusters where you run multiple apps concurrently. Usually you do this for cost reasons, but then you have to trade off on isolation. You must have a global spark version, the global Python version and global dependencies that can’t crash against each other. So another, another common issue is the lack of reproducibility.
You must run in it’s groups that can be flaky or you’re not entirely sure of the environment. Sometimes there’s a global, system changes on the cluster that can affect your app. Now in Kubernetes, on the other hand, each Spark app has its own Docker image, this means you can have full isolation and full control of your environment. You speak your Spark version, your Python version, your dependencies, some of our customers use docker to package all their and build a new image for each app and others prefer to just use a small set of base images and then in the space image, they put their more heavy, common libraries, and then they just add their execution code dynamically on top. So I’ll now pass it over to Julien, our CTO, who will give you concrete configuration and performance tips, to make you successful with Spark on Kubernetes. – Thanks JY I want now to give you some tips to get you up and running faster with Spark on Kubernetes and reach the level of performance that you’re used to with other schedulers like Yarn and Spark standalone. The first surprise when you weren’t a workload on, on Kube, is that you cannot just size executors to fit the size of your nodes.
Suppose I have a cluster with four core instances. If I said the number of course, per executor to four those back driver will never obtain many executive bots. And the similar thing happens with memory and what’s the reason, well, you need to take into account the share of resources that is reserved by the system and by Kubernetes, the full node capacity for course, is not available to pods.
What is available is called, Node allocatable, and it’s usually around 90 to 95% of node capacity. It depends on the size of your node and the Kubernetes settings by the cloud provider. And even that is actually not available to your executor pods as you usually have some DaemonSets running like fluentd or something. So, eventually in my example, you only have 85% of the node that is available to your, to your Spark executive pod. So, the correct configuration is, set Spark executor course to four, so that Spark runs four tasks in parallel on a given node, but sets Spark Kubernetes is executor request course two 3.4 CPUs, so that the pod is actually scheduled and created.
The next, tips that we want to share are about dynamic allocation. On Yarn, you can enable an external shuffle service and then safely enable dynamic allocation without the risk of losing shuffled files when Down scaling.
On kubernetes the exact same architecture is not possible, but, there’s ongoing work around these limitation. in the meantime a soft dynamic allocation needs available in Spark three dot o. Despite driver, trucks, shuffle files, and only allows the eviction of those executors that do not store, active, shuffled files. This works great in practice they’re on average, the scale down of Spark applications is slower compared to a full fledged dynamic allocation.
One cool setup that you can do with Spark on Kubernetes is having a Spark applications with dynamic allocation activated and due scanning on the cluster. So, when the Spark application, we really question you, executor, it takes us two seconds to create a new pod if there’s room on the cluster, but if there isn’t, the cluster will scale up and it usually takes one or two minutes to get a new node from the crop provider.
This in practice is a major cost saver.
On the other hand, if you’re willing to trade off some cost optimization for your lower latency, you can resolve, some extra resource in your cluster with over provision.
The idea is to run some low priority pods, that’s basically do nothing, They’re called, pause pods. And when a Spark application is run, the pause pods will be scaled to make room for the Spark executors, and then the pause pods will be get again and cause the cluster to scale up in the background.
if you want to go further in cost reduction, you can also use a spot or a preemptible instances, just make sure to run the Spark driver, on demand nodes and Spark executors only on a special instances. The reason is, if the node of the driver is preempted, the Spark obligation will fail.
This is easy to do in Kubernetes with, several node pods and you can assign, no definitive to pods, so that they’re preferably scheduled on nodes from a specific Notebook.
Now a few tricks about performance, most of know about this, I’m sure, an object storage like S3 or a GCS does not behave like HDFS. The latency of doing an operation on a single file is huge. So, it’s important to use a FileOutputCommitter that minimizes the number of operations. Some cloud providers right optimize the committers for their object storage, so, make sure to use them if it’s available to you and if it’s not the case, on your cloud provider, just use the version two of the Hadoop committer that’s with Spark it works pretty well in practice too.
shuffle, issue, the shuffle bound, workload, and just run it by default, you’ll realize that the performance of a Spark of Kubernetess is worse than Yarn and the reason is that Spark uses local temporary files, during the shuffle phase. And this causes a lots of disc, and it’s fortunately, Doctor File system is slow, to solve this issue. You cannot volumes in the executive pallets so that the file system is faster, there’s a variety of options that you can use here. The basic solution is to mount an empty dir, an empty dir is a temporary folder in the host Spark system That can used by the product, each implemented by default in Spark three at a three dot o, then the shell mounted, fast disks on your node, like 10 DME based SSDs. You can Mount host Path it’s a like an empty dir, but you can specify the path on the host machine file system. And in our case you can send the host path to the location of the mountain disk. And finally, you can use RAM as a local storage with the protocol MTFs. This is super fast that the stored files will count toward your memory limits. So be careful with this solution.
Spark performance is a hard topic, as you all know. And the few tips we shared go to show that it requires some knowledge of the nature of your workloads. Is it CPU bound, IO bound and so on. So you need a good monitoring infrastructure. And we’re gonna see some ways to do this, on the Kubernetes cluster.
The first type of monitoring to set up is port resource usage monitoring, the Kubernetes dashboard is an open source project that can install and that reports, CPU, memory, and disk usage in a web UI, the GKE console issue and GCP, provides a port resource usage as well. The main issues though with this type of monitoring, are, first that it’s hard to link, with Spark phases like jobs, stages and tasks and second, executor pods are removed once the Spark app is finished. And so you lose a lot of useful meta data.
Another classic of spot fine tuning is the spot history server, it’s relatively easy to set up with this Spark history server, Helm charts. But again, the main issue with the history server is the reciprocal of the dashboard. It lacks resource, you usage metrics. So, here’s a way to make a spot events and resource usage metrics.
What you can do, is export everything to a time series DB. This enables the superposition of a Spark stage boundaries and the resource usage metrics, like what you can see on the graph on a dashboard, on the right of, I won’t go into detail here, as there’s a great talk on this very topic at last year’s Spark summits by Luca Canali.
One last word for me about security, this is an area where a spot can leverage, the fact that it’s running on, that it’s running on Kubernetes, Kubernetes as a strong BLT in role based access control model and secrets management. And there’s a plethora of open source projects that you can plug into makes your life easier like you still, are HashiCorp vault. So I’m in a nutshell, a Spark, really benefits from all the latest, DevSecOps advancements for four, three, just by kubernetes.
I’ll now pass it over back to JY who will talk about the ongoing work on Spark on Kubernetes. – Great, thanks Julien.
So yeah, the next topic I wanna discuss is the, The features are already being worked on right now for Spark Kubernetes that we expect to come out in Spark three dot one and beyond First, the shuffle improvements, Julien touched on this a little bit, on Yarn, there’s a thing called the external shuffle service, which when enabled stores, shuffle data outside of the Spark executor, but still on the young cluster. The goal, with a Spark Kubernetes is to persist shuffle files to a remote storage outside of the cluster, probably an object store and maybe asynchronously to avoid a performance penalty, the goal to desegregate the compute and the storage resources is to enable the full unconstrained dynamic allocation. And it would also make Spark resilient to no loss, which is not the case of the Yarn external shovel service. Now, another resiliency improvement in the works, is around a better handling for node shutdowns. So, you know, now cloud providers give you a notice a few minutes before Spark or preemptible node is going away, today when we received this notice, we stopped scheduling tasks on the impacted executors, which is already great then, and this is implemented, but now the next improvement is to use this remaining time to live, to copy any shuffle data or any cash data to another node, so that you don’t lose it. Other you know, work’s being worked on is, local, sorry, support for local Python dependency upload we’re using Spark submit and also the general topic of Job Queues and improved resource management, which I won’t go into the detail now.
So to conclude, Spark on Kubernetes, should you get started? Is it worth it? Know That we are obviously bias in this answers is we chose to build an entire platform around Spark on Kubernetes, we think it’s the future of Spark.
The main benefits we see are, containerization and the simpler dependency management. Second, the integration enrich commodities ecosystem. and finally, the fact that you get both complete isolation and efficient resource sharing, when you use Spark on Kubernetes, you don’t need to trade off between the two. On the negative side, Kubernetes support is still pretty new, it’s still marked as experimental until a version 2.4. and there’s also a lot to learn and a lot to set up yourself as you get started. So to illustrate this point, here’s a high level checklist of things, you should do, if you wanna work with Spark on Kubernetes. On the infrastructure side, you need to set up, the kubernetes cluster itself, the Spark operator, a Docker registry, the Spark history server, some money touring for your logs and metrics and then on the configuration side, need to configure node pools, configure optimized IO, configure a auto-scaling maybe spots, nodes, or on demand nodes.
And this is not a one time thing, but probably something you’ll need to revisit, to, you know, keep watch on your performance and that’s a lot of work, and that’s really the reason why we built a serverless Spark platform, you know, we’ve done the setup work and data mechanics, and we added powerful optimizations on top like automated tuning, so that you can focus on your core business, building data models and data pipelines, rather than the DevOps work. Focus on your data while we handle the mechanics. So that’s it, we hope you enjoyed the thoughts we were available, in the chat to answer any questions you might have.
JY is the CEO and Co-Founder of Data Mechanics, a hassle-free containerized data platform that abstracts away the complexities of Spark and infrastructure management. Prior to that, he was a software engineer and Spark infrastructure team lead at Databricks, growing their cluster-management capabilities from early days to the scale of launching hundreds of thousands of nodes in the cloud every day. JY is passionate about making distributed data technologies 10x more accessible and resource-efficient through automation.
Julien is the CTO and Co-Founder of Data Mechanics, a Y Combinator-backed startup with the mission to make distributed computing accessible to everyone, starting with a serverless Spark platform running on Kubernetes. He previously worked as a data scientist on optimizing BlaBlaCar’s world-leading carpooling marketplace, and led the data team at the website UX optimization platform ContentSquare.