Running Apache Spark Jobs Using Kubernetes

Apache Spark has introduced a powerful engine for distributed data processing, providing unmatched capabilities to handle petabytes of data across multiple servers. Its capabilities and performance unseated other technologies in the Hadoop world, but while Spark provides a lot of power, it also comes with a high maintenance cost, which is why we now see innovations to simplify the Spark infrastructure. Kubernetes, in its own right, offers a simplified way to manage infrastructure and applications. It provides a practical approach to isolating workloads, limiting the use of resources, deploying on demand and scaling as needed. Yaron Haviv will explain how to work with Kubernetes to build a single workflow with Spark-based data preparation and ML tasks. Participants will learn how running Spark with Kubernetes enables users to unify analytics and data science on a single cloud-native architecture and eliminate the overhead of an extra big data cluster managed by different tools.

Video Transcript

– Hello, hi, everyone, we’re gonna cover today how to run Apache Spark on Kubernetes. So it will be myself, I’m Yaron, I’m CTO and founder of Iguazio. And I have with me Marcelo, a longtime big data expert and Spark expert from our team, who is also going to show the demo and talk about the details. So with that, let’s move to the first slide.

So, first, you know, let’s talk about the challenges today. One of the critical elements when you’re building machine learning and AI is to build a production pipeline. So not only work with CSVs, Excel spreadsheets, et cetera, in research; the real problem is deploying that in production. You have real data, you have streaming data, you have to do ETL from operational databases, you have to integrate with APIs. And once you bring all that data in, you have to run analytics at scale, not only within a single Jupyter container or notebook. You want to scale out the processing of the data. And once you’ve done that, you run training and serving, and all of that pipeline. And Spark is very instrumental in processing data at scale. But at the same time, it needs to coexist with other elements of the stack, as we’re going to present. Then, as part of that, we want to move into a microservices architecture and use Kubernetes as sort of the baseline for that. When we look into the entire pipeline for machine learning or AI, we have different steps.

Spark Help Us Scale ML Pipeline

We have the step of data ingestion, followed by preparation, which is joining, aggregating, splitting data and turning it into meaningful features. That is followed by training, where you want to run various algorithms, various training models, on the same data set to generate the model. And once you’ve run the training, you want to move to a validation step: take some portion of the data, run the model against it and verify that you have decent accuracy. And finally, you want to deploy those models alongside the APIs that come together with them for the feature vectors. If we’re looking into this entire pipeline, we can see that Spark can play a major role in some of those steps, you know, especially around data ingestion, data preparation, and analytics at scale. At the same time, we have many other frameworks in the industry around Python packages and libraries like Scikit-learn and TensorFlow and others that may work as a microservice. So the overall architecture should be one where you have sort of a data lake or data sets running on a shared medium, you know, a cluster file system, HDFS, managed databases, whatever. And on top of it, you need to have those microservices. Some portion of the microservices will be Spark jobs or Spark-based containers. And some others will be things like API functions and model serving functions, where Spark may not be the best tool for the job. So, essentially, we wanna create a single cluster and have those two types of workloads in a single environment.

Why ML Pipeline

This is really where things like Kubernetes come into play.

Why Spark on Kubernetes?

So, you want to have one cluster, where you have a single framework for scheduling different workloads. And Kubernetes is quite generic, unlike things like Hadoop, which is very specific to big data workloads. It’s also pretty secure and it has a very vibrant community, which allows you to leverage all the innovation that’s been done there in recent years.

Goodbye Hadoop, Hello Cloud Native

So, again, we want to move away from the Hadoop architecture, which is sort of stagnant and is not really cloud native. What does cloud native mean? It means essentially being able to run workloads in a more agile architecture, as microservices: if they fail or need to scale, that is done automatically, using containers versus virtual machines or non-isolated workloads on top of Hadoop. So this is what we want to move into: an architecture where the data layer, instead of being a file system like HDFS with all of its limitations, which is very exhaustive of resources, et cetera, becomes a managed storage or database framework, where you store things in an object store in the cloud or in managed databases. In the next layer, scheduling, we wanna move from a one-trick pony like YARN into something very generic that can run any type of workload and has a vibrant community, as we discussed, which is Kubernetes. And instead of running middleware that is very specific to big data workloads, we want to run all the different middleware, all the different services, on the same cluster and not have to manage those frameworks independently. And then we can focus on running our business logic and not focus too much on managing those clusters.

Spark on Kubernetes

So what is Spark over Kubernetes, and how does it work? In a minute, we’re going to see a live demo by Marcelo. The general idea in Kubernetes is that everything is a container, or a pod, which could be a set of containers. So we essentially want to go and launch such a container. And in Spark, you know, we may have different types of containers: we may have workers, or executors, and we wanna have a driver, which is the thing that launches the job and manages the lifecycle of the job. So, the first step would be to talk to some service on the cluster. We’re gonna see in a minute how it’s done, because there are different ways of doing it on Kubernetes. And this service is going to essentially go and spawn the driver and the workers, assign resources like storage, configurations, and CPU or GPU resources to those workloads, and launch them. It also manages the lifecycle of the jobs while they’re running. So for example, if something fails, it will automatically restart the resources. If there aren’t enough resources, it will essentially notify us, et cetera. So what Spark over Kubernetes provides us is a mechanism to launch Spark-based containers, with the orchestration associated with them, pushing the drivers and the workers and potentially other services, for shuffling and others, onto the same cluster, and then managing the lifecycle of the job. Now with that, there are essentially different ways to skin the cat, as you say. There are three different mechanisms, and Marcelo is going to cover those as well as demonstrate how it works.

How to run your Spark job in Kubernetes?

– Hi, my name is Marcelo Litovsky, I’m a Solutions Architect with Iguazio. I’m going to continue on what Yaron was talking about, as far as how to run Spark on Kubernetes. There are different modes of running Spark on Kubernetes. But they all require, first of all, that you have a Kubernetes cluster up and running, that you have the right permissions and that you have the right components deployed to that platform. So the first way of running a job in Kubernetes with Spark is where your driver runs outside of where the rest of the Spark cluster is running. So your driver will run in a container or on a host, but the workers will be deployed to the Kubernetes cluster.

There are advantages and disadvantages to each of the three, but it all depends on your architecture and how other applications in your environment are running.

In the second case, you are submitting the Spark job, but even the driver runs inside the Kubernetes cluster. So you have everything running inside the cluster. You still have to interact with it; there are dependencies as far as being able to interact with your Spark jobs. So in the first two, there are a lot of things that happen outside of the Kubernetes cluster that you need to be aware of, and access that you need in order to monitor the job and understand what’s going on. And the last option is to run it using the Spark operator for Kubernetes. An operator in general in Kubernetes has the default template of resources that are required to run the type of job that you requested. In this case, it’s an operator for Spark. So it knows when you deploy your application that it needs to deploy a driver and that it needs to deploy workers. It knows, based on your specification, how many replicas of the workers it’s gonna deploy. And then you can let Kubernetes run that cluster and use the Kubernetes tools to monitor how it’s running.
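As a rough illustration of the first two modes, the sketch below builds a `spark-submit` command line for client mode (driver outside the cluster) and cluster mode (driver inside the cluster). It is not taken from the talk: the API server address, image name and script path are placeholder assumptions, and only the flag names (`--master k8s://...`, `--deploy-mode`, `spark.executor.instances`, `spark.kubernetes.container.image`) come from Spark's documented Kubernetes support.

```python
from typing import List


def spark_submit_args(api_server: str, image: str, app: str,
                      deploy_mode: str = "cluster",
                      executors: int = 2) -> List[str]:
    """Build a spark-submit argument list that targets a Kubernetes cluster.

    deploy_mode="client"  -> the driver runs where spark-submit runs (mode 1).
    deploy_mode="cluster" -> the driver runs in a pod on the cluster (mode 2).
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",      # Kubernetes API server URL
        "--deploy-mode", deploy_mode,
        "--name", "spark-pi",                   # hypothetical application name
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app,
    ]


if __name__ == "__main__":
    # All values below are placeholders, not values shown in the demo.
    args = spark_submit_args(
        api_server="https://kubernetes.example.com:6443",
        image="example/spark-py:latest",
        app="local:///opt/spark/examples/src/main/python/pi.py",
    )
    print(" ".join(args))
```

In cluster mode a `local://` path refers to a file inside the container image, while in client mode the script is read from the machine that runs `spark-submit`, which is one of the differences to keep in mind between the two modes.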

Comparing Modes

So like I mentioned earlier, there are three different modes of operation.

When you’re working with the driver running outside of the cluster, the main thing is that you have to be aware of the communication between your workers and your driver. You also need access to Kubernetes to be able to deploy the resources. So you have to have what is called a Kubernetes config file that determines who you are. It’s kind of like a login to the cluster, and it will allow you to connect and deploy the resources. In the first case, like I said, the driver will run in your container or on another host, but the workers will be running on the Kubernetes platform. So the first two are very similar. You know, there are a lot of things that are outside of the platform that you have to be aware of and that you have to manage outside of Kubernetes. The last one is a little bit different: when you deploy a Spark job, the service account that is associated with that job, and I’ll show you that in the demo, will have the ability to do the full deployment. And once you let that job start, you can interact, like I said, with Kubernetes, with the commands that I’ll show you in a couple of minutes, to monitor the job, to understand if errors are happening, to look at the logs, and to communicate with the Spark cluster and the UI itself. So there are lots of components that are built into that Spark operator for Kubernetes. So now I’m gonna give you a quick demo of doing things with the Spark operator.

Yaron, let’s make sure I share my screen.

Alright, so the first thing like I mentioned is that you have to have this Spark operator running on Kubernetes. I already have it installed, so if you use Helm, which is the package manager for Kubernetes, you’ll see that I have a version of the Spark operator running in my environment. I can look at any applications that have already executed.

So I can look at what’s already run in this cluster, and I can deploy new jobs. The way I will do this is that I have the YAML that defines how you deploy that job into Kubernetes.

Actually, this is part of a GitHub project that we’ll share shortly. So first of all, I have my Python script. I’m using examples from the Apache Spark project, so this is just a simple PySpark script that does a basic calculation. And I have the YAML that is gonna give the instructions to the Kubernetes cluster on how to run that script. There are a few things that you have to be aware of when you’re looking at this YAML and start building. First of all, the version of the operator that you’re working with has to match what you are deploying. So this is the version of the operator that I’m running. Then there is the image that you use. In this case, I’m gonna run a PySpark script, so this is a Docker image based on PySpark. If you’re using Scala, you’re gonna use a different image. This image also has to be built so that it has all the dependencies that each of the workers will need to execute. Now you can start like I did, using the images that are already provided by the Apache Spark project, but you can build your own based on that image. Just to make it easier, you can add your own packages to that image. I’m also taking my application from elsewhere: the actual script is not part of the image, I’m going to pick that up from a GitHub project.
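The YAML from the demo is not reproduced in the transcript, so the following is only a minimal sketch of what a `SparkApplication` manifest for the Spark operator might contain, built as a Python dictionary and printed as JSON (which `kubectl apply -f` also accepts). The field names follow the operator's `v1beta2` API as I understand it; the application name, image, script location, Spark version, service account and resource sizes are all assumptions to replace with your own values.

```python
import json
from typing import Any, Dict


def spark_application(name: str, image: str, main_file: str,
                      executors: int = 1) -> Dict[str, Any]:
    """Return a SparkApplication manifest for a PySpark script (sketch only)."""
    return {
        # Must match the CRD version served by the installed operator.
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Python",                  # "Scala" for a Scala job
            "pythonVersion": "3",
            "mode": "cluster",
            "image": image,                    # image carrying Spark plus dependencies
            "mainApplicationFile": main_file,  # where the script is picked up from
            "sparkVersion": "3.1.1",           # assumed; must match the image
            "restartPolicy": {"type": "Never"},
            "driver": {
                "cores": 1,
                "memory": "512m",
                "serviceAccount": "spark",     # assumed service account name
            },
            "executor": {
                "cores": 1,
                "instances": executors,
                "memory": "512m",
            },
        },
    }


if __name__ == "__main__":
    manifest = spark_application(
        name="pyspark-pi",                          # hypothetical name
        image="example/spark-py:3.1.1",             # hypothetical image
        main_file="https://example.com/pi.py",      # hypothetical script URL
    )
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from code is only one option; the same structure can be written by hand as YAML, which is what the demo does.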

I can also run it locally. Now the caveat there is that you have to be aware that when this container executes on Kubernetes, it does not know anything about your local file system. And so you have to understand how to mount the required file system in order to access a file that exists elsewhere, that is not inside that container. The way this YAML was built, and because I’m running on a single-node Kubernetes cluster, I’m able to mount my temp directory inside that container. So anything that I put in that temp directory will be available to Spark as the application runs. I’m also defining the limits of the resources that I’m going to consume. If you’re used to YARN as a resource manager, you know there’s a lot of complexity to managing YARN and managing permissions and resource allocation. Kubernetes makes that a little bit easier. So as part of the spec of running your application, you can define how many cores and how much memory you want, and if you need to use GPUs in your process, you can also allocate GPUs to that container. You know, this is as long as your Kubernetes admin allows you to request those resources; there is some configuration on the Kubernetes side that limits what you can access. So I have a definition of what resources the driver and the executors (I’ve been calling them workers, but they are executors) are gonna request, and the file system that I’m gonna mount. In this case, and this is a bad practice for production, I’m mounting a local file system, just to do the testing. And then I also specified at the top the script that I’m going to execute, which is going to come from a Git repository. Just to show you, I also have another YAML that I altered slightly to pick up the Python file locally from that temp directory instead of from Git. And I’m able to do this because I mounted that file system inside the container.
So whatever is in the temp directory on my node is gonna be available, and I’ll be able to execute that application as well. So in order to run this application, I have the YAML. What you have to do is tell Kubernetes to deploy the application. Now I have run the application, and this is gonna end very quickly; it’s deployed to the cluster. Now it’s running in the background and, obviously, you need some way of interacting with the cluster to know what the job is doing and how it is operating. So the first thing is what I showed you earlier: you can run a get on the Spark applications. So we know that the one that I just executed should be here, or is probably still deploying.
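The local-file variant described above can be sketched as a small helper that adds a `hostPath` volume to a manifest and mounts it into both the driver and the executors. This is an illustration under assumptions, not the YAML from the demo: the volume name, the `/tmp` paths and the script name are hypothetical, and depending on the operator version, volume mounts may require the operator's mutating webhook to be enabled.

```python
import copy
from typing import Any, Dict


def with_host_path(manifest: Dict[str, Any], host_dir: str = "/tmp",
                   mount_path: str = "/tmp",
                   volume_name: str = "test-volume") -> Dict[str, Any]:
    """Return a copy of a SparkApplication manifest with a hostPath mount.

    Mounting a node directory is fine for a single-node test cluster but,
    as noted in the talk, it is a bad practice for production.
    """
    out = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    spec = out.setdefault("spec", {})
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "hostPath": {"path": host_dir, "type": "Directory"}}
    )
    for role in ("driver", "executor"):
        spec.setdefault(role, {}).setdefault("volumeMounts", []).append(
            {"name": volume_name, "mountPath": mount_path}
        )
    return out


if __name__ == "__main__":
    # Minimal stand-in manifest; a real one needs image, type, and so on.
    base = {"spec": {"mainApplicationFile": "https://example.com/pi.py",
                     "driver": {}, "executor": {}}}
    local = with_host_path(base)
    # Point at the script in the mounted directory (hypothetical file name).
    local["spec"]["mainApplicationFile"] = "local:///tmp/pi.py"
    print(local["spec"])
```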

Then I can look specifically at that application, at the pods that the application is running. And you can see my driver and my worker; I had one driver and one worker, and the worker should be coming online shortly. So those are running. Now, as you start doing development, you’ll likely run into some errors. Some things are not gonna be available, some things are gonna be missing. So you want to be able to interact with this application and understand what’s going on, and you can use the Kubernetes commands. There’s a Kubernetes command called logs that will let you look at the logs of the driver, so you can see exactly what happened. If there was an exception or an error, you will see it here. And you can also see the output of the script. I believe that the one that I just triggered is still executing. If your job is executing, you can do a dash f, which actually tails the execution log, but I guess by the time I did that it had finished already. And as you look through the output, you should see the result of the calculation. It is a lot of output, some of it debugging the application.

That I just, I’m missing it, because it’s only a single line of output.

I’m gonna do something much easier which is grep, so we can see. So basically, this is the output of my script.

The version that I have for the local file system gave me a more interesting output, but I ran the one that is on GitHub, so it has a different layout of the output. So let me just do a recap of how to look at your application and how to troubleshoot, and close the demo with that.

Get Spark applications will give you a list of all the applications that are running. This is the application that was executed earlier. If you want to look at the pods based on an application name, you can look at the pods with get pods, and I’m just gonna do a grep for the application that I’m running. And the last thing, and the most important, is to look at the logs to see the output.

Now when you’re done and you want to clean up, the last thing to do, if you want to keep your cluster clean and not have all those applications laying around, is to delete the application. Before, with the first command, I did an apply with the YAML, which created the application and ran the code. Now I do a delete (that was a typo), and my application will no longer exist. So when I do a get Spark applications, my application no longer exists in the cluster. So this is how you handle running a Spark job on Kubernetes using the YAML and the kubectl commands. This requires a good amount of understanding of how Kubernetes works. You know, you have to make sure that your YAML is formatted properly for that version of the Spark operator. You have to make sure that the location of your Python script or your Scala script is available to the containers when they execute. That means that you also have to make sure that you can correctly mount whatever file system resources you need at runtime. With that, that was the end of the demo.
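The demo's command sequence (apply, get Spark applications, get pods, logs, delete) can be collected into one helper, sketched here as argument lists suitable for `subprocess.run`. The exact commands typed in the demo are not in the transcript, so this is a reconstruction under assumptions: the application name, manifest file name and namespace are hypothetical, and the `<app-name>-driver` pod name assumes the operator's usual naming for driver pods.

```python
from typing import Dict, List


def kubectl_lifecycle(app_name: str, manifest_path: str,
                      namespace: str = "default") -> Dict[str, List[str]]:
    """Return kubectl commands for the lifecycle of one SparkApplication."""
    ns = ["-n", namespace]
    driver_pod = f"{app_name}-driver"  # assumed driver pod naming
    return {
        "deploy": ["kubectl", "apply", "-f", manifest_path] + ns,
        "list_apps": ["kubectl", "get", "sparkapplications"] + ns,
        "list_pods": ["kubectl", "get", "pods"] + ns,  # filter by name, e.g. with grep
        "follow_logs": ["kubectl", "logs", "-f", driver_pod] + ns,
        "cleanup": ["kubectl", "delete", "sparkapplication", app_name] + ns,
    }


if __name__ == "__main__":
    # Print the commands instead of running them; no cluster is required.
    for step, cmd in kubectl_lifecycle("pyspark-pi", "pyspark-pi.yaml").items():
        print(f"{step:12s} {' '.join(cmd)}")
```

Returning lists rather than shell strings avoids quoting problems if the commands are later passed to `subprocess.run`.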

I’ll stop sharing my screen.

Let me go to the next slide.

I do have a repo with the demo, so that you can reproduce this on your desktop with Docker Desktop for Windows or for Mac. I did this repo last year for our MLOps conference and it probably needs a little tuning. Before you watch this video, I probably will go in and make the necessary changes so you can execute it on your laptop as well.

– Okay, thank you, Marcelo, for a great demo. And let’s continue and see how potentially we’re going to improve the experience through some additional tools.

DevOps Challenges Remain

Because, as you see, some of the challenges around DevOps still exist. We need to configure YAML, and it’s not so trivial for someone that just wants to launch a job. We need to configure the resources and security attributes and volume mounts and all of that, and package dependencies. So there are still a bunch of things that we need to address: also scaling and scheduling of the workloads in a slightly more automated fashion, and how do we integrate with other frameworks, how do we embed various packages into the image that we wanna launch? Otherwise we need to build images every time we launch. So there are still a lot of challenges that force us to spend a lot of DevOps resources. And this is where we’re trying to improve the experience with some extra functionality that we’ve added here at Iguazio. So the first idea is, we said, you know, how do you automate DevOps today? And the most automated fashion of DevOps today is the concept of serverless functions. If you go to, for example, Amazon, you have Lambda functions: you write some code, you click, and now you have a function that runs in production and responds to HTTP requests, okay? So no one needs to build containers, no one writes any YAMLs like Marcelo showed us, you just write the code and you deploy it. But the key challenge is that while serverless is very friendly, with very nice resource elasticity (you don’t use it, you don’t pay for it) and automation of the entire lifecycle of development and operation, it’s still not a good fit for data-intensive and machine learning workloads, because there are different attributes to those workloads and to serverless functions. If we examine existing serverless technologies today, they have a very short lifespan, you know, after five minutes your Lambda crashes, and you may have jobs that run for 30 minutes for training. 
When it comes to scaling your workload, serverless technologies usually do that using an API gateway and a load balancer that throws different requests to different containers. Well, if you’re thinking about Spark and other technologies like Dask or Horovod or other machine learning and data analytics frameworks, they’re using other strategies, like shuffle, reduce and RDDs in the case of Spark, or potentially hyper-parameter tuning with some other frameworks. So we need a different approach to scaling. And also, in most cases, it’s stateful, because we’re processing data. Serverless functions are usually stateless, and you attach to resources through some HTTP calls or S3 calls, whereas here you want to be able to access data from your function in order to process and analyze it, not necessarily through an HTTP hook. And the last thing is that when you’re running a job, you’re not passing a JSON event as the request, like in serverless functions, which are typically RESTful. What you wanna pass is a set of parameters for your job, you know, data sets: this is the input data and this is the query I want to run on the data, or this is the input data and this is the model I wanna train with those parameters. So there is a different input. So with that we understand we cannot just use a serverless technology that was invented for event-driven or real-time workloads, like Lambda, or in our case, an open source project called Nuclio, which we maintain. We need a different story. We need to extend serverless and still maintain the inherent benefits of elastic scaling, not paying for things that we don’t use and automating the DevOps work, but we wanna apply it to the problem at hand. So we want to potentially create, think of it as, a serverless Spark, or a serverless Dask, or a serverless Horovod. 
And this is really the idea behind what we call Nuclio ML functions, in combination with a framework called ML Run, which is automating this work. So we’ve seen that within Kubernetes there are things called CRDs or operators. They know how to take code and a spec, usually described through a YAML file. They take those two things, they run a job, and then they finish, okay? So we understand that there is your code, or a service that runs on something, and there’s what we call a runtime, which is this operator. And we could have different runtimes: one for Spark, one for just regular jobs, one for Dask if you’re sort of more into the Python camp, Horovod for deep learning, Nuclio for real-time functions, et cetera. So there are different types of resources that embody the code. We want to create a function that comprises those two things, the specification of the resources along with the code, and abstract it in a certain way. And then we want to define the contract between those functions and the user. What does the user wanna do? He wants to pass data, parameters for the job, and potentially secrets and credentials, because if my function is going to access some external database, I may need the password, and I don’t wanna pass the password in clear text. So I need a mechanism for passing credentials into my code. And once the function has finished executing, it’s gonna generate some outputs. It’s gonna generate results, you know, for example, the accuracy of my model. It’s going to generate operational data like logs, telemetry data, monitoring, CPU usage, et cetera. And it’s going to generate new data sets. Like, you know, I’m running analytics, I have a source data set, and then I have the analyzed data set with the join results or whatever. Same for models: you know, I’m running a training on some data and I generate a model. 
So what we want to define is such a box, a virtual box, where we throw in parameters and data and secrets, and we get back results, logs and artifacts through a simple user interface. And we also want to be able to define those functions as part of a bigger story, a pipeline. So for example, we may have a function that ingests data, then another function that prepares the data, a third function that runs training on the prepared data, and a fourth that does validation, et cetera. So we need a simple way to essentially plug those functions into such a generic architecture. And this is really where we invented the concept called ML Run, which is automating this entire thing: no YAMLs, you’re only focusing on simple code, and everything that you see here is fully automated.
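To make the "box" contract concrete, here is a plain-Python sketch of functions that take parameters and input data and return results, artifacts and logs, chained into a small pipeline. It does not use the ML Run or Nuclio APIs; `RunResult`, `run_pipeline` and the toy steps are hypothetical names invented for illustration, and secrets handling is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class RunResult:
    """What comes out of the box: results, artifacts (data sets, models), logs."""
    results: Dict[str, Any] = field(default_factory=dict)
    artifacts: Dict[str, Any] = field(default_factory=dict)
    logs: List[str] = field(default_factory=list)


# A step receives (params, inputs) and returns a RunResult.
Step = Callable[[Dict[str, Any], Dict[str, Any]], RunResult]


def run_pipeline(steps: List[Step], params: Dict[str, Any],
                 inputs: Dict[str, Any]) -> RunResult:
    """Run steps in order; each step sees the artifacts produced so far."""
    total = RunResult(artifacts=dict(inputs))
    for step in steps:
        out = step(params, total.artifacts)
        total.results.update(out.results)
        total.artifacts.update(out.artifacts)
        total.logs.extend(out.logs)
    return total


def ingest(params, inputs):
    rows = list(range(params["rows"]))  # stand-in for reading real data
    return RunResult(artifacts={"raw": rows}, logs=[f"ingested {len(rows)} rows"])


def prepare(params, inputs):
    evens = [r for r in inputs["raw"] if r % 2 == 0]  # toy "feature" filter
    return RunResult(artifacts={"prepared": evens}, logs=[f"kept {len(evens)} rows"])


def train(params, inputs):
    data = inputs["prepared"]
    mean = sum(data) / len(data)  # toy "model": the mean of the data
    return RunResult(results={"mean": mean}, artifacts={"model": {"mean": mean}},
                     logs=["trained model"])


if __name__ == "__main__":
    final = run_pipeline([ingest, prepare, train], {"rows": 10}, {})
    print(final.results, final.logs)
```

Because every step has the same signature, a scheduler could run each one in its own container; here they simply run in-process.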

Serverless Spark ML Function Example: Run locally (in the notebook or attached Spark service)

So for example, when I have my function, I can write something inside a notebook, and we’re gonna see a demo in a minute with Marcelo here. So I can take some code and define a function; the function includes the code, and also includes some definitions of resources like CPUs or jars and things like that. So I essentially create a Python object that encompasses all the configuration. This is what we call the function, alongside the code, which is, you know, a pointer to where I’m storing my file, and I can store my files in various (voice drowns). And we can use different execution engines; as Marcelo showed us, there are three ways of using the cluster, and each one has some slight advantages and disadvantages. I may want to run Spark locally within my Jupyter notebook, or maybe against a live cluster that is always up and running, not sort of decommissioned when I’m not using it. But at the same time, I may wanna run exactly the same job against a cluster that is built ad hoc for my job, where I have a way to customize the image and the Spark version that I’m gonna use and all the packages or jars I need for that job, which provides me a lot more power. And it’s only going to be used while I’m running the job, so it saves resources. On the other hand, there is some downside: it takes some time to spawn this cluster. So if we have a very short task, we may not want to do that. So with this mechanism, I can run exactly the same code without modifying too much, just a few configurations: I can run the same code locally inside my notebook, and the same code, with a slight configuration change, on the cluster in a distributed fashion. And we’re gonna see that in a demo in a minute.
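One way to picture "same code, slightly different configuration" is to keep the job itself fixed and switch only the Spark settings. The sketch below is not how ML Run does it; it is a generic illustration using standard Spark configuration keys (`spark.master`, `spark.kubernetes.container.image`, `spark.executor.instances`), with a hypothetical `spark_conf` helper and placeholder server and image values.

```python
from typing import Dict


def spark_conf(target: str, api_server: str = "", image: str = "",
               executors: int = 2) -> Dict[str, str]:
    """Return Spark settings for running the same job locally or on Kubernetes."""
    conf = {"spark.app.name": "pi-job"}  # hypothetical application name
    if target == "local":
        # Everything runs inside the notebook's own process or container.
        conf["spark.master"] = "local[*]"
    elif target == "kubernetes":
        # Executors are created as pods on the cluster for this job.
        conf["spark.master"] = f"k8s://{api_server}"
        conf["spark.kubernetes.container.image"] = image
        conf["spark.executor.instances"] = str(executors)
    else:
        raise ValueError("target must be 'local' or 'kubernetes'")
    return conf


if __name__ == "__main__":
    print(spark_conf("local"))
    print(spark_conf("kubernetes",
                     api_server="https://kubernetes.example.com:6443",  # placeholder
                     image="example/spark-py:latest"))                  # placeholder
```

With PySpark installed, each key and value could be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`, leaving the job code itself unchanged.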

KubeFlow + Serverless: Automated ML Pipelines

The other point that I mentioned is that we want to compose a workflow from all those different function instances. And the best solution in Kubernetes to run a workflow is something called KubeFlow, meaning Kubernetes and workflow. KubeFlow Pipelines is a specific open source project, initiated originally by Google and other companies, and you can essentially just plug different steps into a pipeline and say, “You know what, the first step is getting the data, the second is preparing and training and so on.” So we can use this tool, which is great. It also knows how to record information about artifacts and logs and other things. And we couple it with another tool called ML Run, developed by Iguazio and open source, that allows me to track everything in a very easy way: no YAMLs, no kubectl commands, just everything through the UI or notebooks, and we’re gonna see that in a minute. So introducing serverless into a pipeline allows us a very simple composition of such a workflow, because every step in that pipeline can be a self-managed function, with its own lifecycle and versioning, and images and definitions that we define only once; we store it, and then we plug it in when we need to use it. Another thing is that we can take a single workflow and combine jobs where some of them are simple Python jobs, some of them are, you know, maybe Go code or Java code, some of them are Spark jobs, some of them are maybe TensorFlow serving, et cetera. So we can combine all of those different things into a single pipeline, unlike solutions like Hadoop, where we’re very confined to just a set of microservices for big data.

Fraud Prevention Case Study Payoneer

One example of a solution that is deploying this technology in production is a company called Payoneer. That’s one of our customers, and there is a public use case around what they’ve done. They are a unicorn in payment services. They originally had Hadoop clusters, and they wanted to achieve two things. The first thing is to move from fraud detection into fraud prevention, which is very critical for their business, because if they can prevent fraud, it impacts their bottom line: they can make more money, they can take on more risky propositions. And the second thing they wanted to solve is to cut the time to production. Instead of having very long release cycles and investing a huge amount of resources, they want to be able to just update things and launch into production in a more CI/CD fashion, with an agile mechanism that’s enabled through the microservices architecture.

Traditional Fraud-Detection Architecture (Hadoop)

So the first solution that they had, based on Hadoop, was too slow and very cumbersome. So you know, for example, in a typical ETL workload, you take data from the SQL database, you run ETL on it using Spark, you run some aggregations and some processing on that, and then they used an R server to predict. This entire process, from the ETL job to prediction and blocking the customer from fraudulent transactions, was taking 40 minutes. 40 minutes is a pretty long time, and it allows you to steal a lot of money if you want to. So that’s not an ideal solution. The next thing is that managing Hadoop and Spark in this cluster silo is very resource intensive, as many of you know. And it takes a lot of time to productize a new application or a new version of that application. So with that, they wanted to move to this architecture, which is entirely serverless and fully automated.

Moving To Real-Time Fraud Prevention

So they achieved two things: one is that everything is becoming real-time, and the second is fully automated DevOps and deployment of software. So they can deploy a lot more versions of the software, actually on a weekly basis; every time they wanna change the model, every time they wanna change the logic, it’s very easy to do. And at the same time, they managed to get real-time performance with this microservice architecture. So, in the example, now instead of doing ETL, you’re just going to ingest the data through CDC as a stream using RabbitMQ. The first stage is using Nuclio functions, which are serverless functions; they have native support for the RabbitMQ protocol, and they’re real-time. So they can essentially ingest the data, crunch it immediately and write it into a database in a sort of slightly cleaned-up way. The second step is a Spark function, which essentially runs as a job periodically, processing all the data that was ingested and combining real-time data with some batch data. The third step is the model training functions, which are simple Python functions with Scikit-learn in this case. Those functions essentially read the data from the database or the feature store, the data that was generated by Spark or by the real-time ingestion function, and they build the models out of that data. And they store the model back into the database or the system. And then you have model inferencing functions that actually listen on the live stream of events. Based on the enriched data that was generated by Spark and the real-time data that was generated through the ingestion function, they make predictions. And if they suspect that the transaction for that specific user is a fraud, they essentially just fire an event immediately into the database and change the user state to blocked.
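The last stage, an inference function that listens to events and flips the user state to blocked, can be pictured with a toy handler. This is purely illustrative and is not Payoneer's or Iguazio's logic: the event fields, the stand-in model, the 0.9 threshold and the in-memory `user_state` dictionary (standing in for the database) are all assumptions.

```python
from typing import Callable, Dict


def make_inference_handler(model: Callable[[Dict[str, float]], float],
                           user_state: Dict[str, str],
                           threshold: float = 0.9) -> Callable[[dict], float]:
    """Build an event handler that blocks a user when the fraud score is high."""

    def handle(event: dict) -> float:
        # Combine batch features (e.g. from the periodic Spark job) with
        # real-time features (e.g. from the ingestion function).
        features = {**event.get("batch_features", {}),
                    **event.get("realtime_features", {})}
        score = model(features)
        if score >= threshold:
            user_state[event["user_id"]] = "blocked"  # stand-in for a database write
        return score

    return handle


def toy_model(features: Dict[str, float]) -> float:
    """Stand-in for a trained model: flag amounts far above the user's average."""
    amount = features.get("amount", 0.0)
    average = features.get("avg_amount", 1.0)
    return 1.0 if amount > 10 * average else 0.0


if __name__ == "__main__":
    states: Dict[str, str] = {}
    handler = make_inference_handler(toy_model, states)
    handler({"user_id": "u1", "batch_features": {"avg_amount": 20.0},
             "realtime_features": {"amount": 5000.0}})
    print(states)
```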

So this entire process takes about 12 seconds from the minute we’ve got the event, and we pass it through some real-time analytics pipeline, until we actually block the customer. So it’s from the first indication of fraud. So most likely the user will not be able to steal money, because it’s going to instantly block any further transactions that they’re going to do on the system. And the other aspect, as I mentioned, is that it’s serverless with automated DevOps. And that means that you can do releases as much as you want. Everything is rolling upgrades without even taking the system down. All the monitoring, logging and telemetry is built into the solution; you don’t need to develop anything to achieve that.
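The stages described above (real-time ingestion, periodic enrichment, and inference that flips the user state to blocked) can be sketched in plain Python. This is only an illustration of the data flow, not the code from the talk: `UserStore`, `ingest`, `enrich`, `infer` and `toy_model` are hypothetical names, the store is an in-memory stand-in for the database or feature store, and the trained Scikit-learn model is replaced by a simple threshold rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class UserStore:
    """In-memory stand-in for the database / feature store (hypothetical)."""
    features: Dict[str, Dict[str, float]] = field(default_factory=dict)
    state: Dict[str, str] = field(default_factory=dict)


def ingest(store: UserStore, event: dict) -> dict:
    """Stage 1: clean the raw event and write its real-time features."""
    clean = {"user": str(event["user"]).strip(), "amount": float(event["amount"])}
    store.features.setdefault(clean["user"], {})["last_amount"] = clean["amount"]
    return clean


def enrich(store: UserStore, batch_avg: Dict[str, float]) -> None:
    """Stage 2: periodic job that merges batch aggregates into the store."""
    for user, avg in batch_avg.items():
        store.features.setdefault(user, {})["avg_amount"] = avg


def toy_model(feats: Dict[str, float]) -> bool:
    """Assumption: a fixed rule standing in for the trained model.
    Flags an amount more than 10x the user's batch average."""
    return feats.get("last_amount", 0.0) > 10 * feats.get("avg_amount", float("inf"))


def infer(store: UserStore, event: dict,
          model: Callable[[Dict[str, float]], bool]) -> bool:
    """Inference stage: score the event and block the user on suspected fraud."""
    is_fraud = model(store.features.get(event["user"], {}))
    if is_fraud:
        store.state[event["user"]] = "blocked"
    return is_fraud


store = UserStore()
enrich(store, {"alice": 50.0})
event = ingest(store, {"user": " alice ", "amount": "900"})
fraud = infer(store, event, toy_model)
```

In the architecture from the talk each of these stages is a separate function that scales on its own; here they share one process only to keep the sketch self-contained.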

So with that, let’s move into the second demo, showing more of a serverless architecture for Spark and KubeFlow.

– All right, so Yaron gave you an idea of all the other elements that have to exist to be able to incorporate this as part of a full ML application. We also talked about the complexity that it adds to do everything in YAML and kubectl, and then how we can do it using ML Run, which is a component that allows you to encapsulate all those resources that you need to run your PySpark application without having to worry about the YAML, the images and everything else that goes with that infrastructure. So I took the same code that I ran from the command line using the basic Spark operator for Kubernetes. I added some artifacts of ML Run to be able to keep track of what I’m doing. So I’m basically logging into ML Run, and you’ll see what this does when I execute it. So when you start wanting to keep track of your experiments, it’s good practice to log artifacts and log results, so you can keep track of everything that you execute. So in essence, this is the same script, except that instead of printing the output to standard output, I’m actually saving it to the artifacts database. And I’ll talk about that in a few minutes. Now, as a developer, you’re working on Jupyter notebooks. And you want to test the full extent of your application, from running it in your notebook to running it at scale in Kubernetes. So when you run your notebook, your notebook is a single container, and it has limits on how many CPUs and how much memory you have allocated. You might need GPUs and you don’t have GPUs, so there are some constraints. So how do you go beyond that environment to be able to execute your PySpark code? So ML Run gives you the ability to use the ML Run API and facilitate that communication with a Kubernetes cluster. So your script can run seamlessly in the Kubernetes cluster, but without having to exit your notebook.

So I have a few definitions that are similar to what you saw in the YAML, but now I’m using specifications. I’m using a language that I’m more comfortable with. You know, I’m defining where my script is, I’m defining how much memory and how much CPU I’m gonna use, how many replicas of the executor, how many executors are gonna exist.

I’m picking up the dependencies, the jars that I need to be able to run my code. So I am actually leveraging what I have in the context of my Jupyter notebook, which has all those dependencies.
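To make the mapping between these notebook-level definitions and the YAML concrete, here is a minimal sketch that turns the same settings (script location, memory, CPU, number of executors, jars) into a dictionary shaped like a Spark operator `SparkApplication` resource. The `SparkJobSpec` class, the image name and the paths are hypothetical, and this is not how ML Run builds its resources internally. The manifest field names follow the Spark operator `v1beta2` resource as commonly documented, and a real manifest needs more fields (for example the Spark version and a service account), so check it against the operator version you run.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SparkJobSpec:
    """Hypothetical notebook-side description of a PySpark job."""
    name: str
    image: str
    script: str
    driver_cores: int = 1
    driver_memory: str = "512m"
    executor_cores: int = 1
    executor_memory: str = "512m"
    executors: int = 2
    jars: List[str] = field(default_factory=list)

    def to_manifest(self) -> dict:
        """Render the settings as a (partial) SparkApplication manifest."""
        return {
            "apiVersion": "sparkoperator.k8s.io/v1beta2",
            "kind": "SparkApplication",
            "metadata": {"name": self.name},
            "spec": {
                "type": "Python",
                "mode": "cluster",
                "image": self.image,
                "mainApplicationFile": self.script,
                "deps": {"jars": list(self.jars)},
                "driver": {
                    "cores": self.driver_cores,
                    "memory": self.driver_memory,
                },
                "executor": {
                    "cores": self.executor_cores,
                    "instances": self.executors,
                    "memory": self.executor_memory,
                },
            },
        }


# Illustrative values only: the image, script path and jar are placeholders.
spec = SparkJobSpec(
    name="pi-job",
    image="example/spark-py:latest",
    script="local:///opt/app/pi.py",
    executors=3,
    jars=["local:///opt/jars/dep.jar"],
)
manifest = spec.to_manifest()
print(json.dumps(manifest, indent=2))
```

The point of the sketch is the direction of the translation: the developer edits a handful of typed fields, and the YAML-shaped structure is generated rather than written by hand.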

And then, instead of having to write the YAML, I use ML Run to define the simple things that I need to be able to execute this. First of all, it is a kind of Spark job. So this is gonna be a Spark job. It’s gonna use this image as a base image for the executor. I provided the location of this script, and as you can see, it’s gonna be local to the image. So I have to know that it is gonna be mounted in the location that I expect it to be mounted at. Which, you know, once you start learning Kubernetes, you see that there are multiple ways of doing that. Now with ML Run, we give you a few tools to do that without having to worry about that complexity. The first thing is to be able to mount the file system. So I can mount the file system with a simple command. We also have a function to mount a persistent volume claim. So you can mount a volume that is outside of Iguazio. So that guarantees that you have access to the file system the same way that you have access to the file system in your notebook, so that the container environment is gonna have a very, very similar path to everything that you’re running.

There are a few things that I need to specify to run within our environment, and then I define the limits, the requests, and the number of replicas, the same way I did in the YAML. But like I said, in this case, we’re using stuff that a developer will be used to and things that are friendly to a developer. And then I can-

– [Yaron] I think another point, Marcelo, potentially to add is that the function definition is reusable across jobs. So I don’t have to repeat the definition of this function and its replicas and CPUs. I do it only once, and then I execute it multiple times.

– Yes, so this object becomes part of that, and I’ll show this capability in a few moments when I show you what’s next in this notebook as well. So I’ve defined this and I run it. Now the interaction with the execution is within my notebook, but it’s running and it’s building the Spark cluster on Kubernetes, it’s running my code, and you will see the same output that we saw when we ran it using the YAML. So somewhere around here, I’ll see that Pi result.
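The point about defining the function once and executing it multiple times can be illustrated with a small, self-contained sketch. It is deliberately not the ML Run API: `JobFunction` and its methods are hypothetical stand-ins, and `run` only records what would be submitted instead of talking to Kubernetes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class JobFunction:
    """Hypothetical reusable job definition (not the ML Run API)."""
    name: str
    kind: str
    image: str
    command: str
    replicas: int = 1
    limits: Dict[str, str] = field(default_factory=dict)
    requests: Dict[str, str] = field(default_factory=dict)
    history: List[dict] = field(default_factory=list)

    def with_limits(self, cpu: str, mem: str) -> "JobFunction":
        """Set the resource limits once, on the definition."""
        self.limits = {"cpu": cpu, "memory": mem}
        return self

    def with_requests(self, cpu: str, mem: str) -> "JobFunction":
        """Set the resource requests once, on the definition."""
        self.requests = {"cpu": cpu, "memory": mem}
        return self

    def run(self, params: dict) -> dict:
        """Record one execution. A real runtime would submit the job to
        the cluster here; this sketch only keeps a versioned record."""
        record = {
            "function": self.name,
            "run": len(self.history) + 1,
            "params": dict(params),
            "resources": {
                "replicas": self.replicas,
                "limits": dict(self.limits),
                "requests": dict(self.requests),
            },
        }
        self.history.append(record)
        return record


# Define the function once (all values are placeholders) ...
fn = (
    JobFunction(name="spark-pi", kind="spark",
                image="example/spark-py:latest",
                command="local:///opt/app/pi.py", replicas=2)
    .with_limits(cpu="1", mem="1g")
    .with_requests(cpu="500m", mem="512m")
)

# ... then execute it several times with different parameters.
first = fn.run({"partitions": 10})
second = fn.run({"partitions": 100})
```

Both runs share the same resource definition and differ only in their parameters, which is what makes the per-execution history described later in the demo easy to compare.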

And as I keep mentioning, there is a lot of output on these jobs.

There you go. So I formatted that a little bit differently so it’d be highlighted. Now the other thing is, because I ran it within this framework, if you look at the bottom of the job, you have a definition of any input, any output, any artifacts that I might have recorded during the execution. I also have a link to look at the history and the execution of this job and any artifacts associated with it. So ML Run provides a UI that gives you access to the execution of that job. Here you can see the logs as well, so it’s the same UI that you can use to see the logs. There is a version for each execution of your job. So as you’re running the experiment, if you’re recording results, if you’re recording information about your execution, then you can look at the different versions of the jobs and the different parameters that you used. So by just wrapping that PySpark script using the ML Run libraries, you’ll be able to keep track of all these experiments. And more importantly, me as a developer, I’m not constrained to the resources that I have on my notebook. I can leverage the Kubernetes cluster seamlessly without having to worry about Docker containers or YAML that I have to build to be able to execute the job.

Now, I ran that as an independent job, but usually it’s not gonna be a single job. That is, there’s gonna be a pipeline or a sequence of events that you have to follow. You know, Spark might be doing the data processing, but you might have a TensorFlow job that is gonna do training on a model. So, we can take the exact same function, as Yaron mentioned earlier. I do not have to redefine how that is going to be executed. I use the KubeFlow DSL to define the pipeline. I take the same function as I defined it above, and now I can actually incorporate it into our pipeline. I can run the pipeline from my notebook. And at the end of the run, I’ll have an experiment link that will take me to KubeFlow. We have integrated KubeFlow with Iguazio. So you’ll see the execution of that Spark step as part of our pipeline, and you’ll be able to see the logs and all the execution as well. Now obviously this is a single step and there are more complex pipelines. I’ll show you another example of a pipeline that has a lot of the typical processing of a machine learning pipeline.

We’re looking at, you know, getting the data, training the models, providing you a summary of the data as you consume it, as well as, as part of the pipeline, deploying it in the process. And if your get-data or your data prep step was a Spark job, you can run that as part of the same pipeline, mixing steps that run Spark for Kubernetes jobs with TensorFlow and with other components of your pipeline, developed by different teams within the application. So the key thing for me as a developer is the ability to do all this using something that is very familiar to me, that puts together the full machine learning pipeline. We’re leveraging not only Kubernetes, we’re leveraging Spark for Kubernetes, we’re using KubeFlow to build pipelines, as well as, as Yaron mentioned earlier, serverless components to be able to do inference in real-time.

And with that, I’ll end my part of the demo.

– Thank you, Marcelo, that was a great presentation, a great demo. And I think what we’ve seen through Marcelo is how, you know, on one hand Kubernetes is great. As we mentioned before, it provides you a single cluster where you can run all your different workloads, all the modern workloads side by side with traditional big data applications like Spark, Presto, Hive. All those can run on Kubernetes. The challenge with running Spark on Kubernetes is that there’s a lot of manual work associated with that and a lot of DevOps work, messing with the YAMLs and kubectl commands, et cetera. And what we’ve seen is that through this serverless approach that we’ve introduced with ML Run, you have full automation, since you’re launching your job from within the notebook, together with its various configuration. Those functions, by the way, are also stored in a database. I can always pull them in later and use them, or I can link other functions from a database into a new pipeline that I’ve created. And you have all the logging of your outputs presented directly to you in the Jupyter notebook. So you don’t need to wander around with CLI commands, and it’s all recorded in this tool called ML Run, so that I can just go later and see what was running, what was the git commit version of the job that was running, what were the results, et cetera, for everything that happened. And I could launch jobs from the UI, and all those things will make my life much easier. So, with that, and again, thank you, Marcelo, for the demo. And everyone that wants to learn more about those technologies, about how to run Spark on Kubernetes or how to further automate that using serverless technologies, you know, just ping me or Marcelo. My Twitter handle is “@yaronhaviv”, my full name. You can look me up on LinkedIn or other places, and Marcelo also is very happy to help people.

About Yaron Haviv


Yaron Haviv is a serial entrepreneur who has deep technological experience in the fields of big data, cloud, storage and networking. Prior to Iguazio, Yaron was the Vice President of Datacenter Solutions at Mellanox, where he led technology innovation, software development and solution integrations. He was also the CTO and Vice President of R&D at Voltaire, a high-performance computing, IO and networking company. Yaron is a CNCF member and one of the authors in the CNCF working group. He has presented at various events including KubeCon + CloudNativeCon, Spark + AI Summit and Strata.