Over the last year, we have been moving from a batch processing jobs setup with Airflow using EC2s to a powerful & scalable setup using Airflow & Spark in K8s.
The increasing need of moving forward with all the technology changes, the new community advances, and multidisciplinary teams, forced us to design a solution where we were able to run multiple Spark versions at the same time by avoiding duplicating infrastructure and simplifying its deployment, maintenance, and development.
In our talk, we will be covering our journey about how we ended up with a CI/CD setup composed by a Spark Streaming job in K8s consuming from Kafka, using the Spark Operator and deploying with ArgoCD.
This talk aims to provide an holistic view: from the early and more rudimentary model – batch setup in Airflow & EC2 instances – to the current and more sophisticated one – a streaming with CI/CD in K8s.
Speaker: Albert Franzi Cros
– Okay, welcome everyone, I’m Albert Franzi, lead engineer Lead at Typeform, and here I will be talking about who are we using Spark streaming in Kubernetes with Algo and a SparkOperator. The thing now is that we will talking we’ll cover different points, we will start talking about the SparkContext, so where are we nowadays, what we have which tools we have in place, then we will cover about Kubernetes so when Spark met Kubernetes, when which tools provided what capabilities, then we will cover how to deploy it how are we deploying into Kubernetes at Typeform and finally we will share some learnings. But first of all, I would like to introduce myself, as I said I’m Albert Franzi Data Engineered Lead at Typeform I’m leading the Data Platform team a cross-functional team performed by eight people, but previously I was working at as a data engineer in different companies, now we can list them as Alpha Health, Shifts Carsified Media and Trovit Search. Also, if you want to read some of my articles, you can go to Medium to Albert/Franzi or Twitter, you can follow me @FranciCross, even that the content on Twitter is 50% Catalan, 50% English, but I’m sure that the English one, everyone can take advantage of it. Then about Typeform, where I’m working right now, Typeform is a global company focused on providing a meaningful conversations now through forms and surveys, it’s a good company to work on, we have the privilege to try new tools and new technologies, that’s why we were able to start playing with Spark in Kubernetes and also using Argo and different tools, so there is really a good company to work on and to test the latest senses once they are alive. But let’s focus on what matters, the talk itself. So spark the thing is that as you know, I don’t know your background not the ones that are listening, but I remember when I start playing with Spark, it was more or less around the versongui What the Sikhs, some years ago you were able not to play using the local local weaker in your machine. Then you were able to play using a standalone spark, no way you have some different workers, and then we were moving forward, and I remember it was inspired to that one or less, when we were started playing with a yarn in a, on premise cluster and using puppet to the project, so it was not the best experience, but still it was good to to learn and good to see all the challenges that was bringing a Spark to the community. Also, it was good to know that Amazon released the EMR option, the Elastic MapReduce capability, that which has one click, you could have a ready to go Spark cluster, now and recently seems a Spark two or four, we have the Kubernetes option where we can use different advantages that I will be talking on the following slides. But also let’s focus now on what we use nowadays or what we used in the past to execute spark, now in different clusters not in different environments. As you know, we can have that Luigi and Airflow both are really good orchestrators., now Luigi provided by Spotify Airflow by Airbnb, both open source tools, this is really good, they are really good, but from my experience, they are good for batch processing, you know when we went to run aspire jobs in batch, we are using grounds or different triggers, Now for spark streaming, it’s possible, but it’s not the best tool that we could use. Then I know that some people and more, all the school is using a batch, they have a lot of batch scripts to run their own pipelines, their own as part of pipelines, not also the best way, a bit old school school but we will respect it, and then there is some Aliens you know some weird people using Amazon Glue, Amazon Glue really really good to have a mega store on the cloud, five meter store, so you can like synchronize some partitions if you’re using Athena, if you’re using play store, spectrum different tools, but not the best tool to orchestrate your Spark jobs. And then the important thing here, what matters is a sparking Kubernetes note this is really, really an new tool, a new feature, and for us it was a change, no, it was another level, so let’s focus, let’s keep talking about this. So what happened when a Sparkle meet Kubernetes, let’s bring before some insights, you know, as I told about Amazon EMR. So what happens with Amazon EMR you know, first of all, when you try to play with EMR now, is that you have to wait. Now it’s the first impression that you look at, one example is that Spark 3.0.0, it was released on June July, and then we need to wait until September for Amazon to release this version, so if you are really attached to an EMR cluster, you have to wait until you can use it. It’s true that you have time to develop, you can start developing in your local machines and then once Amazon and release it, then you can start using it because your code is ready, but not the best approach. Also you have a Spark Fixed version per cluster, what it means is that I will talk later about this, but imagine that you have different projects on different teams, each team has their own velocity, so if you start with Spark 2.1 or 2.4, then it’s like, if you want to create a new cluster to start playing with a new version, you have to have two clusters, now that means more costs, now you are paying two clusters at the same time because you are not capable to move all the previous work to the new portion, so that means more money and it means more maintenance, and no DevOps people, keeping an eye on the cluster and the kind of stuff. Also, as everyone is aware, having an EMR cluster, it means that you pay for the hours that the cluster is up not for the usage itself. It’s true that you can launch a new EMR cluster and then destroy it once you are done with your job, but even doing this, you have to pay these 15 minutes or 10 minutes that it takes to be ready for, now that’s also one setback from using EMR and also one important one is having the same IAM role shared across the entire cluster. It’s true that even having the same IAM role you can manage to use other roles, not assuming a role before you launch a new job or using Amazon key is not for its role. But what it means is for security reasons, even that you have different IAM roles, while you are assuming different roles in the same cluster, it means that other jobs can also assume the other roles, and so it’s not the best approach. One example here, it will be imagined having a machine learning team, having a finance team and having a data platform team, they are touching now different levels of sensitivity of data, maybe one is touching all the invoices of your company, everyone is touching I don’t know imagine health records and the other one is just handling normal data, I am usually behavior data from tracking. You don’t want to use the same role because you want to be sure that each team the only has access to her own data. So you will end up having three different EMR clusters, one per row, and then imagine that they have different inspirations, then it’s three clusters and then multiply these for each spiration that you want to have into production, not the best environment, as I said, you know so here is when we are start playing with Kubernetes nigh, We were really lucky as I stated, to be able to investigate, research and at the end, put it into a practice and production, this solution known as sparking Kubernetes. And the good part about the Kubernetes, you can have multiple spirations running in parallel, you don’t need to be fixed in one, you can have anything that you want, even if you want to try to run a version of 1.6 you could maybe only using a standalone approach, not the cluster approach, but even that you will be able to do it. Then one important thing is use what you need share what you don’t, you know as I said Kubernetes cluster has different notes and then you request on demand, if there are notes that are available, you will be able to use it, if not other people or other needs we’ll be using these resources. One thing on this case that we are experiencing right now is that we have some notes with APIs, and then we have some notes with a Sparkle bots, also one point that you can benefit take advantage of Kubernetes is that you can use node affinity, so you are able maybe to configure different and dispatch jobs inside the same note affinity that is some API’ so they are closer, so you know that the latency between dispatch jobs and the APIs that you need to retrieve some information, imagine matching models, it’s going to be lower, so it’s going to speed up your processes. Also, you can have a different EMR roles per service account year SA it’s called, I will talk later about this topic also, that means that as I was saying before, you can have finance machine learning and data platform running in the same corner, this cluster there, and be guaranteeing, note that each team is using their own role, and they are only accessing the data they should not other, so that can help on avoiding undecided usage of data between teams even that maybe they are the same company and we should trust you know or teammates, but it’s always to be sure that everything goes fine. Also, you can have different note types, as I said one thing that we have is that we identify that we have, we can have some small notes, provided by small machines, and that only when there is a huge big Sparks shop that needs to be run, then we can request a big machine to be included in the cluster. And also imagine that you can just in case of Amazon, you could have a M five machines not large, and then if there is a new, big job that enters to the cluster that doesn’t fit in one note that automatically it can request X large or two X large, you know, that’s really good because then it’s like, you can handle better the resources that you’re using in cluster and also at the end it’s money, so it’s better for everyone. And I define the best part also is defining your own dockers, you know I was saying, when you are using EMR, you are attached to the Amazon release, so you have to follow their rules, by using docker you can use whatever you want, you can be be determined your dockers, you can put some processes inside the docker and you can also so your own libraries inside it. Now imagine that you have a really big either no library that you don’t want to ship it in all your jobs, dispatch jobs, so you just ship it inside the docker, and then it’s usable by everyone, they don’t need to compile it inside, that could be an option, we are not doing this, but will be possible. So what do we need to use a Kubernetes? Now we need a cluster first of all, starting from 13, then an EMR IAM roles, here is what I said before, you’ll need, I will recommend to use ER say, IAM role service account, it’s possible to define some mappings between a service account and the role that can assume. So you’re gonna need to touch any Amazon keys or to assume any role before starting your job. Everything will be handled by Kubernetes, so with less things to worry about less namers, So that’s nice also. And one thing here to be able to use air I say inside, do dispatch jobs in Amazon, you will need to update the SDK version that comes with by default with three, head three to one, in that case, you will need at least the 788. This one is the one that comes with a web identity token credentials provider before it wasn’t, and the thing is that it includes the builder function, that it can be used with the credentials chain that a spiral will will be using to access the chat, I will explain why, and then the spark Docker you need to do at least I will recommend the 3.1 and a Amazon SDK Scala 212 dispatch three and Java eight. And then here I attach it, the Docker files that we’re using to build or own Spark image, feel free to use it, we believe that it’s at least less than you need, so which will be okay to use it in the production. So how do we deploy it? Now I explain some context and explain whole Spark MattKubernetes, now let’s go. How do we deploy this in an issue way in a friendly way to production. First of all, let’s introduce SparkOperator announced on project, developed it by the community, so I would like to first say thanks to all the people that is behind these awesome project, I believe that it’s a huge difference between what we had before, using a yarn, EMR, any resource negotiator, to be able to use a Kubernetes cluster now, without wondering, and with the capability to use different versions on the speed up or deliveries. So at the end, what it does is SparkOperator, it provides you the option to the final helm spec, where you define different properties, and the it takes charge of deploying it into the cluster and negotiate with their Kubernetes cluster, for the resources that you need to run this operator, this is a job. So here is an example of how it looks the spark application spec, now I just spit it into because it was a bit longer, but at the end, it’s quite easy to create and understand. There are more properties that you could add, but these ones are the minimum that you will need. There’s a page, I will share later the link that is cover all the specs that you can use to define it. Things that you have to keep in mind, the namespace, it’s hard quality in its bold, it’s highlighted in bold, the Namespace it will define the namespace of the Kubernetes, and also the service account, you know as I said before, you can use ER I say, IAM role service account mappings, so instead of using the default one, not the savings account default, let’s use the one that you need to run it, and also you can have like savings account machine learning, service account data platforms, service account finance, service account, whatever you have to define. Now, there is also some, you can define the image that you are going to use the base image of the Spark also the policy, the secrets, imagine that you have your docker, it’s not public, it’s not in the docker hub, You have the factory, Amazon Docker repository, you have anything, so you can specify and use and also you can define the secrets that you’ll need to go, not to use, to retrieve this docker. Also let’s check these kind of coumhg, as I’m saying, we need the web identify token provider, this one is the key for success success in type fab, there is different options, when it comes to use Spark Kubernetes you can deploy the jobs inside the docker, so every time that you build a document, you put the jar inside, so you know that this jar includes your jobs or the best, this is a lot of usage of artifactories, or the other option is you bloat your jars into S3, and then by using the Webby entity, you’re talking that this already integrated in the core of a spark, you can launch applications. So your docker is, your docker can never change, only if you want to take to the new passion of a spark and then you just put where is the jar located in S3? Then you can have different arguments and then define it the driver and the executer. If we will know, now to the operator, that’s either noble of schedule, you can schedule it. I was saying that they are in our path we were using Airflow and luigi, We are still using airflow, but also one capability that it provides is that you can say, okay, run every five minutes, and then you can define the concurrency policy, it’s like, hallo forbid or replace, hallo is that you can have multiple jobs running in parallel, imagine that you say every five minutes but usually take seven. So during two minutes you will have two, two jobs running in parallel, the forbidden it will forbid not to run the second until the first ends and the replace it will kill it, it will not allow to follow the, the, the execution. Then also you can have some, dress style policies in case that something goes wrong, you can say never, so that’s maybe the recommended one, when you are testing, don’t put always when you are testing, because I’m sure that you are going to do go to a crash loop on failure, that’s really good, you can define, okay restart after three, you know you have three life, every time that you crash, you lose one life, and at the end of the day, it just stop restarting or waste, You don’t get about the rester, if it fails, or if it succeed, you will start again. One advice here is that if it gets killed your job, try to monitor it, try to integrate it with it or data log or promifields or any other tool, because otherwise, maybe if you enter in a loop of crashes, then you will not notice because, and you will not receive any alert on that, so having alerts is a good recommendation. So how do we have it here? What is the flow that we are using a diagram we have, we make up a request once every, all the tests and everything is approved, we manage it, that obvious, the detects this mileage, I need it do, i does as with the assembly, it puts into S3, and then argo detects some changes in the specs, and then in the stars deploying it into Kubernete, there’s not this jar using the jar that is already in S3, and that’s all. It’s really simple at the end, it’s not difficult when I was preparing this talk, I was saying, Whoa, at the end, this is really easy it’s not difficult at all, but let’s talk about that. So we can, yeah deploy it manually, you know, I just play for two with automatically now manually, just when you are testing, you can do SVT assembly, it will provide the assembly share, then you copy it two or three, as I said and then you just run qctl applied on the JAMA that you defined it before that I explained in the previous slides. Once is done or when you are done, just delete it, always keep the cluster clean, don’t leave trash on the cluster. And what is that argo? Argo is really a declarative get ops, continuous delivery tool for Kubernetes, it’s really easy to use, it’s quite friendly, it’s more or less to what was a Spinnaker you know in the past, but I believe I was using Spinnaker and argo and I believe that argo is more fairly and easy to use. At least the, the learning group is faster. And at the end, it do the find some aspects and you kind of start now using it. Now that they’re in the only spec that you need to use is that one, not the spark application project, sorry article the application, and here you have all unit, you don’t need anything else. you can define some values, some the repo where you have the coat and argo will scan for changes, no one said detect some changes, then we will deploy the new aspects into production or development, now at the end this is how it looks, you have here we have right now, three aspike applications running in real time. Now, in that case, for the example, I just put one worker and one executer, and we have the master and the slave just to keep it simple here, but you have more, and also you can synchronize only the application that you want or all the set of applications that are defined in this repo. So that’s it, now, I cover it, the different options here is how it looks, I hope that you understand it, we can talk later in the Q and A, but at the end, it’s that simple? We met two master drivers triggers the bill and it loads the bill into S3 Argo the text that there has been a new measure and it deploys to Kubernetes. You can put it this automatically, or you can configure it to be manually just to maybe avoid some, so to avoid some people putting automatically changes in the production, at the end, you can put a human behind, but keeping in mind that you can also put it automatically so you just forget. So let’s go to the end, you know some learnings, as I said, it was really easy to set up, but with the right team and the right infrastructure. Now we were quite lucky to have a DevOps embedded in my team in the data platform team, I know that some people have some companies have an infrastructure team outside the team, in that case, it was really crucial to have the DevOps inside the team as at the end is one of us know, so he knows a lot of data, so it was easy to set up everything, to put everything in place in an infrastructural way, thanks to him, as I said you know, when I was preparing all of this talk, I was saying, Oh, I’m not sure if that’s a high level or low level, but at the end, it’s really easy. It was really easy to adopt from the, for the team, and also with the other teams that are in the data area, we can have different versions running in the cluster, we have a different needs and roles, so that was a good learning about moving from EMR to Kubernetes. Also it’s like we always have as part of testing cluster or we’re ready, it’s there, you don’t need to wait until the EMR is ready and we choose us then running nine, if there are no enough notes, you will need to wait some minutes, but there are, you will be able to run, and also I’m using a lens, I really good company that allows a really good product that allows you to monitor all the resources of your bots and their best learning, I don’t know if you are fans of Age of Empires now that adopt that DevOps in your team and combat it into data DevOps. I believe that that’s one of the best things that you can do because you have some DevOps that really knows how data works, then you are going to be capable of doing this amazing job. So to finalize things to all the team that has been in involved, as I said before, this is a cross functional team, we have, we have four lead engineers, two BI, Data house architects, one data ops and two tracking specialist. And also all this work is things to all the team that has been there for like this all the journey. So I hope that you like it and see you in the QA you’ll have here some links that they believe that it could be interesting for you. two medium posts, one from my teammate Carlos about how to deploy Spark history saving coordinators, and the one that told us about this talk and the specs and some audio information, so thank you everyone for being here.
Albert Franzi is a Software Engineer who fell so in love with data that ended up as Data Engineer Lead in the Data Platform Team at Typeform. He believes in a world where distributed organizations can work together to build common and reusable tools to empower their users and projects. Albert deeply cares about unified data and models as well as data quality and enrichment.
He also has a secret plan to conquer the world with data, insights, and penguins.