When supporting a data science team, data engineers are tasked with building a platform that keeps a wide range of stakeholders happy. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. Collaboration between these stakeholders can be difficult, as every data science pipeline has a unique set of constraints and system requirements (compute resources, network connectivity, etc.). For these reasons, data engineers strive to give their data scientists as much flexibility as possible while maintaining an observable and resilient infrastructure. In recent years, Apache Airflow (a Python-based task orchestrator developed at Airbnb) has gained popularity as a collaborative platform between data scientists and infrastructure engineers looking to spare their users from verbose and rigid YAML files. Apache Airflow exposes a flexible Pythonic interface that can be used as a collaboration point between data engineers and data scientists. Data engineers can build custom operators that abstract details of the underlying system, and data scientists can use those operators (and many more) to build a diverse range of data pipelines. In this talk, we will take an idea from a single-machine notebook, to a cross-service Spark + TensorFlow pipeline, to a canary-tested, hyperparameter-tuned, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
– Hello everybody, my name is Daniel Imberman and thank you for coming to my talk, "From Idea to Model: Productionizing Data Pipelines with Apache Airflow." So to start out with, I just wanted to tell you a little bit about who I am. My name is Daniel, I'm an Airflow engineer at astronomer.io and I'm a committer on the Apache Airflow project. I've also helped build data science platforms for the last five to seven years for companies like Apple and Bloomberg and a couple of others, and that experience is what I wanted to go over with you today. So, let's talk about the data ecosystem.
In most large tech companies, there are three main players in a data ecosystem: you have the data scientists, the data infrastructure teams, and the analyst slash business intelligence experts. Each of these three groups has different necessities that help them best do their jobs, but they all have the same goal. When it comes to the data scientists, data scientists just want things to work. They often come from a math background or a stats background and they're not necessarily software developers, so they wanna be able to control things like CPU, GPU, and memory, but they don't wanna have to learn how to use something like Kubernetes.
You then have the data infrastructure teams and the data infrastructure teams, their main focus is stability and scalability.
They want to be able to best support the data scientists while also ensuring that one person can't bring down the entire system. They also want alerting and monitoring to ensure as much uptime as possible.
And finally you have the data analysts and business intelligence experts. These are probably the least technical people on the team, but they're the people who actually take these models, predictions, and insights and turn them into value for the business. Oftentimes the data analyst wants easy access to the post-processed data via, for example, a SQL database, where they can build dashboards and spreadsheet reports and hand those over to the business decision makers, to make decisions based on those predictions.
To give data scientists and business analysts the most flexibility possible, but also keep them within the bounds of a reasonable system, we wanna build what I call a bumper rail model.
And the idea of a bumper rail model is to give as much flexibility as possible within bounds, but also alert quickly when a user tries to go outside of that sandbox and this is often where a lot of companies will start to build their own data science platform.
But I'm here to beg you: don't! Yes, your company will have a lot of its own business logic, you might even have your own machine learning libraries, but building and managing a data science platform from scratch is a monumental effort. Every single feature request, every single bug ticket, you will be on your own to figure them out. And so now the question is, how do you manage a flexible data science platform without having to build it yourself? The solution I'm coming to you with today is Apache Airflow and Apache Spark, namely, data orchestration and data processing.
Now, what is Apache Airflow? Apache Airflow is a workflow scheduler developed at Airbnb in 2015 and it's a manifestation of configuration as code, namely Python. Apache Airflow users can use Python to create directed acyclic graphs, or DAGs. These DAGs are then easy to schedule, scale, and monitor via the Airflow system.
Apache Airflow is a very popular project with a very active community. We have a Slack where both the committers and the community are constantly in discussion, we have a dev list, and we have a fairly active community on GitHub as well.
Now, the way that you build a DAG in Airflow is you define operators and you set dependencies. Each of these operators is a unique task, and you can build them one by one or, in this case, use a for loop to build a lot of them with just a small amount of code. Once you have defined these operators, you can then set the dependencies such that Airflow knows which tasks depend on which other tasks. Once you've submitted your DAGs to Airflow, you will have a single dashboard where you can see all of your workflows at once. This is great for monitoring, it's great for getting a single overview, and there are also tie-ins with a lot of monitoring systems like PagerDuty.
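As a rough sketch of that pattern (the DAG name, task ids, and callables here are hypothetical, and the import path assumes the Airflow 1.10-era layout the talk is based on):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # 1.10-style import path


def process_partition(partition, **kwargs):
    # Stand-in for a real processing step.
    print("processing partition", partition)


with DAG(
    dag_id="example_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    # Build several similar tasks in a for loop, then set the dependencies:
    # extract >> process_N >> load, so Airflow knows the execution order.
    for i in range(3):
        process = PythonOperator(
            task_id="process_{}".format(i),
            python_callable=process_partition,
            op_kwargs={"partition": i},
        )
        extract >> process >> load
```

Each `>>` call is what tells Airflow which tasks depend on which other tasks; the loop body runs three times, so three parallel `process_*` tasks sit between `extract` and `load`.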
You can also see your DAG running in real time and see historical runs, so you can go back and easily see, "Oh, that run didn't complete three days ago, I need to backfill that to make sure I have data that's as correct as possible."
And probably the most powerful feature of Apache Airflow is its really robust library of community-created operators. The Airflow community has created connectors to systems like Kubernetes, Hadoop, AWS, Spark, Dask, and so many other services. And this wide list of connectors allows a data scientist with just simple Python code to speak to so many external data storage and data processing systems.
So let’s kind of go over what each of these stakeholders get out of a system like Apache Airflow. So we start out with the data engineers because these are the people who will be building the Airflow system and maintaining it over time.
So for data engineers, Airflow offers a powerful tool for standardization, monitoring, and consistency. It also gives them the ability to create a data science platform that is unique to their company by extending existing operators. You can essentially take community operators and then add in your company logic to ensure that your data scientists are working within bounds that are safe for them to work in. Airflow also offers easy integrations into Elasticsearch and Prometheus for log scraping and system monitoring, and it scales: companies like Airbnb and Lyft run millions of tasks per day using Apache Airflow.
So when it comes to taking away complexity, let's take the example of submitting a Spark job. This is the spark-submit command for the canonical, classic SparkPi example, but as many of us who have used Spark in production know, as your use case gets more and more complex, there are a lot of configurations in submitting a Spark job. There are all sorts of environment variables you can set, and different Java libraries and Python libraries that you might wanna include, and that can get pretty large. It can get to the point where a data scientist would have a lot of problems figuring out where they can edit the spark-submit and where they can't.
This can actually get even more complex if you're starting to work with, say, HDFS with Kerberos, where now you also have to do keytab management, and you have to know how those systems work to correctly create the spark-submit command that works with those encrypted data stores.
And so a simpler way would actually be to extend an existing operator. There's a SparkSubmitOperator as part of the upstream open source Airflow, and what we can do is take that existing operator and add in all of the configurations that we know are required, plus whatever custom logic we want. So if we're dealing with Kubernetes, we can create a node affinity to ensure that this runs on a node with GPUs, or in the case of Spark, we can ensure that it only uses a certain keytab. And by creating this wrapper, we are actually exposing far fewer arguments to our data scientists, to keep them within the bounds of what we know is safe.
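A minimal sketch of that wrapper pattern might look like this, assuming the Airflow 1.10-era import path and a purely hypothetical company keytab, principal, and queue convention:

```python
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator  # 1.10-era path


class CompanySparkOperator(SparkSubmitOperator):
    """Wrapper that bakes in the configuration our platform requires, so
    data scientists only need to choose an application and its arguments."""

    def __init__(self, application, application_args=None, **kwargs):
        super().__init__(
            application=application,
            application_args=application_args,
            conn_id="spark_default",
            # Hypothetical company defaults: a fixed keytab, principal, and queue,
            # so users never touch (or break) the Kerberos setup themselves.
            keytab="/etc/security/keytabs/etl.keytab",
            principal="etl@EXAMPLE.COM",
            conf={"spark.yarn.queue": "data-science"},
            **kwargs
        )
```

A data scientist then writes just `CompanySparkOperator(task_id="train", application="train.py")`, with everything else injected by the platform.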
In this sense, Apache Airflow actually simplifies the Spark experience because a lot of the configuration that goes into submitting a Spark job can actually be injected at runtime.
And since a lot of these base operators are actually created and maintained by the community, this is where something like using Airflow instead of building your own data science platform starts to pay off: with new versions of Airflow, there will be bug fixes on those operators, there will be updates for when a new version of the thing you're connecting to comes out, and you will have a much easier time staying up to date. Now building a data science platform is as simple as writing a few plugins and a few custom operators, and all the underlying infrastructure can be managed by the Airflow community, which you can then reach out to whenever you need help.
Now let's talk about what's in it for the data scientists. The data scientists not only get this real benefit of a lower layer defined by the data engineers, which allows them to focus much more on the actual machine learning problems they're working on, but it's also very easy to parameterize pipelines and store historical pipelines in git. So now it's pretty easy to roll back, see what you've been working on historically, and try out pipelines with different values. There are a lot of operators that can handle a lot of the gnarlier ETL steps that are needed before you have that processed data, and you have the full flexibility of the Python language, which I'm gonna show an example of now.
So in this example, let's say we wanted to run hyperparameter tuning. We have this linear DAG and we want to actually run those three tasks in parallel, because we decided they're not actually dependent on each other. So we change two lines of code and now all of those tasks are running in parallel. But we wanna actually know what the correct loss rate to use is, so rather than just having a single loss rate, we can create a list of all the loss rates we actually care about, and then with a for loop over those loss rates, we can create a new task for each of the values we're trying to hyperparameter tune on. And so, with four lines of code changed, now you're able to run something massively in parallel, do the tuning you need, and then converge at the end to find the best model.
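Stripped of the Airflow boilerplate, the fan-out/fan-in logic is essentially the following. The candidate loss-rate values and the scoring function are made up for illustration; in the real DAG, each dictionary entry would be its own operator and the "fan in" step would be the final converge task.

```python
def evaluate(loss_rate):
    # Stand-in for a real training run; returns a validation score.
    # This toy function peaks at loss_rate == 0.01.
    return 1.0 - abs(loss_rate - 0.01) * 10


def tune(loss_rates):
    # Fan out: one "task" per candidate value (one operator each in Airflow).
    results = {rate: evaluate(rate) for rate in loss_rates}
    # Fan in: a final converge task compares the scores and picks the best.
    best = max(results, key=results.get)
    return best, results[best]


best_rate, best_score = tune([0.001, 0.01, 0.1])
```

In Airflow the `for rate in loss_rates` loop would instantiate operators instead of dictionary entries, but the shape of the computation is the same.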
This is also especially cool now, as in Airflow 1.10.10, we have released the ability to, at runtime when you trigger a DAG, give it a JSON object to override defaults. So now, as a data scientist, when you have the template of the DAG that you want, you can actually fill in variables, so that you can run the same DAG multiple times with different values.
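Inside a task, that override pattern is essentially a dictionary merge. The default names below are hypothetical; Airflow exposes the JSON you pass at trigger time as `dag_run.conf`, and a task can merge it over its defaults like this:

```python
# Hypothetical defaults a DAG author might define for a training pipeline.
DEFAULTS = {"loss_rate": 0.01, "epochs": 10, "data_path": "s3://bucket/train"}


def resolve_params(dag_run_conf):
    """Merge the JSON passed at trigger time (available to tasks as
    dag_run.conf) over the DAG's defaults. A missing or empty conf
    leaves every default untouched."""
    return {**DEFAULTS, **(dag_run_conf or {})}


params = resolve_params({"epochs": 50})
# params["epochs"] is now 50; everything else keeps its default value.
```

This is what lets the same DAG run many times with different values: only the JSON changes, never the DAG file.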
And finally, let’s talk about what’s in it for the data analysts.
So for the data analysts, probably the biggest thing is that when both the data scientists and the data engineers have better uptime on their ETL pipelines and better monitoring to make sure that those pipelines complete, they have better assurances that the data they are showing to the business is accurate. It's also the case that with the Airflow API, the data engineers can actually expose endpoints that allow the analysts to perform queries against the databases and data sources on a scheduled basis.
So, now I wanna discuss what would go into creating a data science pipeline using Apache Airflow.
For this data science pipeline system, I'm going to use essentially the segments created by Cookiecutter Data Science, which is that every data model requires three main steps: experiment, parameterize, and productionize.
When it comes to experimentation, there's no more pervasive tool I can think of than Jupyter Notebooks: an interactive Python shell where you can easily take the same data frame and do as many iterations on it as you'd like.
Now, where Airflow can tie in nicely with Jupyter Notebooks is this: if you can offer Jupyter Notebooks with the same environment as your Airflow workers, you have a one-to-one matching. Let's take the example where you have a Jupyter Notebook that's communicating with a Spark cluster. If you move a lot of the Spark configuration into volume mounts and environment variables, you can offer the data scientists a Jupyter Notebook that speaks to the dev cluster, so they can run Spark jobs and debug. And then when you actually want to upgrade that to staging or production, all you have to do is take that Jupyter Notebook, put it in a DAG, and run that DAG in a staging or production Airflow instance, allowing Airflow at runtime to inject the environment variables and Spark configurations necessary to speak to those separate Spark clusters. So once you have your task looking and acting the way you want it to, you're gonna wanna parameterize it. You're gonna want the ability to try out different data sources or different hyperparameter segments. And this is where a project called Papermill is really valuable. Papermill is a very popular project within the Jupyter ecosystem and it allows users to easily parameterize their Jupyter Notebooks.
So I'd like to show an example here where we wanna be able to take variables A and B and parameterize them.
All we have to do is set the "parameters" tag on the cell. The values are unchanged within the notebook you're experimenting on, but within Papermill, you can actually pass in those parameters, and what Papermill will do is take that input notebook, replace those variables, and output a new notebook with the new values. This ties into the Airflow ecosystem as Airflow has a Papermill operator, so as long as you've tagged the cells you want to parameterize, when you launch that Jupyter Notebook via Airflow, you can use the Papermill operator and modify whatever variables you want.
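With the cell tagged `parameters`, the Airflow side is a single operator. This sketch assumes the Airflow 1.10-era import path, and the notebook paths and parameter names are hypothetical:

```python
from airflow.operators.papermill_operator import PapermillOperator  # 1.10-era path

run_notebook = PapermillOperator(
    task_id="run_training_notebook",
    input_nb="/notebooks/train.ipynb",                 # the notebook you experimented in
    output_nb="/notebooks/out/train-{{ ds }}.ipynb",   # executed copy, templated by run date
    parameters={"a": 10, "b": 20},                     # overrides the tagged parameter cell
)
```

Papermill injects a new cell after the tagged one with these values, so the original notebook stays untouched while each Airflow run produces its own executed copy.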
And the final step, of course, is productionization, and this is where things like hyperparameter tuning come into play. But I wanted to talk about a different machine learning model testing system that is also very common and very valuable, and that is canary testing.
Now, canary testing is a method for ensuring that your new models are an improvement compared to the old models, but also a way of preventing models that might have errors from going through. So for example, if your model currently has an 83% accuracy rate and suddenly the next model has 97%, something might've gone wrong: you might have overfit the data, something might have changed, and you're going to wanna be alerted before you put that into production. With Airflow it's very easy to build a system that allows for canary testing. This is an example where we built a canary testing system using what's called the BranchPythonOperator. What the BranchPythonOperator does is take an arbitrary Python function and, depending on the return value of that function, pick between any number of branches to go down. So in this case, we have an old model and a new model, and if the new model is better than the old models, it will deploy the model. So on this first run, it calculates the values for all of them and compares them, and in this case, it's going to determine that the new model is better, so it deploys that new model. But if we go back and change it such that the old models now have a higher value than the new model, we can run this DAG again, and this time it will alert you of a failure via PagerDuty or Slack or whatever integration you want to plug in.
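The callable handed to the BranchPythonOperator can be as simple as the function below; it returns the task_id of the branch to follow. The downstream task ids and the "suspicious jump" threshold are made up for illustration:

```python
def choose_branch(old_accuracy, new_accuracy, max_jump=0.10):
    """Return the task_id Airflow should follow next.

    Rejects regressions AND suspiciously large improvements, since a
    sudden jump (e.g. 83% -> 97%) often means something went wrong."""
    if new_accuracy <= old_accuracy:
        return "alert_regression"
    if new_accuracy - old_accuracy > max_jump:
        return "alert_suspicious_jump"
    return "deploy_model"
```

In the DAG, this function would be passed as the `python_callable` of a BranchPythonOperator, with `deploy_model`, `alert_regression`, and `alert_suspicious_jump` as its downstream tasks; the alert tasks are where the PagerDuty or Slack integration would hang.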
So, for those of you who are interested in getting involved in the Apache Airflow community, it is very active and very friendly. I highly encourage you to come join our Slack, we also have a very active dev list where a lot of the committers and PMC members are very responsive, and both Apache Airflow and Astronomer have a lot of really interesting documentation on our blogs.
So in conclusion, if you're looking to build a data science platform, consider using Apache Airflow: it's widely used, it's battle-tested, and it has so much of the infrastructure you would otherwise have to build for yourself. If you're a data engineer, think about using custom operators as a way of simplifying life for your data scientists, and simplifying your own life, as you won't have to answer as many questions and put out as many fires. And if you're interested in trying Airflow out and you wanna try a vendor-supported distribution, please reach out, we are always down to meet and discuss creating a demo.
Thank you so much, once again.
Daniel Imberman is a full-time Apache Airflow committer, a digital nomad, and constantly on a search for the perfect bowl of ramen. Daniel received his BS/MS from UC Santa Barbara in 2015 and has worked for data platform teams ranging from early-stage startups, to large corporations like Apple and Bloomberg LP.