Empowering Zillow’s Developers with Self-Service ETL

May 26, 2021 03:50 PM (PT)

Download Slides

As the amount of data and the number of unique data sources within an organization grow, handling the volume of new pipeline requests becomes difficult. Not all new pipeline requests are created equal — some are for business-critical datasets, others are for routine data preparation, and others are for experimental transformations that allow data scientists to iterate quickly on their solutions.

To meet the growing demand for new data pipelines, Zillow created multiple self-service solutions that enable any team to build, maintain, and monitor their data pipelines. These tools abstract away the orchestration, deployment, and Apache Spark processing implementation from their respective users. In this talk, Zillow engineers discuss two internal platforms they created to address the specific needs of two distinct user groups: data analysts and data producers. Each platform addresses the use cases of its intended user, leverages internal services through its modular design, and empowers users to create their own ETL without having to worry about how the ETL is implemented.

Members of Zillow’s data engineering team discuss:

  • Why they created two separate user interfaces to meet the needs of different user groups
  • What degree of abstraction from orchestration, deployment, processing, and other ancillary tasks they chose for each user group
  • How they leveraged internal services and packages, including their Apache Spark package, Pipeler, to democratize the creation of high-quality, reliable pipelines within Zillow

 

Transcript

Derek Gorthy: Hi, everyone. Welcome to our talk on Empowering Zillow’s Developers with Self-service ETL. My name is Derek Gorthy. I’m a Senior Software Development Engineer on the Zillow Offers Data Engineering Team. I’ve been working at Zillow for about two years now. My background is in developing highly scalable data pipelines and machine learning applications. I have over four years of experience using Apache Spark and this is actually my second time speaking at this conference.

Yuan Feng: I’m Yuan Feng. I’m a Software Development Engineer at Zillow Group. I’m working on building tools and datasets to empower users of big data.

Derek Gorthy: So let’s take a look at what we’re going to cover in this talk. First, I’ll discuss how we think about self-service ETL. Next I’ll cover the core components of our self-service architecture, then Yuan and I will each go over one self-service implementation here at Zillow. Yuan will cover Zetlas, a self-service tool built for data analysts, and I’ll cover Zagger, a solution catering to our data producer and data engineering teams. Finally, Yuan will wrap up the presentation with next steps and key takeaways.
Here at Zillow, we’re reimagining real estate to make it easier to unlock life’s next chapter. I’m sure many of you are familiar with Zillow from Zillow.com or the Zillow app, but we also offer customers an on-demand experience for selling, buying, renting, and financing with transparency and a nearly seamless end-to-end service. Zillow is the most visited real estate website in the United States, and we’re looking to hire 2,000 people this year. So if you’re interested in learning about career opportunities, feel free to reach out to either Yuan or myself, or visit our career booth at this conference.
So this diagram represents the broader context in which self-service ETL exists within Zillow. I’ll come back to it in a bit more detail later on, but I want to focus on this bottom-right section for a minute. Last year at this conference, one of my colleagues and I presented on how Zillow Data Engineering designed the next generation of our data pipelines. One core component of this design was a library of modular Spark processing blocks that could be chained together in any order. This library gave data engineers an abstraction on top of Spark and a toolbox of common transformations. Throughout this talk, you’ll see how we’ve expanded on this idea of abstraction and modularity, this time at the orchestration layer, in order to make it even easier to build high-quality data pipelines using the tooling that we’ve developed.
So what is self-service ETL? No-code ETL solutions have been around for years, so when we talk about self-service ETL, we mean a tool that can take a definition of a data pipeline from some sort of user interface and output actual executable code. The goal here is to abstract away the complexity of orchestration, data processing, CI/CD, and other tasks that go into getting from that user interface to a deployed pipeline.
As we thought about what a self-service ETL solution should do, we broke down the process of transforming a configuration into a functioning pipeline into five steps. I’ll briefly go over each of these steps here, and then we’ll take a deeper dive into each in the next slides.
So the first is the user interaction layer. This is where the user implements the pipeline configuration. It can be anything from a fully fledged UI to a JSON blob that’s checked into a Git repo. The second is the interpretation layer. This is the process of taking a config, parsing it, and inferring any details of the pipeline definition that were not explicitly stated. The third is a pipeline metadata store. This is a persistence layer that stores the parsed pipeline definitions. The fourth is a rendering layer. This is the process that takes the metadata representation of a pipeline and constructs an executable file or files from the pipeline definition. The fifth layer is the orchestration and execution layer. This is the deployed pipeline that runs on a schedule in the batch case, or processes streaming data in the streaming case.
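To make the idea of a standard pipeline definition concrete, here is a minimal sketch of what one could look like, assuming a plain dictionary structure; the field names and values are illustrative only, not Zillow’s actual schema.

# Hypothetical sketch of a "standard pipeline definition": the orchestrator-agnostic
# output of the interpretation layer that the metadata store persists and every
# renderer consumes. Field names are illustrative, not Zillow's actual schema.
standard_pipeline_definition = {
    "name": "listings_daily",
    "owner": "data-eng",                 # owning team, used for alerting and ownership
    "schedule": "0 0 * * *",             # cron expression produced by the interpreter
    "tasks": [
        {
            "name": "wait_for_source",
            "type": "wait_for_table",
            "config": {"table": "raw.listings"},
            "depends_on": [],
        },
        {
            "name": "json_to_parquet",
            "type": "format_conversion",
            "config": {"input_format": "json", "output_format": "parquet"},
            "depends_on": ["wait_for_source"],
        },
    ],
}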
You’ll see that we bucketed these steps into two categories, opinionated and unopinionated. So by opinionated, we mean that this is the part of the process that is dependent on the structure in which the pipeline is defined. It is subjective and is built to accommodate its intended audience. By unopinionated, we mean that anything beyond this point is objective and it has no awareness of the input structure of the opinionated component. In short, this allows data engineers to bring their domain knowledge of orchestrators, Spark jobs, or CI/CD to create modular, reusable components that are used by the opinionated systems.
So now that we’ve looked at this holistically, let’s dive into each component. The user interaction layer is where we tailor the experience to the users of the UI. Yuan will demo what one of these UIs looks like later on in the presentation, so you’ll get a much better idea of how we’re understanding the needs of a user group and then tailoring the front end to those needs. We wanted to move away from the one-size-fits-all model that seems to be common in the open-source self-service ETL solutions we evaluated. For example, data analysts prefer a UI experience that shows what processing blocks are available to them and provides some of the most common configurations for these blocks in a simple, easy-to-use interface.
As a data engineer using this service, I really don’t need a fully fledged UI to create my pipeline. I can interact with tools like Terraform that allow me to create multiple pipelines more quickly than I could through a UI, and I’ll also want to be able to provide my own custom code or configurations when needed. The next point that we want to underscore here is that the UI must improve the developer experience of its users, and that different groups of users need different things from a user interface.
Moving on to the interpretation layer, this is where the configuration from the UI is parsed into a standard pipeline definition that can be consumed by the downstream process. One important point here is that, given the modular design of this system, we can actually implement any number of parsers. The only requirement of a parser is that it generates a standard pipeline definition; how it gets there is entirely up to the interpreter.
To see how an interpreter can be useful, here are a few examples. Let’s say that I want to provide users a few options for when their batch pipelines can run, say hourly, daily, and weekly. I can provide these three options in a dropdown, and the interpreter will take care of the conversion from each option to the cron expression that is required by the orchestrator. For a second example, let’s say that a data producer team has asked me to create a processing block that will transform their JSON data into Parquet format. As a stopgap, I can quickly create a script that does this conversion, add it to the interpreter, and make it available to the team. Over the course of the next couple of weeks, I then develop a well-tested Pipeler data format conversion block, and I want to replace the original script that had been used as a stopgap.
All I need to do is update the interpreter to reference my block, and the next time the pipeline config is synced, the pipeline will automatically be updated to reference my well-tested block instead of the stopgap script, without the data producer team needing to make any changes on their end. So these are just a few examples, but we found that embedding assumptions into the interpreter allows for a much simpler pipeline configuration experience.
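A minimal sketch of how such an interpreter might behave, assuming it emits the dictionary-shaped definition sketched earlier; the schedule mappings, the task-type table, and the pipeler.blocks.FormatConversionBlock module path are all hypothetical.

# Illustrative interpreter sketch, not Zillow's actual code. It converts the schedule
# dropdown options into cron expressions and resolves each task type to a processing
# implementation, so swapping a stopgap script for a well-tested block is a one-line
# change that takes effect the next time configs are synced.
SCHEDULE_TO_CRON = {
    "hourly": "0 * * * *",   # top of every hour
    "daily": "0 0 * * *",    # midnight every day
    "weekly": "0 0 * * 0",   # midnight every Sunday
}

# Hypothetical mapping of task types to implementations; the commented-out entry is
# the stopgap script that a well-tested Pipeler block could replace.
TASK_TYPE_TO_IMPLEMENTATION = {
    "json_to_parquet": "pipeler.blocks.FormatConversionBlock",
    # "json_to_parquet": "scripts/json_to_parquet.py",  # original stopgap
}


def interpret(config: dict) -> dict:
    """Parse a raw UI config into a standard pipeline definition."""
    return {
        "name": config["name"],
        "owner": config["owner"],
        "schedule": SCHEDULE_TO_CRON[config["schedule"]],
        "tasks": [
            {
                "name": task["name"],
                "type": task["type"],
                "implementation": TASK_TYPE_TO_IMPLEMENTATION.get(task["type"]),
                "config": task.get("config", {}),
                "depends_on": task.get("depends_on", []),
            }
            for task in config["tasks"]
        ],
    }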
Moving on to the pipeline metadata layer, this is a persistent store of unopinionated, standard pipeline definitions. This intermediate step serves a couple of purposes. First, if a user is looking to create an Airflow pipeline, they only need to create the parser that generates the standard pipeline definition, and I’ll cover why that is in the next slide. Second, this allows us to start building tooling on top of the metadata store. If we want to go back and look at all of our data pipelines, spread across multiple orchestration engines today, we no longer need to go into each individual engine’s metastore.
Instead, every pipeline definition is stored in a central location, and this allows us to better understand things like data lineage, the dependencies between pipelines, and how we can more efficiently schedule pipelines when we understand their upstream dependencies. Third, this allows us to serve up current pipeline definitions across multiple user interfaces. If a team has created a pipeline through one user interface like Terraform, it could later be viewed through the Zetlas UI, giving users a seamless experience across our platform.
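As a rough illustration of the kind of tooling a central metadata store enables, here is a sketch of a cross-orchestrator dependency query over the dictionary-shaped definitions from the earlier sketches; the store’s real schema and access layer are internal, so this in-memory form is purely illustrative.

# Sketch of the kind of cross-orchestrator query a central metadata store makes
# possible: find every pipeline, regardless of which engine runs it, that waits on
# a given upstream table.
from typing import Dict, List


def pipelines_depending_on(definitions: List[Dict], table: str) -> List[str]:
    """Return the names of pipelines with a task that depends on `table`."""
    return [
        definition["name"]
        for definition in definitions
        if any(
            task.get("config", {}).get("table") == table
            for task in definition["tasks"]
        )
    ]


# Example: which pipelines need rescheduling if raw.listings lands late?
# pipelines_depending_on(all_definitions, "raw.listings")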
Moving on to the rendering layer, this is the stage that’s responsible for transforming the abstract standard pipeline definition into executable code. As we build out support for more orchestrators, we only have to implement each orchestrator’s renderer once.
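Here is a rough sketch of what a renderer could look like for the dictionary-shaped definition used in the earlier sketches; it is not Zillow’s implementation, it targets Airflow 2.x syntax, and it uses BashOperator only as a placeholder where a real renderer would map each task type to the appropriate operator.

# Minimal renderer sketch: standard pipeline definition in, Airflow 2.x DAG source out.
DAG_TEMPLATE = '''\
from airflow import DAG
from airflow.operators.bash import BashOperator
import pendulum

with DAG(
    dag_id="{name}",
    schedule_interval="{schedule}",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
{tasks}
{dependencies}
'''


def render_airflow(definition: dict) -> str:
    """Render a standard pipeline definition into an executable Airflow DAG file."""
    tasks = "\n".join(
        f'    {task["name"]} = BashOperator(task_id="{task["name"]}", '
        f'bash_command="echo running {task["name"]}")'
        for task in definition["tasks"]
    )
    dependencies = "\n".join(
        f'    {upstream} >> {task["name"]}'
        for task in definition["tasks"]
        for upstream in task.get("depends_on", [])
    )
    return DAG_TEMPLATE.format(
        name=definition["name"], schedule=definition["schedule"],
        tasks=tasks, dependencies=dependencies,
    )

Because a renderer like this only reads the standard definition, moving to a new orchestrator or a new major version becomes a bulk re-render rather than a hand rewrite.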
Let’s dig into why this is a simple but powerful concept. First, this allows for the flexibility to move between orchestrators. Traditionally, if you wanted to move from one orchestrator to another, you’d need to recreate that pipeline file in the new orchestrator’s format; you’d basically need to entirely rewrite that pipeline for the new orchestrator. This process is tedious, and there’s a lot of work that goes into checking that the new pipeline is equivalent to the old one. However, if you go through our pipeline renderer, you have the ability to produce that pipeline in any supported orchestrator without needing to understand that orchestrator’s syntax. Second, this solves the problem of major version upgrades in an orchestrator, where things like package structure, class names, and function inputs are likely to change. Instead, you can build a renderer for the new major version and, in bulk, upgrade your pipelines to that new version. Finally, we have the orchestration and execution layer. This is where the pipeline is deployed and where processing steps are run.
In our presentation last year, we talked about how one of the key components of our next-generation data pipelines is the development of our Pipeler processing library. In order to standardize and improve the quality of our pipelines, we’re continuing to develop a shared Apache Spark processing library that provides plug-and-play processing blocks. These processing blocks can be arranged in any order and give data engineering and other teams the ability to run high-quality Spark code with no development cost. Like I said at the top of this presentation, this self-service ETL solution is an extension of that same concept, where we’re looking to provide an abstraction on top of the orchestration layer so that it’s even easier to create high-quality data pipelines. And with that, I’ll hand it off to Yuan to discuss Zetlas.
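Pipeler itself is internal to Zillow, so the following is only a minimal sketch of the plug-and-play block idea, assuming a simple block interface and hypothetical S3 paths; it is not Pipeler’s actual API.

# Sketch of chaining reusable Spark processing blocks in a configurable order.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


class ProcessingBlock:
    """A reusable transformation step that can be chained in any order."""
    def apply(self, df: DataFrame) -> DataFrame:
        raise NotImplementedError


class DropNullRows(ProcessingBlock):
    def __init__(self, columns):
        self.columns = columns

    def apply(self, df: DataFrame) -> DataFrame:
        return df.dropna(subset=self.columns)


class AddIngestionDate(ProcessingBlock):
    def apply(self, df: DataFrame) -> DataFrame:
        return df.withColumn("ingestion_date", F.current_date())


def run_blocks(df: DataFrame, blocks) -> DataFrame:
    """Apply each block in the configured order."""
    for block in blocks:
        df = block.apply(df)
    return df


if __name__ == "__main__":
    spark = SparkSession.builder.appName("pipeler-sketch").getOrCreate()
    raw = spark.read.json("s3://example-bucket/raw/listings/")        # hypothetical path
    cleaned = run_blocks(raw, [DropNullRows(["listing_id"]), AddIngestionDate()])
    cleaned.write.mode("overwrite").parquet("s3://example-bucket/clean/listings/")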

Yuan Feng: Thank you, Derek. Zetlas is short for Zillow ETL as a Service. The motivation for building Zetlas was to provide data analysts, and anyone who doesn’t necessarily want to code, with a reliable, UI-based self-service tool to automate their SQL-based workflows. Zetlas offers multiple features. First, since it is UI driven, users can quickly prototype and deploy their jobs. Second, it supports job monitoring and alerting functionality, which can send alerts through email when a job fails or fails to finish within a certain amount of time. Besides that, Zetlas has a job validation service, which can catch most errors before a job can be successfully submitted. In addition, Zetlas is integrated with multiple internal services, like the data portal for team ownership, the Luminaire data contract platform, and so on. Finally, Zetlas is scalable and extensible; we are continuously adding new task types to Zetlas to expand its functionality. In the next slides, let’s watch a video to see how the Zetlas UI works by creating a new job.
Now I’ll go through the job creation process to demonstrate how Zetlas works. We need to click this New Job button first, and it will lead us to a job creation page. We need to fill in some basic information for this job, like the job name and job description. Then we need to set a team owner for this job, and then the job schedule. We have some preset schedules here; let’s say it is a daily job, and we can see it will kick off at 12:00 AM every day. Then we go to the job start date. The job start date can be a future date, a past date, or the current date. If we select a future date, it means that the job will only start running on some date in the future. But if we set some past date, it means that this job is a backfill job, and it will do all the backfill starting from the job start date we’ve given.
And this is how long we expect the job to take to finish. Let’s say we expect this job to finish within one hour; after one hour, if the job hasn’t finished and is still running, it will send an alert to the email address we provide, so let’s give the email address here. The next thing we need to do is to add a task. A task is the basic execution component within a job, and within one job we can have multiple tasks. We can then define the dependency relationships between different tasks. We have a list of different task types to choose from, from basically three categories. The first category is query-based creation, with three different query types: Hive, Presto, and Spark SQL.
We can also set a dependency here; you can set a dependency on a dataset through a location or a Hive table. And we can also do table validation here: we can validate a table, and we can also publish the availability of a dataset. In this demo, we’ll create three tasks, one from each category. Let’s say there’s a dependency.
Let’s say we’ll wait for one dataset to exist before executing the following tasks. So this task will wait for this table, and we will only check whether the table exists; since this table is a non-partitioned table, we won’t check the partitions. Click OK. For the second task, we’ll use the first category, creation. We want to create a new table based on the previous old table; we’ll call it the create-new-dataset task, and we can put the query here. Basically, it is creating a new table based on all the data from the old table. Because this demo dataset doesn’t contain PII data, for the PII tagging part we will select no PII. The next thing is the dependency: because this task can only run after the first task is successful, it will depend on the first task, which is the wait task. Zetlas also supports query validation; if your query has a syntax error, when you submit the task, the Zetlas UI will stop you and show you where the error is.
And then we will create a third task from the data quality category. It will publish the dataset after the second task is done. Let’s call it the publish-dataset task; we are publishing the availability of this particular table, and it will depend on the second task. And this is all the information we need, so we click submit. As we can see, the new job is created. Apart from the job details, if we want to check the running history of each job, we can go to the history tab, and if we want to check the task-level running history, we can expand here. On the top right side, we have the control panel, where we can edit and update a job, pause and unpause a job, and delete a job. This was a quick demo of the Zetlas UI; let’s go back to the slides.
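For reference, the three-task job built in this demo might be represented behind the UI by something like the following sketch; the structure, field names, and table names are hypothetical, since Zetlas’s actual job format is internal.

# Hypothetical sketch of the demo job as the Zetlas UI might submit it behind the scenes.
demo_job = {
    "name": "demo_job",
    "owner": "data-analytics-team",
    "schedule": "daily",                      # preset, kicked off at 12:00 AM
    "expected_runtime_minutes": 60,           # alert email sent if exceeded
    "alert_email": "analyst@example.com",
    "tasks": [
        {   # category 1: dependency / wait task
            "name": "wait_for_old_table",
            "type": "wait_for_table",
            "table": "demo_db.old_table",     # non-partitioned, existence check only
        },
        {   # category 2: query-based creation (Hive / Presto / Spark SQL)
            "name": "create_new_dataset",
            "type": "spark_sql",
            "query": "CREATE TABLE demo_db.new_table AS SELECT * FROM demo_db.old_table",
            "pii": "no_pii",                  # PII tagging selection from the demo
            "depends_on": ["wait_for_old_table"],
        },
        {   # category 3: data quality / publish availability
            "name": "publish_dataset",
            "type": "publish_availability",
            "table": "demo_db.new_table",
            "depends_on": ["create_new_dataset"],
        },
    ],
}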

Derek Gorthy: Thanks, Yuan. As with all things Zillow, we love our Zs, and with that, I’ll talk a bit about Zagger. Zagger is aimed at providing a developer-friendly abstraction on top of ETL tools. Primarily, it constructs the actual pipeline from a config definition and automates a lot of the ancillary tasks that data engineers do when creating a pipeline. We chose Terraform as our UI layer, as our users are familiar with infrastructure as code, and we can use things like auto-complete and the entity relations that Terraform provides. We’ve created Terraform resources that cover common pipeline patterns that we see, allowing teams to quickly stand up a pipeline that follows a common structure. Terraform talks with Zagger’s REST API, and we’ve developed this API first, allowing us to focus on building integrations and onboarding as many new pipeline processing blocks as possible. Our target users are primarily data producer teams within Zillow that want to create their own data pipelines, but we’re also designing it such that data engineers like myself can use this tool to enhance their development experience.
Now let’s dive a little bit deeper into the actual structure of Zagger and the integrations that it provides. When we refer to Zagger, we’re referring to two things. First is the Zagger pipeline utilities library, or Z-pub, that we see at the bottom of this slide. This is where all of the actual parsing, rendering, and data pipeline templates are implemented. We decided to go with this library approach because we first and foremost want to develop tooling that everyone within Zillow can use without being dependent on our service. We see our service as one potential avenue through which a developer can use our package, but we don’t want it to be the only avenue. Second, we have the Zagger managed service, or ZIMS. This is the service that we run, which, when accessed through an API, will parse, render, and deploy the pipeline.
Working left to right, we’ve designed this service with an access layer that’s UI-independent, through standard REST access. A user can create a pipeline through Terraform, Zetlas, or even GitLab CI/CD; we want to make the service as flexible as possible to allow for more UI tooling to be built on top of it. Looking into the ZIMS section at the center of the slide, I’ll cover a few key components of this design. First, the service supports multiple interpreters and multiple renderers; because these components are modular, users automatically get the flexibility of Z-pub just by using the managed service. Second, the managed service aims to automate or augment a lot of the ancillary tasks that data engineers do today. While we just have one integration shown, I’ll provide a couple of examples to better explain. As a data engineer, I need to create a Hive table when I’m onboarding a new data source and want to persist that data into a table.
The managed service is aware of the Hive table referenced in the data pipeline, and given a schema, it will actually go out and create that Hive table if it doesn’t already exist. If a user wants to integrate with Zillow’s internal data quality platform, the managed service will check to make sure that the data contract ID referenced in the pipeline is valid.
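As a rough sketch of that kind of ancillary automation, and not ZIMS’s actual implementation, the table-creation step can be as simple as an idempotent DDL statement; the table name, schema DDL, and location shown are placeholders.

# Illustrative sketch of the "create the Hive table if it doesn't already exist" step.
from pyspark.sql import SparkSession


def ensure_hive_table(spark: SparkSession, table: str, schema_ddl: str, location: str) -> None:
    """Idempotently create an external, Parquet-backed Hive table for a pipeline output."""
    spark.sql(
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({schema_ddl}) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )


# Example usage with placeholder values:
# spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# ensure_hive_table(spark, "clean.listings", "listing_id STRING, price DOUBLE",
#                   "s3://example-bucket/clean/listings/")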
Moving to the bottom right of this diagram, we have the integration with execution. The managed service provides a way for users to leverage the internal processing libraries that our data engineers have created. As more processing steps are developed, we can continue to support those steps and more quickly give our users the tools that they need to create high-quality pipelines. Users can also configure which clusters their jobs run on, allowing teams to create and monitor their own production resources. Finally, we’re looking to consolidate monitoring of production pipelines. Users of ZIMS will be able to view the status of their jobs, monitor production failures, and view performance metrics for their pipelines in a single place. We do this by routing Spark job logs, Airflow or other orchestrator metrics, and internal monitoring services to a single place so that teams can visualize how their pipelines are performing in real time. And with that, I’ll hand it off to Yuan to talk about next steps and close us out.

Yuan Feng: Thanks, Derek. We started the self-service ETL platform development process back in 2019. The overall development process was completely decoupled and divided into two parallel sub-processes: one for Pipeler, which is our shared Spark processing library, and the other for the services that render the logic based on that processing library, like Zetlas and Zagger. Zetlas was started and launched in 2020; it provides users with a UI-based tool to create pipelines. The development of the Zagger managed service and the pipeline utilities package, Z-pub, was also started in 2020, and it provides users with additional access layers through Terraform and Zagger. In 2021, [inaudible], which was the first version of the self-service platform at Zillow, was retired and replaced by Zetlas. We are currently working on the unification of the Zagger and Zetlas back ends, and we have other goals in 2021 as well.
The first takeaway from our presentation is that UI design needs to solve the users’ pain points when they want to create a pipeline without having to write code. We also learned that self-service ETL has significant demand and a wide audience; both engineers and non-engineers need a self-service ETL platform to accelerate their workflows. The third takeaway is that modularizing and separating the whole self-service ETL path into different components has many benefits, like improved readability of components, improved development efficiency, and easier issue tracking.
The final takeaway is that abstraction from the two specific implementations provides better flexibility to satisfy user requirements, and the separation between the opinionated layer and the unopinionated layer gives us the ability to expand if we want to introduce a new opinionated layer in the future. More from Zillow at this conference: we’ll talk about democratizing data quality through a centralized platform at 3:15 PM PST, and [inaudible] will talk about anomaly detection with Luminaire at 5:00 PM PST. In the end, please don’t forget to rate and review the session; your feedback is important to us and helps us improve. Thank you so much for your time. Any questions?

Derek Gorthy

Derek Gorthy is a senior software engineer on Zillow’s Big Data team. He is currently focused on leveraging Apache Spark to design the next generation of pipelines for the Zillow Offers business. Pr...

Yuan Feng

Yuan Feng is a software engineer on Zillow’s Big Data team. He has been working on building the self-service platform to automate ETL building process, building business datasets, data processing li...