Derek Gorthy

Developer, Zillow

Derek Gorthy is a senior software engineer on Zillow’s Big Data team. He is currently focused on leveraging Apache Spark to design the next generation of pipelines for the Zillow Offers business. Previously, Derek was a senior analyst at Avanade, implementing ML applications using Spark for various companies across the technology, telecom, and retail sectors. For this work, he received the Databricks Project Partner Champion award at the 2019 Spark+AI Summit. He has a BS in Computer Science and Quantitative Finance from the University of Colorado, Boulder.

Past sessions

Summit 2021 Empowering Zillow’s Developers with Self-Service ETL

May 26, 2021 03:50 PM PT

As the amount of data and the number of unique data sources within an organization grow, handling the volume of new pipeline requests becomes difficult. Not all new pipeline requests are created equal — some are for business-critical datasets, others are for routine data preparation, and others are for experimental transformations that allow data scientists to iterate quickly on their solutions.

To meet the growing demand for new data pipelines, Zillow created multiple self-service solutions that enable any team to build, maintain, and monitor their data pipelines. These tools abstract away the orchestration, deployment, and Apache Spark processing implementation from their respective users. In this talk, Zillow engineers discuss two internal platforms they created to address the specific needs of two distinct user groups: data analysts and data producers. Each platform addresses the use cases of its intended user, leverages internal services through its modular design, and empowers users to create their own ETL without having to worry about how the ETL is implemented.
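To make the degree of abstraction concrete, here is a minimal PySpark sketch of the kind of job such a platform might generate from a user's declarative inputs; the paths, SQL, and job shown are illustrative assumptions, not Zillow's actual pipelines:

    # Illustrative sketch: the user supplies only a source, a SQL
    # transform, and a sink; the platform owns the Spark boilerplate
    # below, along with orchestration and deployment.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("listings_daily_rollup").getOrCreate()

    # Hypothetical source path registered under a name the user's SQL can reference
    spark.read.parquet("s3://example-bucket/raw/listings/") \
        .createOrReplaceTempView("source")

    # The only piece the user actually writes
    result = spark.sql(
        "SELECT region, COUNT(*) AS listing_count FROM source GROUP BY region"
    )

    # Hypothetical sink path; write handled by the platform
    result.write.mode("overwrite").parquet("s3://example-bucket/curated/listing_counts/")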

Members of Zillow’s data engineering team discuss:

  • Why they created two separate user interfaces to meet the needs of different user groups
  • What degree of abstraction from orchestration, deployment, processing, and other ancillary tasks they chose for each user group
  • How they leveraged internal services and packages, including their Apache Spark package, Pipeler, to democratize the creation of high-quality, reliable pipelines within Zillow (see the sketch after this list)
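Pipeler's API is internal to Zillow and not public, so the following is only a rough illustration of the reusable, composable Spark transforms such a package could provide; every name below is hypothetical:

    # Hypothetical sketch of a shared-transform package in the spirit of
    # Pipeler; Pipeler's real API is internal, so these names are invented.
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def deduplicate(keys):
        """Return a transform that keeps one row per key combination."""
        def _transform(df: DataFrame) -> DataFrame:
            return df.dropDuplicates(keys)
        return _transform

    def add_ingest_date(col_name="ingest_date"):
        """Return a transform that stamps each row with the load date."""
        def _transform(df: DataFrame) -> DataFrame:
            return df.withColumn(col_name, F.current_date())
        return _transform

    # Pipelines compose shared transforms instead of re-implementing them:
    # cleaned = raw_df.transform(deduplicate(["listing_id"])).transform(add_ingest_date())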


The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially in a rapidly evolving organization. Ingestions from new data sources are frequently added on an as-needed basis, making it difficult to share functionality between pipelines. Identifying when technical debt has become prohibitive for an organization can be difficult, and remedying it can be even more so. As the Zillow data engineering team grappled with their own technical debt, they identified the need for stronger data quality enforcement, the consolidation of shared pipeline functionality, and a scalable way to implement complex business logic for their downstream data scientists and machine learning engineers.
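As one example of what shared data quality enforcement can look like in Apache Spark, the check below fails a pipeline early when required columns contain nulls; it is an illustrative sketch, not Zillow's implementation:

    # Illustrative data quality check, not Zillow's actual code: a shared
    # helper that every pipeline can call before writing its output.
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def enforce_not_null(df: DataFrame, columns) -> DataFrame:
        """Raise if any required column contains nulls; otherwise pass through."""
        for col_name in columns:
            null_count = df.filter(F.col(col_name).isNull()).count()
            if null_count:
                raise ValueError(
                    f"Data quality check failed: {null_count} nulls in '{col_name}'"
                )
        return df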

In this talk, the Zillow team explains how they designed their new end-to-end pipeline architecture to make creating additional pipelines robust, maintainable, and scalable, all while writing fewer lines of code with Apache Spark.
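One common way to get that effect, sketched below as an assumption rather than Zillow's confirmed design, is to consolidate the shared read/write plumbing in a base class so each new pipeline contributes only its business logic:

    # Hypothetical consolidation pattern: the base class owns the shared
    # plumbing, so a new pipeline is only a few lines of business logic.
    from abc import ABC, abstractmethod
    from pyspark.sql import DataFrame, SparkSession

    class BasePipeline(ABC):
        """Owns the shared read/write functionality for every pipeline."""

        def __init__(self, spark: SparkSession, source_path: str, sink_path: str):
            self.spark = spark
            self.source_path = source_path
            self.sink_path = sink_path

        @abstractmethod
        def transform(self, df: DataFrame) -> DataFrame:
            """Each pipeline implements only its business logic here."""

        def run(self) -> None:
            df = self.spark.read.parquet(self.source_path)
            self.transform(df).write.mode("overwrite").parquet(self.sink_path)

    class ListingCountsPipeline(BasePipeline):
        """Example subclass: the entire pipeline-specific surface area."""
        def transform(self, df: DataFrame) -> DataFrame:
            return df.groupBy("region").count()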

Members of Zillow's data engineering team discuss:

  1. How they identified pain points in the development, maintenance, and scaling of their data pipelines
  2. The advantages and disadvantages of the ETL patterns considered
  3. How they ultimately leveraged their experience to architect more scalable, robust data pipelines using Apache Spark