Derek Gorthy is a senior software engineer on Zillow’s Big Data team. He is currently focused on leveraging Apache Spark to design the next generation of pipelines for the Zillow Offers business. Previously, Derek was a senior analyst at Avanade, implementing ML applications using Spark for various companies across the technology, telecom, and retail sectors. For this work, he received the Databricks Project Partner Champion award at the 2019 Spark+AI Summit. He has a BS in Computer Science and Quantitative Finance from the University of Colorado, Boulder.
The trade-off between development speed and pipeline maintainability is a constant concern for data engineers, especially those in a rapidly evolving organization. New ingestions from data sources are frequently added on an as-needed basis, making it difficult to share functionality between pipelines. Identifying when technical debt has become prohibitive for an organization can be difficult, and remedying it can be even more so. As the Zillow data engineering team grappled with its own technical debt, it identified the need for stronger data quality enforcement, the consolidation of shared pipeline functionality, and a scalable way to implement complex business logic for downstream data scientists and machine learning engineers.
In this talk, the Zillow team explains how they designed their new end-to-end pipeline architecture to make the creation of additional pipelines robust, maintainable, and scalable, all while writing fewer lines of code with Apache Spark.
Members of Zillow's data engineering team discuss: