Yuan Feng is a software engineer on Zillow’s Big Data team. He has been working on a self-service platform that automates the ETL building process, on business datasets, and on data processing libraries leveraging Apache Spark and Apache Beam. Before joining Zillow, he built ETL and ML models at Tencent. He holds a master’s degree from the School of Computer Science at Carnegie Mellon University.
May 26, 2021 03:50 PM PT
As the amount of data and the number of unique data sources within an organization grow, handling the volume of new pipeline requests becomes difficult. Not all new pipeline requests are created equal — some are for business-critical datasets, others are for routine data preparation, and still others are for experimental transformations that allow data scientists to iterate quickly on their solutions.
To meet the growing demand for new data pipelines, Zillow created multiple self-service solutions that enable any team to build, maintain, and monitor their data pipelines. These tools abstract away the orchestration, deployment, and Apache Spark processing implementation from their respective users. In this talk, Zillow engineers discuss two internal platforms they created to address the specific needs of two distinct user groups: data analysts and data producers. Each platform addresses the use cases of its intended user, leverages internal services through its modular design, and empowers users to create their own ETL without having to worry about how the ETL is implemented.
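To make the idea of "abstracting away the orchestration, deployment, and Spark processing implementation" concrete, here is a minimal, hypothetical sketch of how such a platform might accept a declarative pipeline spec from a user and compile it into execution steps. All names, fields, and paths below are invented for illustration; Zillow's internal platforms are not public, and a real implementation would emit a Spark job and an orchestrator DAG rather than plain strings.

```python
# Hypothetical sketch of a self-service ETL abstraction. The user supplies
# only a declarative spec; the platform decides how to execute it.
# Every identifier here is illustrative, not Zillow's actual API.

PIPELINE_SPEC = {
    "name": "daily_listings_rollup",
    "source": {"format": "parquet", "path": "s3://example-bucket/listings/"},
    "transforms": [
        {"op": "filter", "expr": "status = 'active'"},
        {"op": "aggregate", "group_by": ["region"], "metrics": ["count"]},
    ],
    "sink": {"format": "parquet", "path": "s3://example-bucket/rollups/"},
}

def compile_pipeline(spec):
    """Translate a declarative spec into an ordered list of engine steps.

    A production platform would generate Spark transformations and an
    orchestration DAG here; this sketch emits readable step descriptions
    to show the separation between "what" (the spec) and "how" (this code).
    """
    steps = [f"read {spec['source']['format']} from {spec['source']['path']}"]
    for t in spec["transforms"]:
        if t["op"] == "filter":
            steps.append(f"filter rows where {t['expr']}")
        elif t["op"] == "aggregate":
            cols = ", ".join(t["group_by"])
            metrics = ", ".join(t["metrics"])
            steps.append(f"group by {cols} computing {metrics}")
        else:
            raise ValueError(f"unsupported transform: {t['op']}")
    steps.append(f"write {spec['sink']['format']} to {spec['sink']['path']}")
    return steps

plan = compile_pipeline(PIPELINE_SPEC)
```

The point of the sketch is the contract: analysts and data producers edit only the spec, while the platform owns (and can change) the execution layer underneath.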
Members of Zillow’s data engineering team discuss: