Building Robust ETL Pipelines with Apache Spark

Download Slides

Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. ETL pipelines ingest data from a variety of sources and must handle incorrect, incomplete or inconsistent records and produce curated, consistent data for consumption by downstream applications. In this talk, we’ll take a deep dive into the technical details of how Apache Spark “reads” data and discuss how Spark 2.2’s flexible APIs; support for a wide variety of datasources; state of art Tungsten execution engine; and the ability to provide diagnostic feedback to users, making it a robust framework for building end-to-end ETL pipelines.
Session hashtag: #SFdev22