Building Robust ETL Pipelines with Apache Spark


Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. They ingest data from a variety of sources, must handle incorrect, incomplete, or inconsistent records, and produce curated, consistent data for consumption by downstream applications. In this talk, we take a deep dive into the technical details of how Apache Spark "reads" data and discuss how Spark 2.2's flexible APIs, support for a wide variety of data sources, state-of-the-art Tungsten execution engine, and ability to provide diagnostic feedback to users make it a robust framework for building end-to-end ETL pipelines.

Session hashtag: #SFdev22

Learn more:

  • Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1
  • Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark
  • Integration of AWS Data Pipeline with Databricks: Building ETL pipelines with Apache Spark
