Exceptions are the Norm: Dealing with Bad Actors in ETL

Download Slides

Stable and robust data pipelines are a critical component of the data infrastructure of enterprises. Most commonly, data pipelines ingest messy data sources with incorrect, incomplete or inconsistent records and produce curated and/or summarized data for consumption by subsequent applications.
In this talk we go over new and upcoming features in Spark that enable it to better serve such workloads. Such features include isolation of corrupt input records and files, useful diagnostic feedback to users and improved support for nested type handling which is common in ETL jobs.