Kevin Kho is an Open Source Community Engineer at Prefect, where he helps users with workflow orchestration. He was previously a data scientist for 4 years, most recently at Paylocity. Outside of work, he contributes to Fugue and organizes the Orlando Machine Learning Meetup.
May 28, 2021 11:40 AM PT
Data validation is becoming more important as companies build increasingly interconnected data pipelines. Validation serves as a safeguard that prevents existing pipelines from failing without notice. Currently, the most widely adopted data validation framework is Great Expectations, which supports both Pandas and Spark workflows with the same API. Great Expectations is a robust data validation library with many features: for example, it always tracks how many records fail a validation and stores examples of failing records. It can also profile data after validation and generate data documentation.
These features can be very useful, but if a user does not need them, they are expensive to generate. What are the options if we need a more lightweight framework? The Pandas ecosystem has data validation frameworks that are designed to be lightweight; Pandera is one example. Is it possible to use a lightweight Pandas-based framework on Spark? In this talk, we'll show how this is possible with a library called Fugue, an open-source framework that lets users port native Python or Pandas code to Spark. We will give an interactive demo of how to extend Pandera (or any other Pandas-based data validation library) to a Spark workflow.
There is also a deficiency in the current frameworks that we will address in the demo. With big data, there is often a need to apply different validation rules to each partition. For example, data spanning many geographic regions may have different acceptable ranges of values (think of currency). Because the current frameworks are designed to apply a validation rule to the whole DataFrame, this can't be done directly. Using Fugue and Pandera together, we can apply different validation rules to each partition of the data.