Building Data Quality pipelines with Apache Spark and Delta Lake

May 26, 2021 11:30 AM (PT)


Technical Leads and Databricks Champions Darren Fuller & Sandy May will give a fast-paced view of how they have productionised Data Quality pipelines across multiple enterprise customers. Their vision to empower business decisions on data remediation actions and self-healing of Data Pipelines led them to build a library of Data Quality rule templates, along with an accompanying reporting Data Model and Power BI reports.

With more and more intelligence being driven from the Lake rather than the Warehouse (the Lakehouse pattern), Data Quality at the Lake layer becomes pivotal. Tools like Delta Lake provide building blocks for Data Quality, such as schema protection and simple column checking; however, for larger customers these often do not go far enough. Quick-fire notebook demos will show how Spark can be leveraged at the point of Staging or Curation to apply rules over data.

Expect to see simple rules, such as Net sales = Gross sales + Tax or values existing within a list, as well as complex rules such as validation of statistical distributions and complex pattern matching. The session ends with a quick view into future work in the realm of Data Compliance for PII data, with generation of rules using regex patterns and Machine Learning rules based on transfer learning.
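To make the idea of rule templates concrete, here is a minimal plain-Python sketch of the two simple rule types mentioned above. The function names, signatures, and sample data are illustrative assumptions, not the speakers' actual library; in the talk, such rules run as Spark expressions over staged data.

```python
# Illustrative sketch only: names and sample data are assumptions for this
# abstract, not the speakers' library. In the talk these checks would run as
# Spark column expressions at the Staging or Curation layer.

def net_equals_gross_plus_tax(row, tolerance=0.01):
    """Reconciliation rule from the abstract: Net sales = Gross sales + Tax."""
    return abs(row["net_sales"] - (row["gross_sales"] + row["tax"])) <= tolerance

def value_in_list(row, column, allowed):
    """Domain rule: a column's value must exist within an allowed list."""
    return row[column] in allowed

def apply_rules(rows, rules):
    """Partition rows into passing and failing sets for remediation reporting."""
    passed, failed = [], []
    for row in rows:
        (passed if all(rule(row) for rule in rules) else failed).append(row)
    return passed, failed

rows = [
    {"net_sales": 120.0, "gross_sales": 100.0, "tax": 20.0, "region": "UK"},
    {"net_sales": 90.0, "gross_sales": 100.0, "tax": 20.0, "region": "XX"},
]
rules = [
    net_equals_gross_plus_tax,
    lambda r: value_in_list(r, "region", {"UK", "US", "DE"}),
]
passed, failed = apply_rules(rows, rules)
```

Expressing each rule as a reusable predicate is what makes a template library composable: failing rows can be routed into a reporting Data Model to drive remediation decisions rather than silently dropped.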

In this session watch:
Darren Fuller, Developer, Elastacloud Ltd
Sandy May, Co-organiser of Data Science London and Lead Data Engineer, Elastacloud


Darren Fuller

Darren Fuller is a Lead Engineer at Elastacloud Ltd and Databricks Champion. He started his career as a helpdesk admin after finishing his A-Levels, and has worked up from there over the last 2 decade...

Sandy May

Sandy is a Lead Data Engineer and CTO at Elastacloud where he has worked for 4 years on myriad projects ranging from SME to FTSE 100 customers. He is a strong advocate of Databricks on Azure and using...