After years of studying Accounting, Mathematics, and Economics, Harper stumbled into the world of Big Data and has never looked back. Most recently, Harper has led Data Engineering teams in the NLP and Data Ops spaces, where he prioritizes folks' psychological safety above all else. In his current role as Data Solution Architect for Databand.ai, Harper loves conversations about Data Engineering pain points and how best to solve them.
May 28, 2021 11:05 AM PT
As data grows in importance and its connection to business value becomes more direct, data engineering teams are increasingly adopting service level agreements (SLAs) for how they deliver data, covering factors like data freshness, completeness, and accuracy.
In this session we'll discuss how to use Deequ, a data quality library purpose-built for Spark, to develop a data monitoring and QA system that enables you to meet the SLAs you guarantee to your analysts, data scientists, and other business stakeholders. We'll cover how to use Deequ to create quality checks that report metrics and enforce rules on data arrivals, schemas, distributions, and custom metrics; how to visualize, trend, and alert on those metrics using pipeline observability tools; and the common challenges teams face when setting up data quality logging infrastructure, along with best practices for adoption.
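To give a flavor of the kind of checks the session covers, here is a minimal sketch using Deequ's Scala `VerificationSuite` API. The `orders` DataFrame, its S3 path, and its column names are hypothetical stand-ins for your own arriving data:

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("quality-checks").getOrCreate()

// Hypothetical batch of newly arrived data
val orders = spark.read.parquet("s3://my-bucket/orders/latest/")

val result = VerificationSuite()
  .onData(orders)
  .addCheck(
    Check(CheckLevel.Error, "orders arrival checks")
      .hasSize(_ >= 1000)                        // enough rows arrived
      .isComplete("order_id")                    // no nulls in the key column
      .isUnique("order_id")                      // no duplicate keys
      .hasCompleteness("customer_id", _ >= 0.95) // tolerate a small null rate
      .isContainedIn("status", Array("NEW", "SHIPPED", "DELIVERED"))
      .isNonNegative("amount"))                  // distribution sanity check
  .run()

if (result.status != CheckStatus.Success) {
  // Surface failing constraints before bad data propagates downstream
  result.checkResults.foreach { case (check, checkResult) =>
    println(s"${check.description}: ${checkResult.status}")
  }
}
```

A pipeline can branch on `result.status` to fail fast or raise an alert, which is how checks like these back an SLA rather than just log metrics.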
We'll ground the discussion in common examples, including machine learning, data transformation, and replication pipelines (for instance, moving data from S3 to Delta Lake).
With these tools, you’ll be able to create more stable, reliable pipelines that your business can depend on.