Josh comes from a varied background with a common thread of data obsession. He started in the finance world, working first as an analyst at a quant investment firm, then at Bessemer Venture Partners, where he focused on investing in data and ML companies. Before founding Databand.ai, he worked as a product manager at Sisense, a high-growth analytics company, where he built product capabilities geared toward data engineering teams. He started Databand.ai with his two co-founders to help data engineers deliver more reliable data products. He holds a B.Sc. from Cornell University.
May 28, 2021 11:05 AM PT
As the importance of data grows and its connection to business value becomes more direct, data engineering teams are increasingly adopting service level agreements (SLAs) for how they deliver data, covering new factors like data freshness, completeness, and accuracy.
In this session we’ll discuss how to use Deequ, a data quality library purpose-built for Spark, to develop a data monitoring and QA system that helps you meet the SLAs you have committed to with your analytics users, data scientists, and other business stakeholders. We’ll cover how to use Deequ to create quality checks that compute metrics and enforce rules on data arrivals, schemas, distributions, and custom measures. We’ll show how to visualize, trend, and alert on those metrics using pipeline observability tools. And we’ll discuss common challenges that teams face when setting up data quality logging infrastructure, along with best practices for adoption.
We’ll work through common examples, including machine learning, data transformation, and replication pipelines (for instance, moving data from S3 to Delta Lake).
With these tools, you’ll be able to create more stable, reliable pipelines that your business can depend on.
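As a rough illustration of the kind of check described above, the sketch below computes three metrics for a batch (row count for data arrival, completeness of a key column, and freshness) and then enforces threshold rules on them. It is plain Python using only the standard library, not Deequ's API. The `Rule` class, the `compute_metrics` and `evaluate` helpers, the column names, and the thresholds are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A named threshold on one metric (hypothetical helper, not Deequ's API)."""
    metric: str
    description: str
    passes: Callable[[float], bool]


def compute_metrics(rows: List[dict], key_column: str, ts_column: str,
                    now: datetime) -> Dict[str, float]:
    """Compute arrival, completeness, and freshness metrics for one batch."""
    row_count = len(rows)
    non_null = sum(1 for r in rows if r.get(key_column) is not None)
    completeness = non_null / row_count if row_count else 0.0
    timestamps = [r[ts_column] for r in rows if r.get(ts_column) is not None]
    # Freshness is the age, in minutes, of the newest row in the batch.
    freshness = ((now - max(timestamps)).total_seconds() / 60
                 if timestamps else float("inf"))
    return {
        "row_count": float(row_count),
        "completeness": completeness,
        "freshness_minutes": freshness,
    }


def evaluate(metrics: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return the descriptions of every rule the batch violates."""
    return [r.description for r in rules if not r.passes(metrics[r.metric])]


# Illustrative batch: one null key, newest row loaded 20 minutes ago.
now = datetime(2021, 5, 28, 12, 0, tzinfo=timezone.utc)
rows = [
    {"user_id": 1, "loaded_at": now - timedelta(minutes=90)},
    {"user_id": None, "loaded_at": now - timedelta(minutes=45)},
    {"user_id": 3, "loaded_at": now - timedelta(minutes=30)},
    {"user_id": 4, "loaded_at": now - timedelta(minutes=20)},
]

# Assumed SLA thresholds, for illustration only.
rules = [
    Rule("row_count", "at least 1 row arrived", lambda v: v >= 1),
    Rule("completeness", "user_id is at least 95% complete", lambda v: v >= 0.95),
    Rule("freshness_minutes", "newest row is under 60 minutes old", lambda v: v < 60),
]

metrics = compute_metrics(rows, "user_id", "loaded_at", now)
failures = evaluate(metrics, rules)
print(metrics)
print(failures)
```

Keeping metric computation separate from rule evaluation means the same metrics can be logged and trended over time even when no rule fails, which is the pattern the session describes for alerting on SLA breaches.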