In this talk, I will show the range of data engineering challenges involved in acquiring accurate COVID-19 case data from hundreds of sources for an epidemiological study. I’ll walk you through how we mitigated these challenges using purely open-source Python libraries (Great Expectations and Kedro). Together, they bring software engineering best practices to the experimental nature of machine learning.
Learn how to use these tools to guarantee data quality and eliminate pipeline debt. If you have to deal with data of highly variable quality, constant upstream changes, or both, then this talk will reward you with many more hours of sleep!
Attendees are expected to have intermediate knowledge of Python and an understanding of data engineering fundamentals to appreciate this talk fully.
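As a taste of the idea behind data quality checks, here is a minimal sketch in plain Python of declaring expectations about incoming case records and validating a batch against them. It does not use the Great Expectations or Kedro APIs; the function names, column names and sample records are all hypothetical and made up for illustration.

```python
from datetime import date


def expect_not_null(rows, column):
    """Check that `column` is present and not None in every row."""
    failing = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "expectation": f"{column} is not null",
        "success": not failing,
        "failing_rows": failing,
    }


def expect_between(rows, column, min_value, max_value):
    """Check that non-null values of `column` fall in [min_value, max_value]."""
    failing = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None
        and not (min_value <= row[column] <= max_value)
    ]
    return {
        "expectation": f"{column} between {min_value} and {max_value}",
        "success": not failing,
        "failing_rows": failing,
    }


def validate(rows, checks):
    """Run every check against the batch and collect the results."""
    return [check(rows) for check in checks]


# Hypothetical batch of case records from an upstream source.
records = [
    {"region": "A", "report_date": date(2020, 4, 1), "new_cases": 12},
    {"region": "B", "report_date": date(2020, 4, 1), "new_cases": -3},
    {"region": None, "report_date": date(2020, 4, 2), "new_cases": 7},
]

# Expectations are declared once and reused for every new batch.
checks = [
    lambda rows: expect_not_null(rows, "region"),
    lambda rows: expect_between(rows, "new_cases", 0, 1_000_000),
]

results = validate(records, checks)
failed = [r for r in results if not r["success"]]

for result in failed:
    print(result["expectation"], "failed for rows", result["failing_rows"])
```

Running checks like these at the start of a pipeline means a bad upstream batch is caught before it reaches downstream modelling steps, rather than being discovered later as a strange result.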
Speaker: James McNiff
James is a Principal Engineer, specialising in Data & Machine Learning Engineering at QuantumBlack (McKinsey & Company).
With a decade of technical consulting, development and leadership experience, James has worked with several of the world's leading organisations throughout Europe, North America and Asia Pacific. His experience spans multiple industries, including Pharmaceuticals, Energy & Minerals, Retail, Financial Services and Advanced Industries.
James has extensive experience building robust, highly scalable data & ML pipelines using Python, Kedro, Spark, Databricks, Azure and AWS.