In this talk, I will show the range of data engineering challenges in acquiring accurate COVID-19 case data from hundreds of sources for an epidemiological study. I’ll walk you through how we mitigated these challenges using purely open source Python libraries (Great Expectations and Kedro). Together, they bring software engineering best practices to the experimental nature of Machine Learning.
Learn how to use these tools to guarantee data quality and eliminate pipeline debt. If you have to deal with data that has highly variable quality, and/or constant upstream changes, then this talk will reward you with many more hours of sleep!
Attendees are expected to have intermediate knowledge of Python and an understanding of data engineering fundamentals to appreciate this talk fully.
Speaker: James McNiff
– [James McNiff] Hi everybody, and thanks a lot for joining. I’d just like to say it’s an absolute pleasure to be talking to you all today. I’ve attended a number of tech conferences over the past 10 years, including a number of Spark conferences, back when I lived in Europe. This is my first time actually speaking at one, so it really is an absolute honor to be here, even if we can’t all be together in person.

So, today I’d love to talk to you about our journey through the COVID-19 situation in Australia and how we applied advanced analytics to bring some insights. And most importantly, as data professionals, I’d really like to impart some practical knowledge and techniques that you can take away with you. That, of course, is very much applicable to your own projects outside of the case study I’ll be using as the example today.

So, a very quick intro to who I am. My name is James McNiff. I’m a principal data and machine learning engineer with Quantum Black, based out of our Sydney office in Australia. We also have Cris Cunha on the line, who’s an analytics associate partner, and will be helping us with some of the questions in the Q and A section afterwards. So, for those of you who haven’t heard of Quantum Black, essentially Quantum Black is our McKinsey Centre of Excellence for Analytics and Artificial Intelligence. We were born in Formula 1 racing, and we joined McKinsey in 2015. We are a team of data scientists, data engineers, and software engineers, entirely dedicated to bringing drastic performance improvements to our clients using advanced analytics, as well as a bigger purpose: to try and bring positive change to the world with the application of these techniques and technologies.

So, COVID-19. First of all, this shouldn’t be a surprise to anybody now, but of course COVID-19 has brought the world to an unprecedented time. As of today, there have been 43 million cases around the world and, most devastatingly, 1.16 million deaths.
Now, if we go back to the 11th of March, which you see in the bottom left of the chart, the World Health Organization declared COVID-19 a global pandemic, and we started to track the growth throughout Australia. And as you can see here, around 10 days after the declaration, COVID began to display exponential growth behaviors, which indicated a very challenging time ahead in our country, especially around the health system capacity, the livelihoods of Australians, and the at-risk, vulnerable populations. So we wanted to figure out how we could best bring advanced analytics to help provide thought leadership to leaders in the country, and hopefully navigate through the many trade-offs of the various decisions which needed to be made.

So, we stood up a team with data engineers, data scientists, and domain experts, and we focused on three very important points, which, of course, are still very important today for the future of COVID in the country. The first one, which you see on the far left there, is: How can we use analytics to understand the critical behaviors of the virus itself, and the dynamics observed in other countries which were already battling with the exponential growth that we were starting to face? This involved significant descriptive analytics, ranging from clustering analysis, to understand the specific infection periods, to applying NLP to the modern faulty files and empirical research papers which were being published around the world. We were truly trying to understand the dynamics of COVID in the country.

The second point, which you see in the middle there, is: How do we produce the most accurate representation of the trajectory of COVID in Australia? But also, how do we identify the source of the transmission taking place, which is very important, as there are different levels of health interventions required to contain a certain type of transmission.
So, this involved creating a probabilistic model of Rt, the effective reproduction number of the virus, not only ingesting the confirmed cases data but augmenting it with testing, population flows, and sources of transmission. Essentially, we needed the most accurate picture of the disease right now.

And if you look at the third point, on the right-hand side: What are the potential scenarios for the Australian health system? To do this, we used the insights extracted from the descriptive models, on the left, as well as the trajectory monitoring models. And then we used that to project future possible scenarios for the health system, given that was the highest concern. So, could we meet the demand challenges in the health system that COVID would bring, in terms of the number of hospital beds and protective equipment? To do that, we built SEIR-type models, which, if you haven’t heard of them, are compartmental epidemiology models designed to simulate the disease through time, from susceptible through to recovered. We also looked at the population demographics and social behaviors around the time that the government started introducing these public health measures.

But what I’d really like to stress here is that this effort was actually only 30 percent. The other 70 percent was actually in the data itself. We’ve had a significant effort here, in data engineering, in overcoming the many challenges that we faced, or we would never have got to where we did. So, the rest of this talk is to highlight some of these, and really stress that quality data is instrumental to pretty much any model that you can think of.

So, you can imagine that in a fast-paced environment, such as this global pandemic, there are a large number of challenges when it comes to data. In particular, the inconsistencies between reporting times from different sources. And this, of course, is compounded when you have a large number of sources.
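As an aside on the SEIR-type models mentioned above: the talk does not show the team's actual model, so the following is only a minimal sketch of a textbook SEIR model, integrated with a simple forward-Euler step in plain Python. The function name `simulate_seir` and every parameter value (`beta`, `sigma`, `gamma`, the population size) are illustrative assumptions, not figures from the project.

```python
def simulate_seir(population, initial_infected, beta, sigma, gamma, days, dt=0.1):
    """Integrate a basic SEIR model with forward Euler.

    beta:  transmission rate per day (assumed value, for illustration only)
    sigma: rate of moving from exposed to infectious (1 / incubation period)
    gamma: rate of moving from infectious to recovered (1 / infectious period)
    Returns one (S, E, I, R) tuple per day, including day 0.
    """
    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # Flows between compartments during one small time step.
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Purely illustrative parameters: 5-day incubation, 10-day infectious period.
history = simulate_seir(
    population=1_000_000, initial_infected=10,
    beta=0.5, sigma=1 / 5, gamma=1 / 10, days=180,
)
# Day on which the infectious compartment is largest.
peak_day = max(range(len(history)), key=lambda day: history[day][2])
print("Peak infectious day:", peak_day)
```

A projection of hospital demand would then apply hospitalisation rates to the infectious compartment; those rates are not given in the talk, so they are left out here.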
In our case, we had over a hundred at one point, with different APIs, different government health websites, and different news articles. Of course, it’s very difficult because you have no control over the schema and the structure, which could change at any moment. And there are no subject-matter experts to talk to. You can’t just walk over to somebody’s desk and ask them the best way to use this data. And you have to understand the reconciliation rules yourself, which, of course, takes a lot of guesswork and creativity.

So, how did we get there in this, sort of, rapidly evolving landscape? Well, you can categorize these challenges into three main areas. Not all of these are technical challenges. Many of these you can’t solve with code alone.

The first one is intellectual property. So, how do you ensure that you respect the owners’ and the producers’ copyright licenses and the terms and conditions? You need to ensure that you ask for the explicit consent of the owner, if required. And once you’ve got that permission, you need to ensure that you access the data in a conscientious manner. So, it’s definitely not okay to write a bot that would ingest data from a website on a recurring basis, because that may introduce denial-of-service issues for the website owners.

The second one is around trust. So, how do we show that, as consumers, the data that we’re ingesting is accurate and credible, and that the analysis that we’re putting out is also accurate? You can think of trust as synonymous with testing. So, you can test that data is up to date. You can test that it has what is required, the expected behaviors. And you can introduce monitoring, which I’ll show you in the demo soon. And it’s important that you build all this into your pipeline.

The third one is adaptability, which, in my opinion, contained some of the greatest challenges that we encountered, in terms of data, on this project. For example, the data quality.
A very specific example: We looked at the Johns Hopkins University data, which is a very well-known source of data for COVID-19. It’s been cited by news articles all over the world. But for Australia, we saw that the case numbers were very similar day after day, which, of course, we knew was not true, because you’d look anywhere else in the country and it was very contradictory. So, how do you deal with that in a production sense? The second one is flexibility. So, ensuring that your pipeline can absorb those bumps from upstream. On a daily basis, we’d see our data structures change: different data types, different schemas, even down to the date/time format being different. And, of course, this can break a pipeline. So, you need to ensure that you’re spending time on putting out outputs, and not spending time on fixing the pipeline code every day.

So, how do we go about fixing all these issues? Well, it turns out that, actually, we’ve solved many of these problems before. Software engineers, for example, have been using automated unit testing for decades. So, really the question becomes: How do we bring these best practices and experiences, that we’ve learned from the software engineering world for managing complex code bases, and apply them to the experimental world of data science and machine learning?

So, two of the open-source tools that were instrumental in overcoming all of these challenges were Kedro and Great Expectations. Kedro, you may have heard of it before, but it’s essentially a library that helps you structure data and machine learning code into reproducible pipelines. So, it helps you write production-grade quality code from the start, and it takes best practice from the software engineering world and applies it to data science and machine learning. And this tool was actually created by Quantum Black, in-house. We created it to address many of the challenges that we faced in our own client engagements.
So, it encompasses ten years of our collective experience from doing this all over the world. And we actually open-sourced Kedro last year. You can get started by taking a look at the documentation linked at the bottom there, and at the GitHub page.

The other tool is Great Expectations. Great Expectations was not created by Quantum Black. It’s an awesome open-source Python library that allows you to essentially create unit tests for your data. And, as it says there, always know what to expect from your data. What that essentially means is that, well, software developers have long known that testing and documentation are absolutely essential for managing complex code bases. So, Great Expectations was designed to bring that same confidence, integrity, and acceleration to data science and engineering teams. The aim is to allow you to add production-ready validation to your pipeline in less than a day, which is an area that can be massively overlooked, yet is so integral. The old cliché, rubbish in, rubbish out, is very much true when it comes to data.

So, it works with all the usual tooling, such as Pandas and Spark, as well as relational databases, such as MySQL, Redshift, Postgres, and so on. You can orchestrate it with Airflow. And one thing I’ll mention here is that Great Expectations doesn’t actually run your code, in terms of the pipeline. The idea is to bring compute to the data, so you can leverage your existing engine, such as Spark. And one thing we did was build a plugin that aims to create a seamless experience between Kedro and Great Expectations, so you can use them together. We’ll be open-sourcing that plugin in the coming months, but of course you can use either of these tools in the meantime.

So, enough talk. Now, I’d love to show you some hands-on demos of both of these tools. So, you’re now looking at my terminal, and we’re in the root of the Kedro project.
As I mentioned earlier, Kedro was one of the two major pieces of open-source software that were instrumental in overcoming those challenges that I mentioned earlier in the presentation. There are a few reasons why we used Kedro for this. The first was around reusable configuration. The second one was around the ease of collaboration with our team. And the third was that it helped us create production-grade quality code from the beginning of the project.

So, one of the pieces of Kedro I’d love to show you today is the visualization. As you can see on the screen, I run it like this: kedro viz. And I give it a specific pipeline to run, in this case, and that’s just because it makes it a little easier to see on the screen during this demo. So, if I go into my browser, essentially, what Kedro viz is doing is taking the metadata from my pipeline, which, in this case, is a Python and Pandas pipeline (it could be Spark, or a mix of the two), and it’s building the DAG on top.

So, you can see here, I’ll highlight it for you, it’s made up of a number of different functions, and those functions are marked by an F on the left-hand side. And a function, of course, consists of inputs, and it has outputs. So, the output of this function is another data source: this testing data. So, the function was an API call, which obviously returned some data, and in our case it has written the data to a Parquet file. And you can see the name of the data source here. This data source is the input to another function, which I’ll mark here for you, and that’s essentially doing some cleaning on that data. And that goes on through to the testing master table at the end, which I’ll just highlight. And that master table is the input to the machine learning functions further on down. So, why is this useful, and why am I showing you this? Well, the main reason is around visibility.
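To make the idea of functions with named inputs and outputs concrete, here is a minimal sketch in plain Python. It deliberately does not use Kedro's own API; the `Node` tuple, the `downstream_datasets` helper, and all node and dataset names are hypothetical, loosely modelled on the testing-data flow described above. It shows how, once a pipeline is described this way, you can work out which datasets are affected when one upstream source breaks.

```python
from collections import namedtuple

# A node is a function plus the names of the datasets it reads and writes.
Node = namedtuple("Node", ["name", "inputs", "outputs"])

# Hypothetical names, for illustration only.
PIPELINE = [
    Node("fetch_testing_data", inputs=[], outputs=["raw_testing_data"]),
    Node("clean_testing_data", inputs=["raw_testing_data"],
         outputs=["cleaned_testing_data"]),
    Node("build_testing_master", inputs=["cleaned_testing_data"],
         outputs=["testing_master_table"]),
    Node("train_model", inputs=["testing_master_table"], outputs=["model"]),
    Node("fetch_cases_by_lga", inputs=[], outputs=["cases_by_lga"]),
]


def downstream_datasets(pipeline, broken_dataset):
    """Return every dataset that depends, directly or indirectly, on broken_dataset."""
    affected = set()
    frontier = [broken_dataset]
    while frontier:
        current = frontier.pop()
        for node in pipeline:
            if current in node.inputs:
                for output in node.outputs:
                    if output not in affected:
                        affected.add(output)
                        frontier.append(output)
    return affected


impacted = downstream_datasets(PIPELINE, "raw_testing_data")
print(sorted(impacted))
```

In this sketch, a problem in the raw testing data reaches the cleaned data, the master table, and the model, while the unrelated cases-by-LGA source is untouched.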
And if I just zoom out, you can see there’s quite a bit going on in this pipeline, but it’s definitely not crazy complex compared to some of the pipelines you’re likely to see in production, with many more data sources and a lot more functions. But that’s sort of the point: you can see everything that’s going on. And if you imagine there’s a problem with the pipeline, of course, you need to resolve it as quickly as possible. There may not be time to look through the code and see how it all pieces together, so having a way to quickly visualize it helps. And you can see all of the knock-on effects, so if something’s broken, I can see all the pieces of analysis that are going to be compromised because of that issue.

So, this is one of the major pieces of Kedro, but I’m actually gonna park Kedro here for the rest of the presentation. You can find out more about it on the GitHub page. There are a few YouTube videos as well, which do deep dives and are really great, and there’s documentation too. But for the rest of the presentation: Great Expectations.

So let’s flick over to my terminal. The first thing I want to say about Great Expectations is: What is an expectation? You can think of an expectation as unit testing for data. So, you’ll be very familiar with this from the software engineering world. It’s really taking that collective knowledge and experience from the software engineering world and applying it to data science and machine learning. So, one way to do that is unit testing for data. So, expectations are rules and tests against data. And an Expectation Suite is a collection of those rules or tests. What you can see on the screen here is a number of my data sources that I have defined in my Kedro pipeline. So, if I highlight one for you here, number 85, you see it’s in blue. Blue means there’s an expectation suite applied to it. And white means that I haven’t applied one yet.
So, how do I go about creating and editing these expectations? Well, there are a few ways to do it. The one I’ll show you right now: kedro ge edit. And then I give it the name of the expectation suite I want to edit. So let’s go to number 65. Now, I can do this by the number, or I can put in the name like I just have here. What this is going to do is spin up a sort of Jupyter environment around Great Expectations and my data, which gives me a nice environment, so you can see the expectations in a, sort of, interactive way. Now, you can do this through the JSON files as well, if you want; it’s JSON underneath this, but it’s quite nice to explore it using the notebook environment.

And the other thing I’ll say here is that you don’t have to create the expectations from scratch. You, of course, can do that if you like, but there’s actually a profiler built into Great Expectations where you give it a source, and then it will run a series of profiling steps on top to understand the schema and the structure of the data. It will then recommend rules based on that structure, which gives you a good starting point from which to go on and edit or add other rules.

So, this is the Jupyter environment it creates, and this is the notebook it’s spun up for me around my rules. If I execute this first cell, it’s essentially giving me a snapshot of the data we’re looking at here. And this is quite a simple data source. It’s just looking at the case numbers through time, broken down by province, state, and the LGA name. And you can think of an LGA as a local government area, which is, sort of akin to ASML.

Okay. So, if we scroll down, this is where the rules start to kick in. You can see here, I have my expect-table-schema rule. And what this is doing is, essentially, saying that the schema of the data should match this dictionary. So, the confirmed cases should be a float. The notification date should be a date/time, and so on.
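The schema rule described above can be pictured with a small plain-Python sketch. This is not the Great Expectations API and not the custom rule from the demo; `check_schema`, `EXPECTED_SCHEMA`, and the column names are assumptions based on the columns mentioned in the talk. The result is a dictionary with a `success` flag that serialises to JSON, similar in spirit to the output shown in the notebook.

```python
import json
from datetime import date

# Assumed column names and types, based on the columns mentioned in the talk.
EXPECTED_SCHEMA = {
    "province_state": str,
    "lga_name": str,
    "notification_date": date,
    "confirmed_cases": float,
}


def check_schema(rows, expected_schema):
    """Check that every row has exactly the expected columns with the expected types."""
    problems = []
    for index, row in enumerate(rows):
        missing = sorted(set(expected_schema) - set(row))
        unexpected = sorted(set(row) - set(expected_schema))
        if missing or unexpected:
            problems.append({"row": index, "missing": missing, "unexpected": unexpected})
            continue
        for column, expected_type in expected_schema.items():
            if not isinstance(row[column], expected_type):
                problems.append(
                    {"row": index, "column": column, "found": type(row[column]).__name__}
                )
    return {"success": not problems, "problems": problems}


# Made-up sample rows, for illustration only.
good_rows = [
    {"province_state": "New South Wales", "lga_name": "Sydney",
     "notification_date": date(2020, 10, 25), "confirmed_cases": 3.0},
]
# Simulate an upstream change where a column drops off.
broken_rows = [{k: v for k, v in good_rows[0].items() if k != "lga_name"}]

good_result = check_schema(good_rows, EXPECTED_SCHEMA)
broken_result = check_schema(broken_rows, EXPECTED_SCHEMA)
print(json.dumps(broken_result, indent=2))
```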
And this is extremely powerful because, can you imagine if this is running in a live setting, and the schema changes? Maybe a column drops off. Then that could have huge ramifications down the line. So, it’s extremely important to understand that the data is as expected. If I run this cell, you can see here that the success is true, because the schema actually does match. But, of course, this would fail if it didn’t. And the output of this is just JSON. So, you can use this in any way you like, and I’ll give you some examples further on through the demo, but it’s an output of JSON. And you can embed that in your pipeline however you like.

Okay. So if I scroll on further down, you see I’ve got a variety of tests. And by the way, for the Expectations that I’m using here, there’s a huge list on the website. I can show you that at the end. You can also create your own. So, I have an example of my own custom expectation here. And this one is expecting that the data has been refreshed in the last X days. So, what does that mean? It means that for this column, the dates… I expect that the data has been refreshed up to a day ago. If I run this, I’ll show you the JSON output: it’s saying that the maximum notification date in the data was the 25th of October, whereas the minimum required date was the day before. And that’s set by the n-days parameter I’ve added here.

Now, if I change this to minus one, obviously, that means tomorrow. The data hasn’t been refreshed for tomorrow, because that’s in the future, so this should fail, right? If I run this, you can see here that it has indeed failed. So, the maximum date is still today, because that’s when I last ran the pipeline, but the minimum required date is actually tomorrow, so it’s failed. And this is extremely powerful because you can check that the data hasn’t lagged behind by a number of weeks, which could again have huge ramifications on your analysis.
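The freshness rule just described is a custom expectation whose code is not shown in the talk. As a rough sketch of the underlying logic only, in plain Python rather than as a Great Expectations class, it could look like the following; the function name `check_refreshed_within`, the `n_days` and `today` parameters, and the result field names are all assumptions, and the sample dates (including the year) are made up to mirror the demo.

```python
from datetime import date, timedelta


def check_refreshed_within(dates, n_days, today):
    """Pass if the most recent date is no older than n_days before today."""
    latest = max(dates)
    minimum_required = today - timedelta(days=n_days)
    return {
        "success": latest >= minimum_required,
        "result": {
            "max_date": latest.isoformat(),
            "min_required_date": minimum_required.isoformat(),
        },
    }


# Made-up notification dates; the latest is "today" in this example.
notification_dates = [date(2020, 10, 23), date(2020, 10, 24), date(2020, 10, 25)]
today = date(2020, 10, 25)

# Data refreshed within the last day: passes.
fresh = check_refreshed_within(notification_dates, n_days=1, today=today)

# n_days=-1 moves the minimum required date to tomorrow, so the check fails.
stale = check_refreshed_within(notification_dates, n_days=-1, today=today)
print(fresh["success"], stale["success"])
```

Passing `today` in explicitly, instead of reading the system clock inside the function, keeps the check deterministic and easy to test.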
So, I’m gonna leave it at that. I want to leave it broken so you can see it fail when I run it in my pipeline shortly. I’ll show you another expectation. This one is expecting that the date values are within a certain range. So again, we’re looking at my notification date, and we’re essentially saying here that the minimum date is the first of Jan, which was when the data really started for COVID-19, and that the format is year, month, day. Now, let us run the rule first, and you can see it’s passed; the success is true. Now, if I change this to say… let’s say June. Run it. Obviously, it’s going to fail because there’s data before June. And what’s really useful about this is that, one, it has told me that this failed; success is false. It’s also telling me exactly why it’s failed. So, I can see here that the unexpected percent is 58.7%, which means that 58.7% of my data was actually before June. And then it goes on to give me specific examples of where it’s failed.

And again, this is hugely powerful. I can quickly find out which data frame has an issue. I can see the effects on downstream consumers of that data frame. I can then very quickly debug the problems by seeing specifically which items of data are not fitting these rules. And then go ahead and fix it.

So, we’ll keep going through some of these other rules. There’s another one here, which is expecting that the values are in the column. So, essentially, this one is saying that values in a specific column should match this list. So, for example, for the Province State, it is a list of Australian states. What I can do here is add, say, California, to the list. Now, of course, California isn’t an Australian state, so you would definitely expect this to fail. Let’s find out. Yep. Success is false. It’s failed, and it told me why it’s failed. So, the error: the column Province State does not contain the required value, California, as expected.
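To illustrate how a range rule can report not just a failure but how much of the data failed, here is a minimal plain-Python sketch of the date-range check described above. It is not the Great Expectations implementation; `check_values_between` is a hypothetical helper, the result field names are chosen to resemble the output described in the demo, and the sample dates are made up.

```python
from datetime import date


def check_values_between(values, min_value, max_value=None, max_examples=3):
    """Report which values fall outside [min_value, max_value] and what share they are."""
    unexpected = [
        value for value in values
        if value < min_value or (max_value is not None and value > max_value)
    ]
    percent = 100.0 * len(unexpected) / len(values) if values else 0.0
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(values),
            "unexpected_count": len(unexpected),
            "unexpected_percent": percent,
            # A few concrete offenders, to speed up debugging.
            "partial_unexpected_list": [v.isoformat() for v in unexpected[:max_examples]],
        },
    }


# Made-up notification dates, for illustration only.
notification_dates = [
    date(2020, 3, 1), date(2020, 5, 28), date(2020, 7, 15), date(2020, 10, 25),
]

# All dates are on or after 1 January: passes.
from_january = check_values_between(notification_dates, min_value=date(2020, 1, 1))

# Moving the minimum to June makes the two earlier dates unexpected.
from_june = check_values_between(notification_dates, min_value=date(2020, 6, 1))
print(from_june["result"])
```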
And it told me the actual values that are in there that don’t match the rule. So, we’ll leave it at that for the rules. What I’m going to do is run through the entire notebook, which will ensure that all of my rules are assigned to the Expectation Suite. And you can see, if I just scroll down, the final line is here, and this is saving the Expectation Suite and applying it to my data.

So, what I’d love to do now is show you the Expectation Suite we just edited running against live data. To run that, I type kedro run --node; because we don’t have time for the entire pipeline, I’m just going to run a single function. So, cases by LGA. What will happen when I run this is that it will hit the API, it will ingest the data, and then it will run that suite of rules against the data. And hopefully what you will see is failure for those three rules that we messed up in the notebook. So, it’s ingesting data, and you can see here that it’s now running twenty different rules against that data source. And you can indeed see that it’s failed. So, right here, I’m seeing the specific rules that failed, along with the columns that had the problems. So, for example: the addition of California for one of the province states.

And this is incredibly powerful. I mean, I know I just ran this on my Mac, but imagine if you orchestrate this in Airflow. There’s a plugin for that on the site. You could also do it in the major cloud providers. Perhaps you want to set up notifications so that it sends you a text message or an email on failure. Or maybe you want to have a really cool sort of ingestion dashboard, so you can see the status of all your jobs, down to the specific data level. Or you could also have notifications in Slack, for example, so that your team are notified instantly of any issues. The last cool thing I would love to show you for Great Expectations is the documentation.
So, this comes out of the box, and if I search here for cases by LGA, you can see it generates this awesome documentation for me. And I haven’t had to do any of this. All I’ve done is define rules for the data, and it has built this, sort of, data dictionary for me, which is incredibly useful. You can share that with your team. Perhaps you’re onboarding somebody new, or you need to have some record of the data and what’s expected. You can use it to do that here. So, for example, the confirmed cases: it’s saying that that column is required, that it should be a float, and that the maximum value should be no more than 400.

And if I go back to the previous page, and we look at the execution, what you will see here is the actual failures from this specific run that we just did. So, if I scroll down, it’s saying that it’s 85% successful, because there have been failures. These are all the tests that have passed successfully. If I scroll down, you’ll begin to see all the problems we introduced. So for example: the notification date. It’s saying here that the max date is the 25th of October, which is today, but the minimum required is tomorrow, and that’s because I set minus one as the n-days parameter. And what’s really useful, if I scroll down further, is that it actually begins to tell me all of the counts of the unexpected values. So, you’ve got 350 rows on the 28th of May, and so on. Let’s go a bit further. We’ll see the California issue that I introduced. So again, it’s saying here that California was expected, but we didn’t in fact have California in the data source.

If you go to the Great Expectations website, you can see a list of all the Expectations that are available out of the box. You can see there’s a huge number of them. If I scroll down, you’ll see some quite interesting ones around distributions, which I didn’t use in that particular notebook I showed you. You can also use multi-column Expectations, which are quite useful.
And, as I say, if you find this list isn’t comprehensive enough, then you can also create your own Expectations quite easily with Python classes and functions. There are some good examples of that in this documentation as well.

So, you’ve seen two of the tools which allow you to combat many of the challenges I mentioned. Just to summarize: You can use Kedro to structure your data and ML code into robust, reproducible, and scalable pipelines. And you can use Great Expectations to guarantee data quality, structure, and content. Together, as I’ve found out, they mean that you can get many more hours of sleep.

And just before we wrap up, I’d like to say thanks to Juliette O’Brien, who’s been doing a fantastic job with covid19data.com.au. It’s become the best place to get up-to-date COVID-19 data in this country, with very useful visualizations on top. And Juliette’s actually been manually reconciling many different sources of data in her spare time to ensure it’s as accurate as possible. It was a fantastic source to switch our pipeline over to after a few months of hard work trying to reconcile it ourselves.

And finally, I’d like to say a big thank you to all of you for joining. It’s really been a lot of fun. And if you’re interested in hearing more, you can definitely check us out on Medium. We put out regular content on data and artificial intelligence. We also run meet-ups around the world, in many different countries. And of course, it’s much easier to join meet-ups in a virtual world, so yeah, if you’re interested, definitely feel free to join, and you can check it out on our website and our email too. So, with all that being said, we’ll now switch you into Q and A.
James is a Principal Engineer, specialising in Data & Machine Learning Engineering at QuantumBlack (McKinsey & Company).
With a decade of technical consulting, development and leadership experience, James has worked with several of the world's leading organisations throughout Europe, North America and Asia Pacific. His experience spans multiple industries, including Pharmaceuticals, Energy & Minerals, Retail, Financial Services and Advanced Industries.
James has extensive experience building robust, highly scalable data & ML pipelines using Python, Kedro, Spark, Databricks, Azure and AWS.