Building Data Quality pipelines with Apache Spark and Delta Lake

May 26, 2021 11:30 AM (PT)


Technical Leads and Databricks Champions Darren Fuller & Sandy May will give a fast-paced view of how they have productionised Data Quality Pipelines across multiple enterprise customers. Their vision to empower business decisions on data remediation actions and self-healing of Data Pipelines led them to build a library of Data Quality rule templates, with an accompanying reporting Data Model and Power BI reports.

With the drive for more and more intelligence driven from the Lake and less from the Warehouse, also known as the Lakehouse pattern, Data Quality at the Lake layer becomes pivotal. Tools like Delta Lake become building blocks for Data Quality with Schema protection and simple column checking; however, for larger customers they often do not go far enough. Quick-fire notebook demos will show how Spark can be leveraged at the point of Staging or Curation to apply rules over data.

Expect to see simple rules such as Net sales = Gross sales + Tax, or values existing within a list, as well as complex rules such as validation of statistical distributions and complex pattern matching. We end with a quick view into future work in the realm of Data Compliance for PII data, with generation of rules using regex patterns and Machine Learning rules based on transfer learning.

In this session watch:
Darren Fuller, Developer, Elastacloud Ltd
Sandy May, Co-organiser of Data Science London and Lead Data Engineer, Elastacloud

 

Transcript

Sandy May: Hey everyone, and thanks for joining me on a talk about data quality today for the Data and AI Summit. We’re going to discuss some of the things that we’ve built and how we’ve leveraged Apache Spark and Delta Lake, and a few other technologies, to create this solution. So, just a quick intro to who we are. I’m Sandy May. I’m a Databricks Champion. I’m one of the co-organizers of Data Science London. I’ve done multiple speaking events across the UK. Some of my passions include Spark, Databricks, data security, digital security, and reporting platforms in Microsoft Azure. I’ll be joined today by my colleague, Darren Fuller, who is also a Databricks Champion. Like me, he’s a tech speaker across the UK, and he’s one of the lead Data Engineers at Elastacloud as well. Darren has very similar passions, Apache Spark, Microsoft Azure, but he does more cool things like Raspberry Pis and hardware that I don’t do, because I’m not as cool.
In our agenda today, what I’m going to do is go through what the problem statement is, what we need, and how we can make this easy to use. So, not just developers are able to think about and worry about data quality, but we can make it a matter for everyone within the business. Darren’s then going to pick it up from where I’ve left off, after a demo in Databricks with some notebooks and some Spark, just to show how we can turn this into a suite of tools where we can actually start to think about data quality, and how [inaudible] make the impact meaningful to the business. It’s not just highlighting bad data. Then, what we specifically learned whilst doing this, and where you guys can go from here as well.
I’m just going to give a bit of background to what the problem statement is. There was a Harvard Business Review piece in 2017 that suggested, from the US alone, there was about $3 trillion of wastage from dirty data. Dirty data can encompass all types of things, from literal string types being [inaudible], to you expecting financial data and actually getting paid child data, to things that are more worrisome, where maybe somebody injected something into the data, or you’re not pulling down the APIs that you expect, or something like that. But it’s a really big problem, and that’s just the US, and that was 2017 as well. So after the pandemic there’s been a lot more working from home, and we’ve probably generated a lot more data using the cloud and the in-built tools that are tracking this, and, even without putting our digital security hat on, there are cyber threats and things like that as well.
So, there is the value to the business, which is number one, but there’s also what the actual problems can be. So it’s not always simple. It’s not just saying, oh, the data has got holes in it, so it’s irrelevant. It might be that it’s irrelevant, but a lot of data is very, very specific to the business domain as well. And that’s where it starts to get very tricky. You can’t easily write business-generic data rules. You maybe can for some of the things that we already know, where they’re related to finance institutions or legal frameworks or things like that. So there’s a few different payment systems in the UK where you can easily say, oh, I know how FPS or CHAPS or [inaudible] will work, so we can create rules around that. But that just covers finance.
Then you also will have various different sectors, retail, e-commerce, that are going to be really hard to create generic rules around, because the way that those businesses work is completely separate. So this makes generic products quite tricky. And most things you usually have some kind of manual intervention on anyway. Sorry, I went one slide forward. So, dirty data can also be really frustrating for data scientists and BI engineers and anyone who is already using data or reporting on it. You’re only as good as your source of truth. And if your source of truth is completely wrong, then in the worst case, you could have some bad analytics, sorry, in the best case you can have some bad analytics. But in the worst case, you could be using that for very, very strategic business decisions. So, if we think about a company that’s predominantly e-commerce, what they might want to do is branch into actual retail stores, and what they may use is some data around where their web orders actually come from, who their customer base is, and what that might mean.
Okay. We want to build a new store in this hub and we’re going to make it the biggest store that we’ve got. We’re not going to put it in London, we’re going to put it in Manchester, because we see that we’ve got a lot more orders from people in Manchester. If it turns out you’ve been reporting that wrong, and it’s not actually Manchester, it’s Newcastle, and you’ve got a huge base in Newcastle but not in Manchester, you have just wasted a lot of money breaking ground on a new site somewhere you’re not actually going to get any sales. And that has two business impacts, with CAPEX and OPEX. So, it really can snowball, and then you’re chasing your tail as to what the reason for this was and how you can bring that money that you spent back to the business.
So there’s lots of things that data quality can lead to, which is why it’s quite difficult to say, oh, it’s $3 trillion across the US. That is a bit of a finger in the air to be honest, but you can see why it is what it is, $3 trillion. There’s so much wastage that will happen from dirty data. So given that, some of the questions then become, okay, we understand that there’s a problem, but what do we do about it? And usually as devs there’s two options, more or less. Unless there’s a cloud-native tool, if we’re using something like Azure or GCP or AWS and there’s not a built-in tool, it tends to become: who has built something in the cloud that I’m using already, or is there an off-the-shelf product, or should we build it? Our preference is always to build first, just to make sure that if you need to create a service around it, you can build that service and you can get it all hooked together and you’re not wasting money.
We generally find that things like this are quite easy to mock up, and that’s what we want to show today. So, some of the benefits of doing a build are: you obviously own the IP, you own the roadmap, and anything specific for your business, whether you’re actuaries or veterinary services or an e-commerce provider, whatever, you can build generically for your business. And you can have confidence that it’s written for you, not for the whole world essentially. Obviously, apart from the cloud fees or however you choose to run it, you’re not going to be paying ongoing licenses, [inaudible] contracts, and you can use your core technology. We’re obviously going to talk about Spark here, but Spark can run pretty much anywhere. So it’s nice to know that you can do that. You’re not going to be locked into, this can only ever run on AWS, or it can only ever run on GCP.
But on the other hand, there are reasons to buy as well. So, companies may have a track record. Say you’re a finance company and there’s an amazing data quality service that covers 99% of the finance data sets that you use, then maybe it’s worth looking at that rather than building it; it depends how complex that 1% is. Bugs and issues will obviously be fixed by the vendor, though we don’t necessarily know how quick they are. You won’t have much of a say in the feature roadmap, but you will tend to have service level agreements, which can help you potentially recoup some of the costs that you may have from dirty data as well.
So, some of the key design decisions that we’ve had for this are that we need to support running cross-cloud, whether that’s Azure, AWS, GCP, anywhere that Spark runs essentially. We want to make this really simple for people who understand data. So, SQL developers, [inaudible] developers, Python, people who use Spark are our target always, but really anyone who’s comfortable with data should be able to use the tools that we’re trying to build; that’s the business requirement here. We want to have a single reporting platform, so all of our data quality can come into one place and we can assess it as a business. Everything that we build, we want it to be reusable as well.
So if I build a rule that’s specifically about working out what the gross income is on a product, and that’s net plus tax, I want to use that everywhere. I don’t want to define it for one data set and then have to define it 50 times for 50 different products. And most importantly, we want it to run as part of our data ingestion pipeline tool set as well. For us, that data ingestion tool set is the Microsoft enterprise data warehouse tool set. So, if you look for this diagram, you won’t find it, because Microsoft has decided to update it with technology that doesn’t work, but essentially this is what we see as a business and most people see. So, you use Data Factory to chain everything together, source stuff in the lake, use Databricks to do that landing to staging, staging to curated, and then create and serve your model.
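To make that reusable-rule idea concrete, here is a minimal sketch, not the actual Elastacloud library, assuming hypothetical column names (net_sales, tax, gross_sales) and a hypothetical tolerance, of how a gross = net + tax check might be written once as a Spark expression and applied to any DataFrame that carries those columns:

```python
# Minimal sketch of a reusable data quality rule; column names and the
# tolerance parameter are illustrative assumptions, not the real library.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def gross_equals_net_plus_tax(df: DataFrame, tolerance: float = 0.01) -> DataFrame:
    """Flag each row as passing or failing the gross = net + tax rule."""
    return df.withColumn(
        "rule_gross_check_passed",
        F.abs(F.col("gross_sales") - (F.col("net_sales") + F.col("tax"))) <= tolerance,
    )

# The same function can be reused across any data set with these columns.
sales = spark.createDataFrame(
    [(100.0, 20.0, 120.0), (100.0, 20.0, 125.0)],
    ["net_sales", "tax", "gross_sales"],
)
gross_equals_net_plus_tax(sales).show()
```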
So, really Databricks is the heart here, the actual Spark engine; that’s where we want to define all of our rules. So I’ll flip over to a demo and give you a view of what this looks like. This is the Azure environment that we’ve set up, really just using what we said back there, that modern enterprise architecture. A few differences: we’ve got SQL DB instead of Synapse, just because we’ve got small data at the minute, but everything runs through the data factory to come into our landing and storage, and then we’ll run Databricks jobs over the top to stage and curate that data. So that’s where we’re targeting, that landing to staging area in this specific workflow. So, some of the data that we’re going to go through is for a charge in the UK that is put on top of the energy grid. So if you are a supplier of energy to the UK grid, you will have to pay a charge called the [sewers].
And the sewers is essentially a market charge to say, okay, you’ve generated some energy, I’m going to take a levy as the National Grid just to pay for all the infrastructure that you’re using. This charge changes every 13 minutes dependent on supply and demand. So it’s quite hard to predict; we take in lots of datasets to predict this charge and then push it back out to customers so that they can consume this prediction of what they’ll actually pay for the sewers. So what we’ve tried to do is encapsulate lots of different parameters that we can pass to Databricks, and thus Spark, using notebooks. So using things that people can interact with. This would be set up by your data set owner essentially, and we’ve just created templates out of everything. So it’s really simple to use and pretty much everything is parameter driven as well.
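As a rough illustration of that parameter-driven setup, a Databricks notebook can expose its configuration through the standard widgets API; the parameter names and defaults below are made up for the example, not the ones used in the actual pipeline:

```python
# Hypothetical parameter-driven configuration for a Databricks notebook.
# dbutils.widgets is the standard Databricks widgets API (available inside
# Databricks notebooks); the names and defaults here are illustrative only.
dbutils.widgets.text("sourceFlow", "physical_bm", "Source flow / business unit")
dbutils.widgets.text("tableName", "physical_bm_data", "Target table")
dbutils.widgets.text("runDate", "2020-08-21", "Run date (updated by ADF on each run)")
dbutils.widgets.text("bronzeBoundary", "10", "Bad-record % at or above which a file is bronze")
dbutils.widgets.text("silverBoundary", "5", "Bad-record % at or above which a file is silver")

source_flow = dbutils.widgets.get("sourceFlow")
run_date = dbutils.widgets.get("runDate")
bronze_boundary = float(dbutils.widgets.get("bronzeBoundary"))
silver_boundary = float(dbutils.widgets.get("silverBoundary"))
```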
So, you’ll see at the top, we have all of these different parameters that will come in based on a job, and they’ll just update the date and things like that every time it runs, which is daily. So we have this concept of a bronze, a silver, a gold and a platinum boundary, and that’s a boundary for doing a file rating. A platinum file we would just see as having no issues with it whatsoever; that doesn’t change. And then we’ve created boundaries for bronze and silver. So, what 10 for bronze means is that if 90% or less of the records within this data set are good, it’ll be bronze. If it’s between 90 and 95%, it will be a silver rated file, and if it’s between 95 and 99.999999%, it will be a gold file. So at the end of this run, we’ll get a rating for all the different files, as well as the overall flow.
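A minimal sketch of that rating logic, assuming the boundaries just described (bronze at 90% good or below, silver up to 95%, gold up to but not including 100%, platinum only when every record passes); the function name is hypothetical:

```python
# Illustrative file-rating function based on the boundaries described in the
# talk; not the actual implementation from the library.
def rate_file(good_records: int, total_records: int,
              bronze_boundary: float = 10.0, silver_boundary: float = 5.0) -> str:
    """Return bronze/silver/gold/platinum from the percentage of good records."""
    good_pct = 100.0 * good_records / total_records
    if good_pct == 100.0:
        return "platinum"                     # no issues whatsoever
    if good_pct <= 100.0 - bronze_boundary:   # 90% good or less
        return "bronze"
    if good_pct <= 100.0 - silver_boundary:   # between 90% and 95% good
        return "silver"
    return "gold"                             # between 95% and just under 100% good

print(rate_file(999, 1000))   # gold
print(rate_file(930, 1000))   # silver
print(rate_file(850, 1000))   # bronze
```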
Because we know that with Spark jobs we often get a lot of batch data and things like that, we’ll give file ratings in case your data comes from something like SAP exports or different kinds of API loading processes. We’ll do it at the file level, but also at the whole run level as well. There’s also different metadata that we’ve enabled that you can pass through. So you can have those [inaudible] and from flows; this will represent a business unit essentially, and what the dataset is. So, this is the physical BM data, which is one of the data sets used to predict the sewers. So essentially what we’ve tried to create is just something that explains what’s going on, but that the user doesn’t actually have to change. It’s just a notebook that brings a load of configuration down, but it explains what’s actually happening, connecting to SQL and things like that.
But that’s not the bit that we actually care about. What we’ve done is invested a lot of time into trying to make things really, really simple for users to use, really, really simple from a generic standpoint. So we essentially just have a single folder that encapsulates all the rules for a specific run, and you can just copy rules into it and change them, again passing parameters in. So what this is going to do is define a rule which is going to check for null values. You can tell based off the name, and there are only a few things that a user actually has to change here: they have to change a column name, what the table is, and what the actual source flow is, but a lot of these just get passed through in notebook flows as well. So the rule is very simple. It just checks that a column is not null, and every time the data set runs, it will check that the column settlement period, which is those 13 minute windows, isn’t null in this dataset.
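As a rough illustration, not the actual rule template, a parameterised not-null check in PySpark could look something like this, with the function and column names assumed for the example:

```python
# Sketch of a parameterised not-null rule in PySpark; the function, result
# shape and column names are illustrative, not the real rule template.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def not_null_rule(df: DataFrame, column: str, rule_name: str) -> dict:
    """Count how many records fail a simple 'column must not be null' check."""
    total = df.count()
    failed = df.filter(F.col(column).isNull()).count()
    return {
        "rule": rule_name,
        "column": column,
        "total_records": total,
        "failed_records": failed,
    }

# e.g. not_null_rule(physical_bm_df, "SettlementPeriod", "settlement_period_not_null")
```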
And then we’ve done this for all sorts of different types of rules, again with the same concept of trying to make this really, really simple for the users. We’re trying to make it really simple, so if you know some SQL, you can come in and change this around. So, create a categorization, again really simple, saying that Spark is going to go through and it’s going to check that these values are present within a specific column. So, a column value has to be within these, and that’s it. It’ll go through and it will do it. Again, really generic, it can be passed through. But the framework that we’ve tried to build is one where it just equates to a Spark rule. So, you can do whatever complex things you want. You can do complex when-otherwise statements, you can do complex mathematical operations on that data.
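In the same spirit, a categorization rule boils down to an isin check, and any boolean Spark expression can become a rule; here is a minimal sketch with made-up column names and allowed values, not the project's actual rule definitions:

```python
# Sketch of a categorization rule and a free-form expression rule in PySpark;
# column names, allowed values and expressions are made up for illustration.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def categorization_rule(df: DataFrame, column: str, allowed_values: list) -> DataFrame:
    """Flag rows whose column value is not in the allowed list."""
    return df.withColumn(f"{column}_in_allowed_list", F.col(column).isin(allowed_values))

def expression_rule(df: DataFrame, rule_name: str, sql_expression: str) -> DataFrame:
    """Any boolean SQL expression can act as a rule, e.g. complex CASE WHEN logic."""
    return df.withColumn(rule_name, F.expr(sql_expression))

# e.g. categorization_rule(df, "FuelType", ["WIND", "CCGT", "NUCLEAR"])
# e.g. expression_rule(df, "level_within_bounds", "Level >= 0 AND Level <= 50000")
```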
And you can just validate that back as a rule. So again, somebody with a little bit of Spark knowledge, a little bit of SQL knowledge should be able to come in, write something that roughly translates into a rule, and then that can just be run in the framework, it can go and do its bits, and they can bring that data out at the end. Part of what we had to do as well is build a framework in the backend. So we just have a jar that literally runs all of these rules from the notebook and turns the output into a really nice format, so we can then do reporting on it. Because running rules is relatively simple when you build the framework, but if you’re just dumping data into a table, that’s really not valuable back to the business at the end of the day.
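A rough sketch of that back-end idea, running a list of rules over a DataFrame and collecting the results into one DataFrame that could be written out to Delta for reporting; the function, result schema and Delta path are assumptions for illustration, not the jar's actual API:

```python
# Sketch of a rule runner that collects results for reporting; the rule list
# format, result columns and Delta path are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_rules(df, rules, source_flow, file_name):
    """Apply each (rule_name, boolean_sql_predicate) pair and collect pass/fail counts."""
    results = []
    total = df.count()
    for rule_name, predicate in rules:
        failed = df.filter(f"NOT ({predicate})").count()
        results.append((source_flow, file_name, rule_name, total, failed))
    return spark.createDataFrame(
        results,
        ["source_flow", "file_name", "rule_name", "total_records", "failed_records"],
    )

# results = run_rules(df, [("settlement_period_not_null", "SettlementPeriod IS NOT NULL")],
#                     "physical_bm", "physical_bm_2020-08-21.parquet")
# results.write.format("delta").mode("append").save("/mnt/dq/results")
```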
So, we just encapsulated all of that merging and turning things into facts and dimensions so you can put it into a table. Just with a little bit of data manipulation we can have airtight reporting, and we can report on top of the actual schema that we bring in as well. So, we can validate, from both the source and the target side, whether something exists. So, we get a really rich view of how that data actually looks, and that’s where the meaning gets to the business. And with all that, I’ll hand over to Darren, just to give you a bit of a flavor of what this looks like from a reporting standpoint and how we can add actual business value back from data quality as well.

Darren Fuller: Thanks Sandy. So, once we’ve collected that data from the Databricks notebooks that Sandy’s just shown and it’s been pulled through into the database, we’re then going to want to show it, not only to our data owners, but to other stakeholders in the business, so they can see: what quality is my data in today, is my data any good, am I making decisions based on the best possible quality data that I have available to me now? And that’s what we’re showing here in the Power BI dashboards that we’ve gone and built. If you ignore the last validation date for now, this is an older data set that we’re using purely for the purposes of this demo. So, the data set Sandy was demonstrating was the physical BM data, which he picked deliberately because it’s the smallest possible bar on this chart for me to go and pick on.
But if we go and select that one, we can start looking at, well, how does this data compare to other data that we’ve ingested? And as we can see here, we’ve now narrowed this down to just the physical BM data, and we can see how this is changing over time. And this is useful for being able to see: is my data relatively consistent in terms of the number of errors being reported, is it typically good quality data that has suddenly dropped down in its data quality, or is this historically really bad data that’s just suddenly gotten very good? All of those things we might want to go and investigate with our data source providers. Working from this as well, we can have a look at the kind of percentage this occupies, the number of files, and the kind of quality we’re looking at.
And also: who are the data owners who are getting the best data, who’s getting the worst data, who do we know we need to go and talk to about making this a better place for us? So, as we can see here, the physical BM data is entirely with Antonio, and we can see that, of Antonio’s data, this is a significant proportion of the errors that he’s seeing. Other things that we can do in this as well: we can start drilling through. So, if we look at the number of files by source and quality, well, where’s the quality on this? From that we can drill down. And from that drill down, we can see, okay, well, this is good. Well, at least it’s not too bad. Almost 20% of our data is coming through at gold quality standards, which, as you can remember from Sandy’s part of the presentation, is above 95%, but we’ve still got 80% within that silver range. And good for us, we’ve got no bronze data here.
So, if I back up on this, what else can we do? We can also then start drilling through into this as well. And we can have a look at things like: what does the flow of data look like, and how does this affect our rules? So, if we drill down into the flow now, we can see, for instance, that for this data source we’ve had five gold files and we’ve had 43 silver files. How has that changed across time? This doesn’t look too interesting when you’ve only got two data feeds, but it shows how the data is varying based on the amount of gold data and silver data coming through. And we can start looking at, okay, well, what are these files looking like? And from these files, we can then drill through into our file drill, and this is going to start giving us information about that specific parquet file that was ingested.
So we can see this is one that came through on the 21st of August 2020, at 20 minutes past 11. We’ve got just over four and a half thousand rows failing. And if we want to get a view of what that looks like, we can then drill even further down into that and we can start seeing, okay, well, what were the rows that were giving us our problems here? And all of this is really useful information. So, from a data owner’s point of view, from a stakeholder’s, they can drill down into the exact file that’s been giving us problems. So, they can go back and talk to their data team. And from that, it’s not just, hey, we’ve received some data and it’s not looking great. It’s: we’ve received this data, it was in this file and around this time, and these are the exact errors that we were seeing, and all of this is going to drive not only the investigation from their point of view.
So, did we go off and get the wrong data, did we get bad data, was there a fault when we were collecting it? But then they can also go back to the data provider and say, “Hey, were you having issues at this time?” And all of this is helping to build better relationships and get better data, and we can drive forward better decisions from the end of this. So, once we’ve gone through and we’ve collected all that data, what do we do with the data? What have we learned from this? So let’s go through that. As Sandy went through earlier, this boils down to a buy-versus-build decision. More often than not, if you can go and get something off the shelf that does exactly what you want, then let’s just go buy it. But data quality is unique to many, many businesses; not everybody’s rules are the same, not everybody’s data is the same.
Not everybody thinks about bronze, silver, and gold in the same way. So in that respect, and as Sandy showed with the notebooks just how quickly you can build up those rule sets, in this particular case building it might be the better route, and more than likely is the better route. One of the reasons for that is you can prioritize what’s important to you as a business, because you know your data best, and if you don’t know your data best, then have a word, have a conversation with the data analysts, go and talk to the data suppliers until you get that understanding. Because until you know that data, you can’t really do anything with it.
Ultimately you might be looking at: actually, we do want to go and buy a solution. We don’t want to have to deal with system updates. We don’t want to have to deal with checking Spark versions and library versions. We don’t want to have that maintenance cost. So, we are going to go buy. Until we get there though, let’s see what we can build, just so that we have that interim solution while we’re building it up. We’re building up that data quality. We’re building trust in our data with our business. The reason it’s in Databricks, using that modern data warehouse pattern, is that as the data’s coming through, we’re building a pipeline in ADF. We’re taking data from multiple sources. We’re landing it in our data lake, we’re transforming it. We’re running our data quality rules. We’re making that available to data analysts, to our data scientists, and to our Power BI users and other BI users.
So, by working this way, it’s really easy to run this as part of our ingestion pipelines. And as I mentioned already, there’s a lot of business value in these reports, not only from being able to see very quickly, as we’re coming in each day, what our data looks like, but at the point we’re getting close to making that critical decision: are we still working with the best possible data that we have inside of our business? And as we had a look at earlier, all these rules that we’re building up, well, we can reuse them. Data is going to come through from a number of sources. Some of them are going to have to have null checks. Some of them are going to have to have categorization checks. We’re going to have to go and look at, are our numbers within the right bounds? If we’re expecting something to come through as a temperature, does it make sense, whether it’s the temperature inside of an oven, or whether we’re dealing with freezing temperatures?
Are the extremes too far? So we build up that rule set, we build that catalog of it, and then a new data set comes along. We can just go to our internal shelf and say, I’ll have rule one, I’ll have rule two, I’ll have rule three. And because we’re building on top of Databricks, of course we’ve got Delta in there for all of our data needs, and we can use that for handling things like Schema Evolution.
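As a small illustration of those Delta features, schema evolution on write and time travel on read look roughly like this; the table and file paths are made up for the example, not the project's actual layout:

```python
# Illustrative Delta Lake usage: schema evolution on write and time travel on
# read. Paths and dates are hypothetical; this is not the actual pipeline code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/curated/physical_bm"   # hypothetical Delta table location

# A new upstream field can be merged into the existing table schema on write.
new_batch = spark.read.parquet("/mnt/landing/physical_bm/2020-08-21")  # hypothetical
(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # allow the extra column through
    .save(path))

# Time travel: look back at what the data looked like a week ago, or at an
# earlier table version.
week_ago = spark.read.format("delta").option("timestampAsOf", "2020-08-14").load(path)
previous_version = spark.read.format("delta").option("versionAsOf", 3).load(path)
```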
So, a new data source comes in, or an existing data source has changed, somebody has introduced a new field. It’s just going to come through the system and it will keep on coming through. And at some point we’ll say, actually, we need a data quality rule on this, let’s go and build one. But it’s not going to stop that process from working. And of course, Delta then gives you many other options in there, including things such as, well, when did my data change? Do I want to go and look back in time and see what happened to this data a week ago? Once we have all of that information in place, we can start making more and better decisions about our data as a business and drive out better value from it. Thank you very much.

Darren Fuller

Darren Fuller is a Lead Engineer at Elastacloud Ltd and Databricks Champion. He started his career as a helpdesk admin after finishing his A-Levels, and has worked up from there over the last 2 decade...

Sandy May

Sandy is a Lead Data Engineer and CTO at Elastacloud where he has worked for 4 years on myriad projects ranging from SME to FTSE 100 customers. He is a strong advocate of Databricks on Azure and using...