Data Quality With or Without Apache Spark and Its Ecosystem

May 28, 2021 11:40 AM (PT)


A few solutions exist in the open-source community, either in the form of libraries or complete stand-alone platforms, which can be used to assure a certain level of data quality, especially when continuous imports happen. Organisations may consider picking up one of the available options – Apache Griffin, Deequ, DDQ and Great Expectations. In this presentation we’ll compare these open-source products across different dimensions, like maturity, documentation, extensibility, and features like data profiling and anomaly detection.

In this session watch:
Serge Smertin, Resident Solutions Architect, Databricks


Transcript

Serge Smertin: Hello everyone. My name is Serge and I’m going to be speaking about data quality with or without Apache Spark and its ecosystem. So, overall we will talk about dimensions of data quality, about the frameworks, and some short summaries about those.
So quickly about me, I’m Serge Smertin. I work as a Senior Resident Solutions Architect at Databricks in Amsterdam, and I’ve been working in all stages of the data life cycle for the past 14 years. I’ve built data science platforms from scratch, I’ve analyzed malware and done data forensics, and I’ve built PII analysis measures for the payments industry. And here at Databricks, I’m bringing strategic customers to the next level as my full-time job.
So data quality requires a certain level of sophistication within the enterprise to even understand that it’s a problem. This quote was from 2006, but it actually pertains even to the present day, because what really happens is that the data team deploys the first pipeline to production, everyone is happy, then they deploy the second, third, fifth and tenth pipeline to production. But then they start thinking: hmm, are we actually monitoring the data? Is our ETL pipeline actually healthy and robust enough for production use, for other teams to trust the datasets produced by this pipeline?
And then the whole big picture of data management comes into play. So, almost every enterprise I’ve seen or worked in or worked with had a similar picture, and let’s break it down to the most fundamental parts. As you can see, everything revolves around the enterprise and their ETL process, which is extract, transform, load your data, and the ETL process is really something huge.
It’s usually orchestrated by different tooling like Azure Data Factory, or Airflow, or the Databricks jobs framework, or a homegrown job framework. And essentially ETL just moves the data around. Well, fine, not that unusual so far, but then the companies who are thinking about data quality start thinking about introducing data quality checks, and then they ask us: okay, where do we need to run data quality checks, and how? Well, the simplest answer that will get you far along the way is to include data quality checks in your ETL process. So whenever you create a table, make sure to define some of the data quality rules within the same file or within the same class, so that every time you get a new increment of the data, you also record metrics about data quality: the number of unique records, the number of records with missing values, the number of records in the wrong format. Of course, you don’t want any records that have the wrong format, and you don’t want records with duplicate values. But yeah, that’s real-life data integration.
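To make that concrete, here is a minimal sketch of recording data quality metrics alongside each new increment, written in PySpark; the column names, the input path and the dq.metrics table are hypothetical illustrations, not something from the talk or slides.

```python
# Minimal sketch: record data quality metrics for each new increment of data.
# Column names (id, email), the input path and the metrics table are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
batch = spark.read.format("delta").load("/data/events/incoming")  # hypothetical path

metrics = batch.agg(
    F.count("*").alias("total_records"),
    F.countDistinct("id").alias("unique_ids"),
    F.sum(F.col("email").isNull().cast("int")).alias("missing_email"),
    # counts non-null emails that don't look like an email address
    F.sum((~F.col("email").rlike("^[^@]+@[^@]+$")).cast("int")).alias("bad_email_format"),
).withColumn("batch_time", F.current_timestamp())

# Append the metrics alongside the ETL output so dashboards and alerts can use them.
metrics.write.format("delta").mode("append").saveAsTable("dq.metrics")
```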
And for things happening in production, it’s more important to move fast and iterate fast, and not stop until you’re done, than to build the perfect system from the beginning and deliver it in a year rather than within a month. So, if you move fast enough and add data quality a bit later, you are very good to go. Well, you record your data quality metrics, more about that in the following slides. You have to do something with your data quality metrics. Of course, you need to display them in dashboards that will be presented to your on-call teams as contextualized links. So your dashboards might be in different tooling like Power BI or Databricks SQL Analytics, or things like Tableau or Looker, where you just show the number of records or the curve of how the data looked over time.
And essentially you need to deliver the links to these contextualized dashboards through some alerting mechanism. You can do alerting in many different ways within the modern-day ecosystem, but we would recommend the standard tooling that you already have in your organization, be it Nagios or Azure Monitor or Amazon Simple Notification Service or something else, just use it. Make sure to include a component in your alerting system that might be called something like a noise gate filter. What is a noise gate filter? So imagine you have 10 people on call who are looking at all of the alerts, not only about your pipeline, but also about your website latency or the health of your virtual machines or things like that. And people who are on call are so stressed and looking at so many things that the smaller the number of alerts you’re sending them, the better it is.
So having a successful machine-learned noise filtering mechanism in your alerting framework is essential for your on-call people to trust the alerts from your data quality. And it’s also essential for the upper management to have trust in your data, because essentially whenever things go south, the people who are on call, even you, can receive the alert in proper time and then fix the problem. The other thing you might ask: well, we have quality checks, but how do we actually make them? And then the concept of data profiling comes into play. Data profiling is generally part of larger enterprise data catalog initiatives, but sometimes it goes without it; more about that in later slides.
So let’s dive deeper into the types of quality checks that you need to be doing in your data monitoring pipeline. There will be things like completeness: are all of the items recorded as they have to be?
You also have to check for consistency, meaning that you have to check if you can join the same record in your data lake with records coming from, or located in, other data sources. An example of a format that might be inconsistent is the date format. In the United States, the most common format is month/day/year, and in Europe, the date format might be day/month/year. Essentially 5/09/2021 might mean either May the ninth or September the fifth, and if you don’t have a consistent format for dates, you’re in big trouble and you have inconsistent data. This is just one of the many examples of inconsistency.
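As an illustration, here is a minimal sketch of that kind of date-format consistency check in PySpark; the column name, the sample rows and the expected format are hypothetical.

```python
# Minimal sketch of a consistency check: count records whose date strings
# don't parse with the agreed format. Column name and format are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
batch = spark.createDataFrame(
    [("2021-05-09",), ("09/05/2021",)], ["event_date"]
)

# With ANSI mode off, to_date returns null when the string doesn't match the format.
parsed = batch.withColumn("parsed", F.to_date("event_date", "yyyy-MM-dd"))
inconsistent = parsed.filter(F.col("parsed").isNull())
print("records with inconsistent date format:", inconsistent.count())  # 1
```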
The other dimension is uniqueness. Obviously your data set must not contain duplicates; mostly they don’t happen, but when they do, you have to build a strategy for what to do with them. And duplicates may happen because of limitations of the infrastructure with at-least-once delivery. Sometimes it’s just easier to accept at-least-once delivery and fix the data at later stages than to make sure that your delivery mechanism is exactly-once. The other dimension that you need to be looking at is timeliness. Essentially, you need to make sure that the latency of your ingest, of the pipeline, is no bigger than an acceptable threshold. And if you’re monitoring that threshold, you’re in a good position, because if suddenly the data is not arriving at its final destination in due time, you might be getting some SLA breaches. Now, that being said, we also need to look at relevance and validity, and we’ll go deeper into those.
So there are two fundamental approaches to how data quality is done nowadays. The first one is record level; it’s stream-friendly, and if you are doing record-level data quality you can quarantine invalid data so that you can debug and reprocess it at later stages. Make sure that you are going to watch or re-watch the Make Reliable ETL Easy on Delta Lake talk, because that talk is going to reveal something interesting and you might learn about new concepts; I usually recommend re-watching it. This talk is going to be about data quality on the whole-database level, which is batch by nature; essentially its purpose is to see the health of the entire pipeline. The aim of database-level data quality checking is to detect processing anomalies, which you cannot really detect easily by using purely record-level data quality.
And usually for whole-database-level quality testing, you employ reconciliation testing and mutual information analysis techniques, essentially something you cannot really do easily or fast enough when you are only doing record-level data quality. Do you need to pick one? No. Do you need to use both of them? By all means, pick both of them, and use either more database-level or more record-level data quality depending on your needs. But remember: record level is to quarantine and reprocess later, and database level is about making sure that the whole dataset is consistent.
So there are a couple of ways you can make data quality checks. Obviously, one may say: ask subject matter experts, because they know how to define the shape of the data. But they may not fully cover all of the corner cases, and there might be more data sets than actual subject matter experts, so only relying on expertise would not get you far.
The other approach is exploration. With exploration, the people who are making the data quality checks are becoming the experts on the data. This really happens when you onboard new data sources frequently enough and you kind of develop the subject matter expertise of what’s in those. Doing exploration-style data quality check development may also result in under-coverage, and you may miss alerts once in a while, but still it’s better than just relying on subject matter expertise, which might not be there yet.
And the third approach is semi-supervised code generation of data quality rules. Essentially it heavily relies on profiling and helps data engineers who need to cover new data sets with automated tests quickly enough. Data engineers working with automated data quality rule generation may need to edit the generated code, because generated code may overfit the rules with too-strict criteria, and people on call may receive too many alerts and stop trusting the alerts on data quality. Once you lose the trust of the alerting source, it’s very difficult to gain it back, so make sure you don’t over-alert from the beginning. And the most important thing here: use all three of these techniques to develop the quality rules, because just relying on one would not get you far.
A couple of solutions exist on the market, and they are either community-based or they are vendor-backed enterprise software tools. Some of those libraries are shipped as embeddable Python wheels or Java JARs, and sometimes they are shipped as stand-alone platforms. It’s up to you to pick the one that best fits the needs and habits of your organization. All of these tools employ techniques like success keys; essentially a success key is a zero-or-one key that is averaged to show the percentage of valid records. It’s very stream-friendly, so record-level data quality uses only success keys. An example of a success key is: does my email have a valid format, one or zero; or are all of the fields on this record valid or not, one or zero. The other type of technique may be called domain keys; domain keys check the incoming batch against the previously existing data set. It might be slow to compute, but something like computing the number of unique keys is the perfect example of the domain key technique that validates against the existing dataset.
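Here is a minimal sketch of the success key technique in PySpark — a 0/1 flag per record averaged into a percentage of valid records; the column name, sample rows and regex are hypothetical.

```python
# Minimal sketch of a "success key": a 0/1 flag per record, averaged into a
# percentage of valid records. Column name and sample data are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
batch = spark.createDataFrame(
    [("a@example.com",), ("not-an-email",), (None,)], ["email"]
)

flagged = batch.withColumn(
    "email_valid", F.col("email").rlike("^[^@]+@[^@]+$").cast("int")
)
# Treat null emails as invalid (0) so they count against the percentage.
flagged.agg(
    F.avg(F.coalesce("email_valid", F.lit(0))).alias("pct_valid_email")
).show()
```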
The other technique that is employed frequently is called dataset metrics, which is essentially a materialised synthetic aggregation like: is this batch two standard deviations away, by number of records, from previous batches? And whenever you have an incoming batch that is two sigmas away, you raise the alert. This technique is the most simple one, but it actually catches a lot of serious issues without over-alerting. And the other technique that is employed might be called reconciliation tests, where essentially you’re repeating the entire computation independently to verify things like: the balances on the accounts end up in the same buckets. The typical example here is double-entry bookkeeping, where the number of debit dollars and the number of credit dollars must zero out. And essentially some systems build reconciliation testing into their design.
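As an illustration of the dataset-metric technique, here is a minimal sketch that flags a batch whose record count is more than two standard deviations away from history; the dq.metrics table and the batch count are hypothetical, carried over from the earlier sketch.

```python
# Minimal sketch of the "dataset metric" technique: alert when the current batch
# record count is more than two standard deviations away from previous batches.
# The dq.metrics table is the hypothetical metrics store from the earlier sketch.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

history = spark.table("dq.metrics").agg(
    F.avg("total_records").alias("mean"),
    F.stddev("total_records").alias("std"),
).first()

current_batch_count = 12345  # hypothetical count of the incoming batch

if abs(current_batch_count - history["mean"]) > 2 * history["std"]:
    print("ALERT: batch size is more than two sigmas away from history")
```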
So let’s talk about frameworks. If you build your own everything, consider embedding the tool called Deequ. It has a constraint suggestion engine and data profiling, among other enterprise features. This tool is very mature in terms of check availability, though the documentation is pretty scarce; you have to read a lot of Scala source code. And most likely, my big advice, you’ll have to fork Amazon Labs Deequ into your organization and start maintaining a copy of it. You will save a lot of time by just forking the tool internally, and you’ll not be relying on the slow pull request acceptance that shows up as a historical trend in this repository. It’s been there for a while, it’s backed by Amazon Labs, and there is no production support for it, but you can use it and save yourself the time of writing a data quality framework; it’s better to use an existing one.
One of the very good features that I really like about Amazon Deequ is the code generation. Given a dataset, it can start a Spark job that will profile your entire dataset and then output perfectly working automated data quality checks. Essentially you will not have to spend time figuring out what the allowed ranges of values are, or what the uniqueness constraints of a column are. You just launch the constraint suggestion runner, and it will spit out the code that you adapt a bit and put into your production pipeline. But you’ll have to do it only after you collect at least a week or a month of data.
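Here is a minimal sketch of that constraint suggestion flow using the PyDeequ bindings; it assumes the Deequ JAR is on the Spark classpath and pydeequ is installed, and the table name is hypothetical.

```python
# Minimal sketch of Deequ's constraint suggestion (code generation) via PyDeequ.
# Assumes the Deequ JAR is on the Spark classpath; the table name is hypothetical.
from pyspark.sql import SparkSession
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # a week or a month of collected data

suggestions = (ConstraintSuggestionRunner(spark)
               .onData(df)
               .addConstraintRule(DEFAULT())
               .run())

# Each suggestion comes with ready-to-use check code that you adapt a bit
# and put into your production pipeline.
for s in suggestions["constraint_suggestions"]:
    print(s["column_name"], "->", s["code_for_constraint"])
```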
The second tool I would like to tell you about is called Great Expectations, which is a less enterprisey data validation platform. It’s written entirely in Python, it’s Python-first, and it focuses on supporting Apache Spark among other data sources like Postgres, Pandas, SQLAlchemy, BigQuery, and others.
And if you’re a Python-first company, you might look at this. Essentially Great Expectations has native bi-directional integration with the pandas-profiling toolkit. Pandas-profiling provides exploratory data analysis on steroids by generating HTML reports that have distributions of data, numbers of unique values, correlations, and many other things. If you haven’t heard about pandas-profiling, here’s a link that you have to go and visit now, because it might greatly help in discovering new correlations in your data set and having your datasets documented on a regular basis.
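Here is a minimal sketch of Great Expectations’ explicit, Python-first check style, using the SparkDFDataset wrapper that was available around the time of this talk; the column names and sample data are hypothetical.

```python
# Minimal sketch of Great Expectations checks on a Spark DataFrame,
# using the legacy SparkDFDataset API. Column names are hypothetical.
from great_expectations.dataset import SparkDFDataset
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a@example.com"), (2, None)], ["id", "email"])

ge_df = SparkDFDataset(df)
print(ge_df.expect_column_values_to_not_be_null("email"))  # completeness
print(ge_df.expect_column_values_to_be_unique("id"))       # uniqueness
```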
The third tool I would like to tell you about is called Apache Griffin, which may be the most enterprise-oriented tooling, with a user interface. It’s an Apache top-level project and it has been backed by eBay since 2016.
It’s not as easily embeddable into existing applications, because you have to run a separate Spring Boot based process, and it requires a JSON DSL for configuration that has to be deployed somehow. But still, it has a user interface, and having no user interface sometimes is worse than having some user interface. So be aware of Apache Griffin; essentially it might allow you to combine Spark SQL with other DSLs. So you’ll have to learn another DSL anyway with any of these frameworks.
But now I want to go through a couple of examples of how typical DSLs for these toolings look. In this very first example of a completeness check, we want to catch the columns that have null values. So in the code at the top left, which is the Python implementation of Deequ, you may see it follows a builder pattern.
So if you are coming from a software engineering background, and if you love Spark DataFrames, Deequ might be the easiest to write checks with, because it follows the Spark DataFrame ideology and it’s builder-friendly. It’s software-engineering friendly, so you’ll love it if you are coming from that background. If you’re coming from a Python background, have worked a lot with the pandas code base, and want very explicit method names, Great Expectations is the framework for you. Of course, you can write any data quality framework from scratch in PySpark or Scala DataFrame syntax completely by yourself. Of course you can; do you want to do it? Probably not, but you can. And of course you can do data quality checking with simple SQL. So data quality is simple monitoring that may scale to thousands of different checks in a typical enterprise, so make sure you have a sound framework for it, or take the best parts from the existing frameworks.
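To illustrate the builder style, here is a minimal sketch of a completeness check in PyDeequ, followed by the plain-SQL equivalent; it assumes the Deequ JAR is on the classpath, and the table and column names are hypothetical.

```python
# Minimal sketch of a completeness check in the Deequ builder style via PyDeequ,
# plus a plain-SQL equivalent. Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")

check = Check(spark, CheckLevel.Error, "completeness")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.isComplete("id"))   # no nulls allowed in "id"
          .run())
VerificationResult.checkResultsAsDataFrame(spark, result).show()

# The same idea with simple SQL:
spark.sql("SELECT COUNT(*) AS null_ids FROM events WHERE id IS NULL").show()
```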
The other example is about uniqueness checks. Uniqueness checks are quite similar to completeness checks, but they check the number of records in a data set against the number of unique records within a column of that data set, and that’s the metric. Of course, it’s very easy and transparent to do with Deequ, and it’s similarly easy to do with Great Expectations, and with SQL in Spark, let’s say.
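Here is a minimal sketch of that uniqueness metric computed directly in PySpark; the table and column names are hypothetical.

```python
# Minimal sketch of a uniqueness check: compare the total number of records
# with the number of distinct values in a column. Names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")

counts = df.agg(
    F.count("id").alias("total"),          # non-null values of "id"
    F.countDistinct("id").alias("distinct"),
).first()

uniqueness = counts["distinct"] / counts["total"]  # 1.0 means no duplicates
print("uniqueness of id:", uniqueness)
```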
But then once you start employing slightly more advanced checks, like “alert me if the sum of this column is greater than the sum of that other column in this data set,” it might be a bit more difficult to do in frameworks other than Deequ, so keep that in mind. Another type of metric you have to be looking at frequently is timeliness of data. Essentially, timeliness of data is the number of seconds between the current time and the event time that happened at the lowest level of granularity in the source system. Usually your data travels through many different steps in your pipeline, and you just have to subtract the event time, as it happened at the edge, from the current time, and output it as the number of seconds of processing delay. Make sure you monitor that, because this is the simplest and the most effective way of knowing whether you are late with processing your data or not.
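Here is a minimal sketch of that timeliness metric in PySpark; the table name, the event_time column and the SLA threshold are hypothetical.

```python
# Minimal sketch of a timeliness metric: seconds between now and the event time
# recorded at the edge. Table, column name and threshold are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")

latency = df.agg(
    F.max(
        F.unix_timestamp(F.current_timestamp()) - F.unix_timestamp("event_time")
    ).alias("max_lag_seconds")
).first()["max_lag_seconds"]

ACCEPTABLE_LAG_SECONDS = 3600  # hypothetical SLA threshold
if latency > ACCEPTABLE_LAG_SECONDS:
    print(f"ALERT: pipeline is {latency} seconds behind, SLA breached")
```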
I’d like to mention a couple of other frameworks that were not featured in this presentation. One of them is called Drunken Data Quality, or DDQ. It’s pretty fluent in syntax, it may be even better than Amazon Labs Deequ, but it’s no longer maintained. The other upcoming data quality framework is called Dataframe Rules Engine from Databricks Labs; it’s purely Scala-oriented, it didn’t have lots of contributors recently, and it’s pretty fresh, only about nine months old at the time of this presentation. So you might look at those two as well to gather inspiration or make a fork. And once again, make sure to re-watch the Make Reliable ETL Easy on Delta Lake talk to get more insights on how to deal with record-level data quality in an easy and reliable way.
Thank you. Please leave your feedback in the form below. It was a pleasure delivering this talk to you. This was Serge from Databricks. Thank you.

Serge Smertin

Serge Smertin is a Resident Solutions Architect at Databricks. Over his more than 14-year career, he’s been dealing with data solutions, cybersecurity, and heterogeneous system integration. His track r...