Automated Testing For Protecting Data Pipelines from Undocumented Assumptions


Untested, undocumented assumptions about data in data pipelines create risk, waste time and erode trust in data products. Automated testing has been one of the biggest productivity boosters in modern software development and is essential for managing complex codebases, yet data science and engineering have been largely missing out on it. This talk introduces Great Expectations, an open-source Python framework for bringing data pipelines and products under test. Like assertions in traditional Python unit tests, Expectations provide a flexible, declarative language for describing expected behavior. Unlike traditional unit tests, Great Expectations applies Expectations to data instead of code. We strongly believe that most of the pain caused by accumulating pipeline debt is avoidable.

We built Great Expectations to make it very, very simple to:

  1. Set up your testing framework early
  2. Capture those early learnings while they’re still fresh
  3. Systematically validate new data against them.

It’s the best tool we know of for managing the complexity that inevitably grows within data pipelines. We hope it helps you as much as it’s helped us.

Main takeaways:

  • This talk will teach you how to use Great Expectations to get more done with data, faster
  • Save time during data cleaning and munging.
  • Accelerate ETL and data normalization.
  • Streamline analyst-to-engineer handoffs.
  • Monitor data quality in production data pipelines and data products.
  • Simplify debugging for data pipelines if (when) they break.


Video Transcript

– Hi, I’m Eugene Mandel, one of the core contributors to Great Expectations and Head of Product for Superconductive, the company behind Great Expectations.

Automated Testing For Protecting Data Pipelines from Undocumented Assumptions

And today we will be talking about how not to lose time and, even worse, not to lose the trust of the colleagues and users who rely on your data, to a particular monster: pipeline debt.

Here’s the agenda. First we will talk about the concept of pipeline debt: what it is and how it gets created. Then we will introduce Great Expectations. And then we will go over how you can get started with Great Expectations.

So what is this pipeline debt we want to beat? It is technical debt in data pipelines, which mainly accumulates when tests and documentation are missing. Let’s take the story of a fairly typical data pipeline.

Let’s say that your company starts a data team.

Your data pipeline

In the beginning it’s probably one data scientist, and you’re it. You find some useful datasets, maybe the log events that come off the servers of your product, maybe your CRM. You find them useful, so you start building a pipeline that takes them, ingests them, processes them, combines them and produces something extremely valuable, maybe an analytical table, maybe a dashboard. You show it to other people. Everybody is happy and everything is very good. Now, encouraged by your initial success, you get a bit bolder. You notice that, for example, the analytical table your pipeline outputs can be very useful to train a model that improves some business process. So you do just that, and your pipeline just grew in depth.

Then another organization in your company, another team, sees your success and wants to replicate it. They find their own datasets and build a parallel pipeline. Now the system has grown both in depth and in breadth. And then the other team and your team start talking. You discover that you can benefit from some of their data and they can benefit from yours. Great. You create all those new links between your pipelines and the two pipelines become one.

And everything’s great. You deliver value, you deliver results, everybody is happy. But this picture shows that a data pipeline really wants to be a hairball. And unlike what regular software engineering calls spaghetti code, this is not a bug. This is the natural state of data pipelines. This is how they become valuable.

You do deliver value. But you have this fear that if you want to refactor your pipeline, by touching one thing you will create some breakage somewhere else. And debugging is a problem as well. Let’s say that you found a problem in one of the nodes. Well, if you’re lucky, you found the problem. If you’re unlucky, it’s your execs or your users who found the problem. How do you debug it?

You have to trace back from that node through every node in the pipeline that this node depends on. This robs you of time and, much worse, it robs you of the trust of the users who rely on your data. So what is pipeline debt?

What is pipeline debt?

It is when your data pipeline is undocumented, untested and, as a result, unstable. If you tell this story to a software engineer, they will probably say that it’s a solved problem. The solution is automated testing.

Solution: automated testing

Unit tests, integration tests, system tests and CI. And that’s correct. Data pipelines are software, and all the best practices of testing code apply. But data pipelines, in addition to untrusted code that needs to be tested, bring in another untrusted entity: data. And testing data is different from testing code in two very important ways. First, you control your code: if it fails your tests, you can just change it. Data you do not necessarily control. It might be coming off some data-generating process that you can observe but cannot change. The second way is that in code testing, your tests really specify correct behavior: if the code fails the test, the code is wrong. In data testing, it’s not as simple. Data is almost like a natural phenomenon that comes off a data-generating process. When you use it for a particular purpose, you make assumptions about it in your pipeline, and you can verify whether those assumptions make this data fit for that purpose.

So now that we have introduced the monster, pipeline debt, let’s talk about how you can beat it and how Great Expectations helps.

Great Expectations is an open source project for testing and documenting data that helps fight pipeline debt. It started as a nights-and-weekends labor of love of the two original authors, Abe Gong and James Campbell. It was publicly launched about two years ago, and for the past year it has been under extremely active full-time development backed by a whole team. It is the most popular open source library for data pipeline testing. And it has a growing and very active community of users and contributors, both on GitHub and on Slack.

The core concept of Great Expectations is unsurprisingly an expectation. An expectation is a declarative statement that describes a property of a dataset.

Describe expected behavior

Let’s take an example. You have a column of values. The column is called temp_f, and here’s what you know about it: it contains indoor temperature readings that come from some sensor, and you want to use it in your pipeline. You start by describing the expected behavior for the data in this column. For example, values in this column should be between 55 and 90, at least 95% of the time.

You know that to be true because you know where this data comes from. All of this is done in human language, and Great Expectations helps you facilitate this communication about data: human to human, human to machine, and machine to machine. Let’s see how.

Declarative language

You can express the same expectation using the declarative language of Great Expectations. Let’s go over it line by line. The expectation type is expect_column_values_to_be_between, and in the next several lines you see the arguments of this expectation. It applies to the column temp_f, the minimum acceptable value is 55, the maximum is 90, and the mostly argument is extremely helpful for implementing some fuzzy logic. Mostly 0.97 means that even if up to 3% of values in this column are noncompliant, you should not fail the whole column. This is enough for communicating between a human and the machine. But you don’t want to lose the context of why you created this expectation. So for other human users, you can add some comment or color. You can add a note: in this case, this column contains indoor temp readings taken in California during spring and summer. Now we both understand why it makes sense.
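
For illustration, here is a minimal sketch of how that expectation might look in Great Expectations’ declarative configuration format, using the same values as above (the note text is the one from the example):

    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {
        "column": "temp_f",
        "min_value": 55,
        "max_value": 90,
        "mostly": 0.97
      },
      "meta": {
        "notes": "Indoor temp readings taken in California during spring and summer."
      }
    }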

Validate: take the compute to the data

The next step, and this is how we use Great Expectations: when a new batch of data comes in, you can validate this batch against the expectation. The declarative nature of an expectation allows the library to take this declaration and translate it into different compute engines. It can translate it into pandas and validate a pandas data frame on a single computer. It can translate it into Spark and validate a data frame on a Spark cluster. Or it can translate it into SQL using SQLAlchemy and validate a table or query result in multiple databases.
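
As a rough sketch, this is what validating a new batch against that expectation could look like with the library’s pandas-backed API (the CSV file name is hypothetical; SparkDFDataset and SqlAlchemyDataset play the same role for the Spark and SQL backends):

    import great_expectations as ge
    import pandas as pd

    # Wrap a pandas DataFrame so it gains expectation methods.
    batch = ge.from_pandas(pd.read_csv("sensor_readings.csv"))  # hypothetical file

    result = batch.expect_column_values_to_be_between(
        "temp_f", min_value=55, max_value=90, mostly=0.97
    )
    print(result.success)  # True if at least 97% of values fall in [55, 90]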

Expressive and extensible

So what can you actually express using Great Expectations?

Great Expectations comes with a library of a few dozen built-in expectation types.

For example, expect column to exist.

Behind every expectation type in the library there is probably some kind of data horror story: something that went very wrong and that could have been prevented if this expectation had been verified.

Every data scientist I know has their own data horror stories. Here is mine, for example. In a previous company, we would periodically train a system to provide answers to frequently asked questions in customer support. One morning I come to the office and the last night’s training results are off the charts, just amazing. And one short moment of elation is replaced by disappointment when we start digging in. Apparently somebody had swapped the question and answer columns, and instead of training to choose answers based on questions, the system trained itself to solve the apparently much easier problem of choosing an answer based on an answer. Of course this problem was of no particular value to anybody. So here are other examples of expectation types: expect table row count to be between X and Y; expect column values to be unique. I won’t read through all of them, but you can reason about the values of a column, and you can also reason about a column in aggregate. For example, you can use KL divergence to verify that values in a column follow a particular distribution.
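
For illustration, a few of those built-in expectation types as they might be called on a wrapped batch (the file name, row-count bounds, column names, bins and weights below are made up for the example):

    import great_expectations as ge
    import pandas as pd

    batch = ge.from_pandas(pd.read_csv("sensor_readings.csv"))  # hypothetical file

    # Reason about the table as a whole.
    batch.expect_table_row_count_to_be_between(min_value=1000, max_value=100000)

    # Reason about individual values of a column.
    batch.expect_column_values_to_be_unique("reading_id")  # illustrative column

    # Reason about a column in aggregate: does it follow an expected distribution?
    batch.expect_column_kl_divergence_to_be_less_than(
        "temp_f",
        partition_object={"bins": [55, 65, 75, 90], "weights": [0.3, 0.5, 0.2]},
        threshold=0.1,
    )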

One of the core realizations that we came to at Great Expectations is that data testing and data documentation are two sides of the same coin. Everybody understands that documenting data is very important, yet it’s extremely challenging to keep your data documentation up to date.

What Great Expectations does for that is take a suite of expectations, a group of expectations that describes a particular dataset, and render it into HTML.

Your tests are your docs

Then this HTML can be automatically deployed as a static website that your team can use as a source of truth for communicating about what data you have and what it should look like.
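
As a sketch, assuming a Great Expectations project (a DataContext) has already been configured, rendering the suites into that static site might look like this:

    import great_expectations as ge

    # Load the project configuration (great_expectations.yml) and
    # render expectation suites and validation results to a static HTML site.
    context = ge.data_context.DataContext()
    context.build_data_docs()

    # Roughly equivalent CLI command: great_expectations docs build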

Setup and Configuration

Great Expectations also comes with a lot of convenience features.

A CLI that helps you deploy it, and a lot of modules around it. It’s very typical that teams that find Great Expectations have at first implemented their own library for validating data, but then they discover that solving the problem requires more than a library. You have to answer questions like: how do you create those tests, how do you deploy them, how do you store results, how do you communicate about those results? So Great Expectations comes with default answers to most of those questions while remaining flexible enough for you to provide your own answers.

Let’s talk about particular flavors of data problems, of pipeline debt, that Great Expectations can help solve. I’ll just talk about a couple of examples. So first, Data Drift. Let’s say you have some kind of numeric value and your data pipeline processes new batches daily or hourly. Your pipeline has an expectation, an assumption, about what the useful, acceptable range of values in this column is. But slowly over time it keeps drifting or changing, and at some point it goes outside this assumed range and the pipeline might produce results that make no sense. You can create an expectation for that. You can just say, “Well, I expect these values to be between X and Y.” You can create expectations about its mean, or standard deviation, or any other property of the distribution.
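
As an illustrative sketch, drift guards for a numeric column might combine value-level and aggregate expectations like these (the file name, column name and all bounds are invented for the example):

    import great_expectations as ge
    import pandas as pd

    batch = ge.from_pandas(pd.read_csv("daily_batch.csv"))  # hypothetical file

    # Guard the acceptable range of individual values...
    batch.expect_column_values_to_be_between("purchase_amount", min_value=0, max_value=10000)

    # ...and the aggregate properties that drift tends to move first.
    batch.expect_column_mean_to_be_between("purchase_amount", min_value=50, max_value=150)
    batch.expect_column_stdev_to_be_between("purchase_amount", min_value=5, max_value=60)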

Another example is an Outlier.

Pretty much everybody in a machine learning class has written a model that predicts house prices based on some kind of dataset.

This model probably used the number of bedrooms and the number of bathrooms among its features. If this model is asked to predict the price of a mansion with 20 or 30 bedrooms and bathrooms, it should be smart enough to say “I just don’t know”, because it’s an outlier.

Now, machine learning models are of course notorious for not saying what they don’t know. They will just predict, and they will predict something nonsensical. You can create an expectation for that. You can say that the reasonable range of values for the number of bedrooms and bathrooms is between one and five or ten, but not 20 or 30.
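
For example, a sketch of guarding the feature ranges the house-price model was trained on (the file name, feature names and bounds are illustrative):

    import great_expectations as ge
    import pandas as pd

    listings = ge.from_pandas(pd.read_csv("house_listings.csv"))  # hypothetical file

    # Flag inputs far outside the training distribution before scoring them.
    listings.expect_column_values_to_be_between("bedrooms", min_value=1, max_value=10)
    listings.expect_column_values_to_be_between("bathrooms", min_value=1, max_value=10)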

Another example is an Outage. The two previous examples of data problems were mostly concerns of data scientists who think about what the values should be; this one is more of a concern to data engineers. Let’s say you have a data pipeline that processes daily usage statistics of your product.

What will it do if it runs one day and finds no new files? It might just keep running happily and spit out the result that today your product got exactly zero new users, without understanding that something is wrong. You can create expectations for that.
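
A minimal sketch of that guard, assuming the daily batch is loaded into a pandas DataFrame (the file name is hypothetical):

    import great_expectations as ge
    import pandas as pd

    usage = ge.from_pandas(pd.read_csv("daily_usage.csv"))  # hypothetical file

    # Fail loudly if today's batch is empty instead of reporting zero new users.
    result = usage.expect_table_row_count_to_be_between(min_value=1)
    assert result.success, "No usage rows arrived today; possible upstream outage"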

There are many other examples of what can go wrong. Of course bias in datasets is extremely important, and you can create expectations for that too.

How you can get started with Great Expectations.

How can I get started?

Check us out on GitHub. It has the list of releases and a basic explanation. And if you check out our GitHub, leaving us a star is highly appreciated.

From there you can go and read the documentation. Literally in the last several weeks we have been going through a very serious effort of updating our documentation to make it useful: to make it very clear what you can use it with, what the workflow is, and how to configure everything. And of course, visit our Slack. We have a very active community where you can ask the contributors questions, talk to other users, and stay up to date.

And of course you can try Great Expectations. Just install it from pip or from conda, run the “init” command, connect it to some dataset, and see what it does.
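
For reference, those install-and-init steps look roughly like this from a shell (the conda package is on conda-forge):

    pip install great_expectations
    # or: conda install -c conda-forge great_expectations

    great_expectations init   # scaffolds a project and walks you through connecting a dataset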

Thank you for listening. I will be very happy to answer any questions and I will be extremely happy to hear your own data horror stories, which I’m sure you have as well.

About Eugene Mandel

Superconductive

Eugene Mandel is Head of Product at Superconductive and a core contributor to the Great Expectations open source library. Prior to Superconductive, Eugene led data science at Directly, was a lead data engineer on the Jawbone data science team, and co-founded 3 startups that used data in diverse fields - internet telephony, marketing surveys and social media. Eugene's core interest has been turning data into real products that make users happy.

About Abe Gong

Superconductive

Abe Gong is a core contributor to the Great Expectations open source library, and CEO and Co-founder at Superconductive. Prior to Superconductive, Abe was Chief Data Officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe has been leading teams using data and technology to solve problems in health care, consumer wellness, and public policy for over a decade. Abe earned his PhD at the University of Michigan in Public Policy, Political Science, and Complex Systems. He speaks and writes regularly on data, healthcare, and data ethics.