Democratizing Data Quality Through a Centralized Platform

May 27, 2021 03:15 PM (PT)


Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.

At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:

  • Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal 
  • Performing data quality validations using libraries built to work with Spark
  • Dynamically generating pipelines that can be abstracted away from users
  • Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
  • Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
In this session watch:
Yuliana Havryshchuk, Developer, Zillow
Smit Shah, Senior Software Engineer, Big Data, Zillow

 

Transcript

Smit Shah: Hello, everyone. My name is Smit Shah. Along with me I have my colleague, Yuliana Havryshchuk, and we both are from Zillow. Our topic for today is Democratizing Data Quality at Zillow through a centralized platform. Data quality is right now one of the most talked-about areas in the industry, and everyone is working on it to make sure they're monitoring the health of the data that they use, and ensuring they're consuming accurate and trustworthy data in order to make their business decisions. We at Zillow wanted to democratize this whole data quality process, and we were able to achieve that via a centralized platform. Hopefully, you are able to learn something from our talk, and we are super excited to share our findings.
Just wanted to update everyone: if you have any questions throughout this talk, please make sure to use the Q&A feature of the video conferencing app. Okay, so let's begin. As I said, we are from Zillow. But within Zillow, we are part of the Data Governance Platform Team. Our team is responsible for the platforms and processes around data governance. There are four pillars to this, which are data discovery, data quality, data security, and privacy. This talk covers topics around the data discovery and data quality pillars. Let me introduce the team members. My name is Smit Shah, and I'm working as a Senior Software Development Engineer in the Big Data field. As I said, along with me I have my colleague, Yuliana Havryshchuk, and she's working as a Software Development Engineer in the Big Data field.
Just so you know, all this work that we are presenting is work done by our Data Governance Platform Team, and we both have the opportunity to present it today in front of you. Our agenda for today is going to be: what is Zillow, the data quality challenges that we faced, and how we are solving them using the Centralized Data Quality Platform. Within that we are going to cover the architecture, the self servicing part, and pipeline integration. At the end, we are going to wrap it up with the key takeaways from this whole talk. Okay, so let's begin. Let's talk about Zillow. As many of you might not know, Zillow is an online real estate company which is reimagining real estate to make it easier to unlock life's next chapter.
Now, how is this possible? We offer customers an on-demand experience for selling, buying, renting, and financing with transparency and nearly seamless end-to-end services. This is possible with all of our brands and businesses within the Zillow Group umbrella. We are right now the most viewed real estate website within the United States, and our investor relations numbers as of Q4 2020 also show that. We had around 201 million average monthly users and 2.2 billion visits. As of now, we have more than 135 million homes sitting in our database. So, you can understand why data quality is important to us.
Let's go to the next topic, which is the various data quality challenges. Within data quality challenges, I wanted to emphasize why we want to monitor data quality. As I gave you an example about Zillow, we are a real estate company. Here is an example of one of the home details pages on our website. Over here you see visual information around the number of bedrooms, the number of bathrooms, the square footage, the address of the home or the home information, the location, and other information around the Zestimate. Then if you scroll further down, you will see information around price history, which is the public records details. We also allow you to connect to various agents for buying or selling homes, plus neighborhood information and various other details if you go visit our website.
As you see, this data fuels many of our customer-facing as well as internal services at Zillow, which rely on high quality data. Going through some of the services, a few of our important ones are: Zestimate, which is used to estimate the price of a home using the various information that we have, and gives you an example of what your home might be worth. The next is Zillow Offers, which is used to give you a price so that if you want to sell your home, we can buy the home from you and the whole transaction will be hassle free, and you don't have to go through any of the buying or selling process. These are two of our main products.
Apart from that, we have Zillow Premier Agent, which allows you to connect to various agents for buying or selling homes. Another of our teams, the Economic Research Team, uses all this information, does various market analyses, and publishes those analyses on our website. As you see, data is very important to us. Any form of bad data will lead to bad decisions, and overall a broken customer experience. Now, some of these services rely on various AI and ML based systems, so we want to make sure all the data that is used by those ML models for predicting the price of homes or recommending you various homes is also accurate. This overall makes sure we are showing you relevant information. We also have to monitor the performance of our systems which are based on AI and ML.
Let's go through the various challenges that we faced overall when we were building this whole platform. Before we go much deeper into the challenges, let me introduce four main terminologies that we will be using throughout this talk. The very first one we call producers. Producers are anyone who generates the data, which is pretty straightforward. The second is consumers; consumers are anyone who uses the data which is produced by your producers. The third concept is pipelines, which are the ETL processes used by the producers to do transformations on the data. Fourth is the data health metrics. Now, this can be any form of health metadata about your table, data date, or column based on your business requirements, that your producers produce or that your consumers consume.
For example, let's say a column within your table needs to have a certain range of values, or a column cannot be null, and so on, so forth. Okay, so let's go back to the challenges that we were facing. The very first challenge was that there was no standard way to monitor data quality. Now, over here when I say standard, there are two types of standards. The very first one is that there was no standard validation code, or process, or technology, that our producers could leverage to evaluate data metrics and incorporate in their data pipelines. What ends up happening is each team within our company ends up spending resources to build such reusable libraries from scratch, or adopting different open source tools, and everyone ends up reinventing the wheel within the company. This leads to a lot of wasted time and energy.
The second one is that there were no agreements between the producers and the consumers on what an acceptable level of data health is for the data that the producer is producing and the consumer is consuming. What it ends up being is that consumers do not know whether the data that was produced is good quality or not: can I use this in my downstream pipelines or to make any business decisions? The second problem was lack of visibility into data health. Let's say a data standard even exists. But all these validations, all this monitoring of the health of your data, were happening in the producer's process, in their logs or in the systems that they run, and they were not getting surfaced or shared with the consumers.
Overall, it becomes very difficult for consumers to have visibility into the health of the data that they are consuming. It can be, for example, for any given data date, or for the whole range of data dates that they are consuming. They are, again, not able to make any strong or accurate decisions before they start consuming the data. The last one is no known lineage between data and processes. As you know, data is generated by a process, and there is not always a single process that generates that data. Within an organization, data goes through multiple ETL processes and different pipelines. You will end up combining data from multiple sources, so it is very important to have this holistic view of all the data as it goes through all these different processes.
This ends up helping you figure out, if there are any problems within a specific part or specific process in your pipeline or in your data, where they started from, and if something is happening somewhere, who your impacted upstream or downstream consumers are. These were the three main challenges, and our team wanted to overcome them. The answer was pretty straightforward: a Centralized Data Quality Platform. Okay, so let's go more into what a Centralized Data Quality Platform is, and what the important pieces are that are needed for it. We came up with five pillars for our data quality platform; let me go through each one of them. The very first one is standardizing data quality rules.
As I talked about the concepts of producers, consumers, and data pipelines, this is where standardizing those rules comes in: producers defining certain sets of checks that they are planning to perform on the data that they are generating. Let's say, for example, they are generating the data every day, every hour, whatever the cadence is. Also, which types of checks they consider to be a breaking check versus a warning check. Let's say, for example, if all the records are now … any one of the records might be a breaking check for a specific column, but it might be a warning check for another column. Based on these rules, the producers work with the consumers, and both of them agree on all these sets of checks and the thresholds which are set.
This becomes a formal contract between your producers and consumers that they both agree on. Now, once the contracts are defined and the producer teams are producing the data and doing the validation, the second problem that we talked about was increasing the visibility of the data health. Producers are using some standardized tools to evaluate those checks for any given data date. They now need to surface those results, which we call the health metrics of the data, through a centralized UI. It becomes visible to all the consumers, and it also becomes visible to the producers as well. This is another pillar.
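
As a rough illustration of the kind of producer/consumer contract described here, the sketch below uses a hypothetical structure and field names (the table comes from the demo later in the talk); it is not Zillow's actual contract schema.

```python
# A minimal sketch of what a producer/consumer data quality contract could
# capture. All field names and the structure are illustrative.
contract = {
    "dataset": "service.property_details",   # hypothetical Hive table from the demo
    "owner": "data-governance-platform",
    "cadence": "daily",                       # how often the data is produced/validated
    "checks": [
        {
            "column": "id",
            "type": "null_check",
            "threshold": 0.0,                 # no missing IDs allowed
            "severity": "breaking",           # block downstream use on failure
        },
        {
            "column": "property_type",
            "type": "allowed_values",
            "values": ["condo", "land", "townhouse"],
            "severity": "warning",            # surface, but do not block
        },
    ],
}
```
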
Now, the third one is enabling safe evolution of rules. This one is tied to the first pillar, standardizing data quality rules. But, as you know, in an agile process, when the requirements keep on changing, your contracts which were defined in pillar one will also keep changing. There will always be a case where the producer or the consumer might have to update some parts of the contracts, and we want to make sure these contracts are updated in such a way that they do not impact other consumers. Let's say, for example, there was a specific column and the consumer had a requirement that the percentage of [inaudible] should be less than 10%. But now if something needs to change, they both work cohesively and make those changes.
The fourth pillar is supporting built-in alerting. As you know, we are talking about data quality and monitoring the data health metrics, but what if something fails? What if some part of your contracts is breaking? The next step is that you want to alert people. As part of your platform, you want to make it easy to provide built-in alerting that the pipelines can leverage. As part of your organization, you can support different alerting mechanisms like email, messaging, or whatever your on-call systems are. Within those alerts, you can add more content related to what was failing, what the expected value was, and what the observed value was.
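
As a hedged sketch of the kind of context such an alert might carry, the helper and its fields below are hypothetical, not the platform's actual alert format.

```python
def format_alert(check: dict, expected, observed) -> str:
    # Hypothetical helper: build a human-readable message for a failed check.
    return (
        f"[{check['severity'].upper()}] {check['type']} failed on "
        f"{check['dataset']}.{check['column']}: "
        f"expected {expected}, observed {observed}"
    )

# Example: a breaking null check that allowed no nulls but observed 12% nulls.
print(format_alert(
    {"severity": "breaking", "type": "null_check",
     "dataset": "service.property_details", "column": "id"},
    expected="0% null values",
    observed="12% null values",
))
```
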
Now, the last pillar is integrating with data lineage. As I said, we need to have a holistic view of all the data and the processes that are within your organization and within your flow, so you need to have this whole picture. In this case, it can also include all the downstream consumers, for example, some BI reports which are created, so you know what the impacted reports are as well. Using these five pillars of the data quality platform, we came up with our platform architecture, so let's go over that. This has a lot of information right now. This is something that we at Zillow have built, and it is also evolving over time, but this is what we have as of now. I'm going to break this down into six parts, so it will be much easier for everyone to digest this information and understand it much better.
Let's go to part one. This is the basis of the centralized data quality platform. Each organization needs to have a clear understanding of all the different entities that exist within the organization. An entity over here can be all the schemas, tables, and columns that exist, which is basically all the data that exists within your organization, plus all the different processes that exist within your organization. There are a lot of BI based reports that are created using this data, so we are going to treat even those reports as entities. Now, all this information needs to be crawled by whatever system you want to use, scanning them on a specific cadence. You need to store all this information in a centralized database; we store this in a database called the Data Catalog, which is at the bottom left.
This forms the basis for all the metadata of our organization. Now, once you have all the metadata, you want to surface this metadata to all your employees, or all your users within your company, through some form of UI. That's where we have built a Data Portal UI, which acts as our data discovery tool. It makes it easier for our users to search for anything and is able to show them all this metadata information from our catalog. In order to make our search better, we have added a layer of Elasticsearch to make it much faster.
This is the main basis, and using all these entities, we are going to start creating our data quality jobs. This is the starting piece to help us build the whole lineage and everything. The next step is, now a user on a team wants to go and onboard a data quality job. What we have done is create a very seamless UI for our users, where they can come to this data discovery UI and just submit a few pieces of information in our form, and go ahead and schedule a data quality job. We are going to cover more about what information goes into these forms later. But just know that all this information within the form gets stored in our centralized config store, where you can define a lot of information related to your contracts or the data quality monitoring job that you want to run.
This can also include the specific cadence that you want to run, the types of checks that you want to do, what table or query you want to monitor, who the owner is, and whether you want to add any kind of upstream dependencies, and so on, and so forth. Once the user has provided all these configurations and submits the job, the next step is orchestrating those jobs and creating those validation jobs for the user on the fly. This overall helps all of our team members not spend a lot of time figuring out how to create a pipeline, how to deploy it, and so on, so forth. This becomes a very integral part of our data quality platform, that we allow anyone in our company to come and onboard a job.
Right now we support two types of orchestration, or two types of validation. One is for [inaudible] jobs, for which we use our in-built [inaudible] services, which take all this information from the user, convert it into an Airflow DAG, and deploy it in a specific environment. That DAG is responsible for doing all the validation and all the downstream processes. For streaming data, we currently have use cases around Kafka and Graphite that we are working on. Let's go to the next step, which is the validation process. Once those jobs are created, the main goal of those jobs is to run the validations which are defined as part of the contracts. As I said, we currently support two types of checks: rule based checks and AutoML checks.
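
The dynamic DAG generation described here can be sketched with standard Airflow constructs; the config contents, the run_validation callable, and reading configs from a hard-coded list (rather than the centralized config store) are all assumptions for illustration, not Zillow's actual generator service.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# In practice these configs would come from the centralized config store;
# they are hard-coded here for illustration.
job_configs = [
    {"job_id": "dq_property_details", "schedule": "@daily", "owner": "web-data"},
    {"job_id": "dq_listings", "schedule": "@hourly", "owner": "listings-team"},
]


def run_validation(job_id, **_):
    # Placeholder for the Spark-based validation step.
    print(f"running data quality validation for {job_id}")


for cfg in job_configs:
    dag = DAG(
        dag_id=cfg["job_id"],
        schedule_interval=cfg["schedule"],
        start_date=datetime(2021, 1, 1),
        catchup=False,
        default_args={"owner": cfg["owner"]},
    )
    PythonOperator(
        task_id="validate",
        python_callable=run_validation,
        op_kwargs={"job_id": cfg["job_id"]},
        dag=dag,
    )
    # Airflow discovers DAGs by scanning module-level globals.
    globals()[cfg["job_id"]] = dag
```
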
Within the rule based checks, we are going to go into more detail about the different checks that we support. On top of that, we support SLAs on the processes as well. For example, if you want to have your job completed within two hours, and it is not, then you want to get notified. We run these rule based checks in our systems, and we also have support for [inaudible]. Now, our AutoML process is a little bit different from the rule based checks, because within any AutoML system there are always two components attached to it: training and scoring. The DAGs that we deploy take care of doing offline training by pulling historical data. They also do a lot of pre-processing and generate the valid models for each of the metrics that need to be evaluated.
The scoring DAG is responsible for actually pulling those models, pulling the current data point that needs to be monitored, and evaluating that specific metric to figure out whether it is an outlier or anomaly or not. All of this processing, again, is supported by Spark, which makes it easier for us to do a lot of distributed processing and provide faster results. Now, once the validation is done, it is pretty obvious that the next step will be storing those outputs. That's part five: storing all these validation results in our centralized database, where all of the services go and write their validation results and make sure the results are tied to the specific task that was onboarded by the user. If a task has multiple metrics, then each of the results is tied to those specific metrics.
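
A deliberately simplified sketch of the training/scoring split: the "model" here is just a mean-and-standard-deviation band rather than the real model suite, and all names and numbers are illustrative.

```python
import statistics


def train(history):
    # Offline "training": learn an expected range from historical values.
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return {"lower": mean - 3 * std, "upper": mean + 3 * std}


def score(model, value):
    # Scoring: flag the current data point if it falls outside the learned range.
    return not (model["lower"] <= value <= model["upper"])


daily_pageviews = [720, 765, 810, 790, 745, 805, 780]   # historical data points
model = train(daily_pageviews)        # persisted to the model store in practice
print(score(model, 790))              # False: within the expected range
print(score(model, 12))               # True: far outside the learned range
```
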
We also use this database to store our model objects, because the scoring job needs to use those objects for scoring. Now, there will be a lot of cases in your organization, and this was a use case within our organization as well, where not all the teams can immediately migrate to your centralized platform. They need their own time and their own bandwidth before they can onboard onto the platform. We also wanted to make sure we support all these external processes, which they still want to run in their own systems and don't want to migrate, or are planning to migrate. We also wanted them to leverage all the other features that are provided by our platform, which are defining the contracts and surfacing those validation results.
All these external processes need to do is submit the contract in the specific format that we want, and do the evaluation in their own systems. They can still leverage the libraries that we provide, and once they do the evaluation, they submit those results to our databases. From here, once the outputs are written, the next step is straightforward: alerting, whenever there is an outlier, an anomaly, or any of your contracts are breached. We didn't want anyone to go ahead and create their own alerting system, so we went with creating a centralized alerting service, which is part of our platform. This alerting service is responsible for figuring out the contract, the results, who the producers and the consumers are, and who all needs to be alerted within this pipeline.
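
As a rough sketch of how an external pipeline might report results back to the centralized store; the endpoint URL, payload shape, and absence of authentication are all hypothetical, since the talk only says results are submitted in a required format.

```python
import requests

# Hypothetical payload: one validation result for one metric on one data date.
result = {
    "task_id": "dq_property_details",
    "data_date": "2021-05-03",
    "metric": "null_pct(id)",
    "expected": "<= 0.0",
    "observed": 0.12,
    "status": "FAILED",
}

# Hypothetical internal endpoint for the centralized results database.
resp = requests.post(
    "https://data-quality.example.internal/api/v1/results",
    json=result,
    timeout=10,
)
resp.raise_for_status()
```
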
We currently support sending alerts via Slack, email, and our on-call rotation services. Now, once a user is alerted, the main part is surfacing those health metrics in the centralized UI. That's where we come back to our Data Portal UI. Remember, we talked about entities, for example, a table that we want to monitor. Let's say we were monitoring that table, we had gone through all this process, and the results are stored. You want to tie the health metrics of those checks back to that entity again. What we do is surface those results back to the UI, and we provide a very detailed page, which we'll show you later in our slides as well. This covers all six parts of our main architecture. I'm going to hand it over to my colleague, Yuliana, to go into more detail on some of these components.

Yuliana Havrysh…: All right. Thank you, Smit, for walking us through the platform architecture. Next, I'm going to talk about how all of those components are used in a self service way from the user's point of view. We decided to build out the self service experience for the platform for a few different reasons, and with a few goals in mind. The first goal is for the platform to be scalable, and there are two ways in which it should be scalable. First of all, obviously, it should be computationally performant: it should be able to handle the largest datasets at Zillow, which are petabytes of data, without blocking any pipeline for hours.
The second way that it should be scalable is in terms of the number of users that are using it. After we initially released it and more teams wanted to onboard, we realized that our team was becoming a bottleneck in terms of support. Even with extensive documentation, we were fielding questions about our APIs and about our data model during the onboarding process. We wanted to build a self service experience that would free up our team's time to build cool new features, and also empower users to onboard themselves. The second goal for the self service experience is for any type of user to be able to validate data. In the architecture that Smit shared, you saw that we're using Spark, Airflow, FLaNK, Python, and various AWS services.
And so, we wanted to ensure that it's not just data engineers that are already familiar with those services that are able to use the platform, but also scientists, analysts, and PMs, who may not know anything about Python or Airflow, but who know that they don't want to see data quality issues in their datasets. Finally, for all of our users, whether they're data engineers or PMs, we wanted to minimize the onboarding time with a streamlined process. This means minimal configuration: which data do you care about? Which metrics do you care about? How often do you want to validate it? That's really the core of what we want from users. Then all of the configuration settings, like how much memory we're using or which models we're using for validation, they shouldn't have to know any of that. And so, with these three goals in mind, we set out to design our self service onboarding experience.
What I’m showing you right now is the first part of the experience, which is our data discovery tool. We crawl data from our various data stores, and then expose it to users in our data discovery portal. You see here, we’ve crawled a table called property details, and the user could actually come in here and enhance it with more information such as the description and the table owner. If we dive into the columns here, we have ID, the property ID, the name of the property, type, page views, and the listing agent ID. Let’s say I’m an analyst and I want to use this table for some really important reporting. First, I would want to make sure that the ID is not missing, because I’m going to use it for joining this table to other data sets.
I also want to ensure that the type of the property makes sense, since my report has custom logic based on that field. If we look at this table right now, we can see that this sample data has those two issues present. That's generally a rule of thumb: if you've encountered any issues in a dataset that have affected your reporting, we recommend that you turn that into a contract right away so that you can prevent those same issues in the future. Next, after I've identified which table I want to validate and which rules I want to add, I would go to our self service onboarding page to create a new validation job. Here, you see that I've inputted our table name, so this is a Hive table under the service schema called property details. I've told the UI that I'm a producer of this table, so I'm generating the data that I'm then going to use downstream.
Then here, you could see that we added a few of the contracts that we talked about earlier. I’m checking that the column ID doesn’t have any missing values, and then I’m checking that the property type column has one of the allowed values that we talked about. So, condo is okay, land is okay, townhouse is okay. But, it’s not okay for us to see a value like testing, or 123, and this would catch all of those. Then the last check that you see here is to ensure that our table volume is increasing over time. This is a table of historical data snapshots, and so there’s no reason for that to ever decrease in volume.
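
Below is a minimal PySpark sketch of those three contracts (missing IDs, allowed property types, non-decreasing volume). The table name comes from the demo, the previous row count is a made-up value, and the real checks run through the platform's contract library rather than hand-written code like this.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("service.property_details")

# 1. No missing property IDs.
missing_ids = df.filter(F.col("id").isNull()).count()

# 2. Property type must be one of the allowed values.
allowed = ["condo", "land", "townhouse"]
bad_types = df.filter(~F.col("property_type").isin(allowed)).count()

# 3. Table volume should never decrease (the previous count would come from
#    the results store; this value is hypothetical).
previous_count = 1_250_000
volume_ok = df.count() >= previous_count

failures = []
if missing_ids > 0:
    failures.append(f"{missing_ids} rows with null id")
if bad_types > 0:
    failures.append(f"{bad_types} rows with unexpected property_type")
if not volume_ok:
    failures.append("table volume decreased")

if failures:
    raise ValueError("breaking contracts failed: " + "; ".join(failures))
```
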
These are just three of the contracts that we support; in total we support nine different contract types. Some other contract types that we support are duplicate checks, date and timestamp formatting, and custom SQL contracts. Those are probably the most valuable, because a user can enter any business logic that they want to validate that's related to their particular dataset and use case, and we'll be able to run those checks for them. After the validation job runs a few times, we can go to the next slide and go back to our data discovery portal to see the results. Let's say over time I've added 15 different contracts and the evaluations have been executed over 1,000 times; we would be able to see the overall health stats of this table. We see that over that time, it's passed 99.83% of evaluations, which means that we have a less than 0.2% failure rate.
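
The custom SQL contracts mentioned above could be expressed roughly like the sketch below, where the user-supplied query is expected to return zero violations when the data is healthy; the business rule, field names, and format are hypothetical.

```python
custom_sql_contract = {
    "dataset": "service.property_details",
    "type": "custom_sql",
    "severity": "breaking",
    # Hypothetical business rule: sold listings must always carry a sale price.
    "sql": """
        SELECT COUNT(*) AS violations
        FROM service.property_details
        WHERE status = 'sold' AND sale_price IS NULL
    """,
    "expectation": "violations == 0",
}
```
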
If we zoom in to the last couple of days, that's where we see those breaking contract failures. And so, this would indicate to us that something probably changed in the last two days and we should go and look at our query or look at our upstream data sources to investigate the cause of those issues. This portal is really useful for seeing historical data trends. So, if you're deciding which table you want to use and you're comparing two, and one of them has much better evaluation stats or data health stats, that's probably the one that you want to go with. That was our rule based validation, but not all validations have the opportunity to be clear cut with user defined rules.
Let's say I also want to make sure that the number of pageviews in that table makes sense over time for each property. Here, if we're looking at the pageviews for this particular property, we can see that they're usually in the 700s or 800s, but then on May 3rd, for some reason we have 12 of them. This is really hard to make a clear defined rule about, because we would have to do that for every single property, and we would have to keep in mind how views are affected by holidays, seasonality, and neighborhood. That's just not scalable. This is a perfect use case for anomaly detection, which would be able to learn, for every particular property, how the pageviews are expected to look, and incorporate things like holidays and the time of day, and be able to really identify something that just doesn't look quite right, even if we haven't given it a specific range that it has to be between.
When we’re onboarding our anomaly detection checks, we provide similar information. We have to provide which column we want to track, which is the metric. We can also provide a dimension that we want to slice it by. Let’s say that we wanted to know the pageviews for a particular property based on the device that the user is using. We want to slice it by, let’s say, phone, or tablet, or computer, so that we know if a change was made on one of those platforms, whether it affected any of the pageviews. We also see that we could add a model parameter to exclude noisy metrics. This is really helpful if the data has a lot of unique dimensions and a lot of metric columns, and we really want to reduce the number of false alarms to prevent alert fatigue.
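
An illustrative sketch of what such an anomaly detection monitor could capture, based on the options just described; the field names are assumptions, not the actual onboarding schema.

```python
anomaly_monitor = {
    "dataset": "service.property_details",
    "metric": "pageviews",             # the time series column to track
    "dimensions": ["device_type"],     # slice by phone / tablet / computer
    "frequency": "daily",
    "exclude_noisy_metrics": True,     # reduce false alarms and alert fatigue
    "min_anomaly_probability": 0.95,   # only alert on high-confidence anomalies
}
```
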
As you see here, this is a very simple onboarding process for anomaly detection, whereas behind the scenes we're using machine learning models and tuning them, so the user does not have to know anything about machine learning or AI. That's why this is a very self serve experience. We have our own suite of models to train and tune, which are abstracted away from the user. After the anomaly detection runs a few times, the user will be able to see their results back in our data discovery portal. There's going to be an overview section for each of the metrics that we're monitoring, and this provides a holistic view of all the metrics that you're monitoring for each data date. You can also see how that changes over time, and we can zoom in to one of these metrics.
Here, you can see that we have two anomalies. The first one which is a solid yellow dot failed, because the value that we saw was outside of the expected range. Over time, our model learned that the expected range is about 2,200 to 3,300, and this value was 5,300. Based on what it’s learned before, including seasonal trends, this value doesn’t make sense, and so the user got an alert for that. Then a couple of days later, you see another anomaly, and this one is actually a hollow yellow circle. This means that the data for that date is missing entirely, which is another reason for the user to be alerted. You can also see we include the probability of the anomaly, and so the user could use this field to say, “Only alert me if you’re 95% sure that this was an anomaly,” to prevent alert fatigue as well.
After the user tells us which columns or metrics they want to track, let's talk about what happens behind the scenes. This goes back to the platform architecture that Smit walked us through earlier. First of all, the rule-based monitors turn into contracts. These are the well defined checks like no null values, only certain values allowed, no duplicates. All of these are very clear cut, and so they turn into contracts. Whereas the metric time series monitors turn into machine-learning based anomaly detection that uses a different library. And so, all of these different configurations are stored as data quality requirements in a particular format that's compatible with the rest of our services, and they're stored in our config store.
When the user submits this and it's time to actually launch those validation pipelines, we have a service that takes all of the configuration and is able to dynamically generate the pipeline for the user. For batch data, this uses Airflow and runs on our EMR clusters, using the information that the user provided to us, such as the schedule and the ownership information, so we can correctly attribute costs. We abstract the entire pipeline and ETL creation away from the user, so that all they see is the onboarding page that we showed you, and then after it runs, the results page that we showed you. To perform the actual validation step, whether that's the rule-based validation or the anomaly-based validation, we have two in-house libraries to support that.
Before building these, we evaluated existing open source solutions and determined that they didn't meet our needs, based on limitations in their functionality or because they just weren't compatible with the other tools that we're using. The first one is the Luminaire Contract Evaluation library. This is written in Scala and leverages Spark for distributed performance on validations of the huge datasets that I mentioned earlier. We have nine contract models. In addition to validating that there are no null values, we also support thresholds, so users can say that 80% of the values have to be non-null, for example. Then the second library that we have is the Luminaire Anomaly Detection library, and this is written in Python, also leveraging Spark.
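
The contract library itself is Scala, but the threshold idea it describes (e.g. at least 80% of a column's values must be non-null) can be sketched in PySpark; the table and column names below are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("service.property_details")   # hypothetical table from the demo

total = df.count()
non_null = df.filter(F.col("listing_agent_id").isNotNull()).count()
non_null_ratio = non_null / total if total else 0.0

threshold = 0.80   # agreed in the contract: at least 80% of values must be non-null
print(f"non-null ratio: {non_null_ratio:.2%} (threshold {threshold:.0%})")
assert non_null_ratio >= threshold, "contract failed: too many null listing_agent_id values"
```
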
This one provides an easy way for anybody to run sophisticated time series anomaly detection without having any prior machine-learning experience. It comes with additional functionality like data pre-processing, different model suites, and the ability to automatically select a model and then perform hyperparameter optimization based on the specific metrics that you want to track. We actually open sourced this library in the past year, so feel free to check it out and give it a star on GitHub. We have a separate talk during this conference that covers that library in a lot of detail and the open source process that we went through.
Now that we know how the user onboards a new validation task and how we turn that into a pipeline, let's look at the way this has affected all the pipelines that are running at Zillow. This is a very simplified view of what our pipelines looked like before, for a data producer, who is anybody that generates data. Generally, they would do some processing or some ingestion, maybe some transformations, and then write that data to S3, which is our data lake. Then after that, once everything's complete and they're ready to serve that data to anyone who wants to use it or to downstream pipelines, they would add it to the Hive table. Some of our Hive tables are partitioned, and so they would just add a new partition, and in other cases they would overwrite the Hive table entirely with the new dataset.
As for the consumers, first they have to determine whether their new data is ready for use. Let's say, for example, that this dataset gets updated daily, so the user has to be aware of whether that update has completed or not. For this step, they would check it using a few different methods. Sometimes they would use Airflow external task sensors, sometimes they would use S3 prefix sensors. Sometimes they would build out custom solutions with the producer, such as using SNS and SQS. Then once they got the flag that the data is ready for use, they could go ahead and use it in their downstream pipelines. This part wasn't ideal, because there was no standard solution and it was hard for the user to be certain that the data that they're about to use is actually complete and accurate.
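
The readiness checks consumers used to wire up themselves can be sketched with standard Airflow sensors; the DAG, task, and bucket names below are made up, and import paths vary across Airflow and provider versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="consumer_report",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    # Wait for the producer DAG's final task to succeed for the same data date.
    wait_for_producer = ExternalTaskSensor(
        task_id="wait_for_producer",
        external_dag_id="producer_property_details",
        external_task_id="write_hive_partition",
    )

    # Or wait for files to land under the expected S3 prefix for the partition.
    wait_for_s3 = S3KeySensor(
        task_id="wait_for_s3_prefix",
        bucket_name="example-data-lake",
        bucket_key="property_details/ds={{ ds }}/*",
        wildcard_match=True,
        poke_interval=300,
        timeout=6 * 60 * 60,
    )
```
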
How have these data pipelines changed after our users integrated them with the data quality platform? You still see the steps of writing to S3 and writing to the Hive table, but there are a couple of additional steps in between. The first one is actually running the validation that we talked about earlier, and this could be the rules, the contracts, or the anomaly detection. Before we write that data anywhere that's accessible to our users, which in this case is Hive, we want to make sure that there are no errors, to prevent them from using it if that's the case. Once we execute that validation using the two libraries that we talked about in the previous slides, then we ask the question: how did that validation go? Did all of the contracts pass, or was there an issue that was detected?
And so, in the case that there was actually an issue that was detected, let’s say the ID was missing for many different properties, we actually want to stop the pipeline right there and not write to our Hive table until it’s fixed. An engineer, usually a data engineer or the scientist that created that pipeline, would go ahead and look at the data, investigate it. Sometimes they would have to talk to their upstream teams to determine if this is actually expected for some reason, and so the pipeline could be manually continued. Or, in many cases something would actually have to be changed about the data, or it would have to be reprocessed. Then when all the contracts do pass after that reprocessing, only then do we write the new data to the Hive table.
Then the next new piece that we have in our producer pipelines is actually sending an availability flag to the consumers about the new data that came in that day, the new partition or the updated table. We talked before about how the consumers would have to use different types of sensors or have a custom solution, and that wasn't always reliable. Now we require that whenever the producer can assert that the data is of good quality, and that the data is complete and present in the Hive table, they send a signal to their downstream users that, hey, today's data is now available, please feel free to use it. This is completely automated, so the consumer pipeline will actually start off with that step of asking, is the data available? This checks our contract registry, where the availability would have been reported. If a pipeline has set this dependency, it'll keep pinging until it gets that signal. Then it's free to continue with full confidence that the contracts have succeeded, and that the producer certifies that the data is complete.
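
A simplified sketch of that availability handshake: the producer publishes a flag only after its breaking contracts pass, and the consumer polls until the flag appears. The in-memory dict stands in for the contract registry service, and all names are illustrative.

```python
import time

availability_registry = {}   # stands in for the centralized contract registry


def publish_availability(dataset, data_date):
    # Producer side: called only after all breaking contracts have passed.
    availability_registry[(dataset, data_date)] = True


def wait_for_availability(dataset, data_date, poll_seconds=60, timeout_seconds=3600):
    # Consumer side: keep pinging until the producer signals the data is ready.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if availability_registry.get((dataset, data_date)):
            return True
        time.sleep(poll_seconds)
    return False


# Producer pipeline, after validation succeeds:
publish_availability("service.property_details", "2021-05-03")

# Consumer pipeline, before using the data:
assert wait_for_availability("service.property_details", "2021-05-03",
                             poll_seconds=1, timeout_seconds=5)
```
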
Another new feature that we have here is that the consumer could add business specific validations to their pipeline. This could be if the data set that the producer is generating is, let’s say, all of Zillow traffic, there are different teams that care about different subsets of that table. Some teams care about rental properties specifically, some teams care about the Zillow owned properties, and each of them might have different needs that the producer doesn’t necessarily share among all of their consumers. And so, the consumer could actually add additional validation in their own pipeline. Then once that passes and they’re confident that the data that they’re using for their specific use case is valid, then they can proceed to use it in their pipeline or further reports as usual.
What happens when these pipelines are executed, whether they're producer pipelines or consumer pipelines, and we get those validation results? First off, like I mentioned before, if we discover any bad results, any failed contracts, we want to alert the users as soon as possible. Sometimes, if the dataset is really important and there are lots of downstream users waiting on it, not only will the producer be notified so they can start looking into the data immediately, but the consumers can also subscribe to these alerts, so that they at least know that something wasn't as expected, and that while it's being resolved, the dataset might be delayed for today.
The second step that we just talked about is making sure that we're integrating the validation into the full pipeline so that we can prevent this bad data from propagating. It's not enough to just detect, hey, half of the values are missing for some reason today, if we continue to write that to our table and then downstream users continue using it. We actually don't want anyone to see or accidentally use that data, and so we add a step to prevent the propagation of that data. Thirdly, we want to surface those results right away through the data discovery tools that I showed you earlier. This helps users have an accurate picture of not only how the data is doing today, when I'm trying to use it today, but also what it has looked like over time, and whether there have been consistent issues with it over time that make it not fully reliable for my use case.
Finally, these results are used to provide a common understanding between the people who produce the data and the people who consume the data. And so, if everyone is on the same page about what exactly is being validated, whether it fails or not, it’s a lot easier to have that communication between the producer and the consumer. The consumer could say, “Actually, there’s a really crucial thing that I need, can you please add that to your list of validations?” The producer could similarly let the consumer know, “Hey, this new field has been added. I’ve gone ahead and added an additional validation here, and these are the values that you could expect from that field.” It really helps with everyone being on the same page about the data that is being generated and used.
These platforms are already in use, and we're already seeing a lot of benefit over the last few years, actually, and we have lots of exciting work coming up to expand and grow them. First of all, we want tighter integration between the components of our platform. For example, our rule based contracts are currently separate from the anomaly detection part of our platform, and we could get some cool functionality if we integrate them. For example, contracts that would say, if three of these models said that their results were good but the fourth one said that they were bad, we can still choose to proceed based on that, and combine them in more flexible ways.
We also want to expand both of our libraries. For the rule-based contract library, we've recently gotten some requests about adding statistics based contracts like skew, kurtosis, and standard deviation. Overall, we have teams that have different contract requirements who are also free to contribute to our platform. Then for the anomaly detection library, we want to grow it by incorporating user feedback. A user would be able to say, actually, this was not an anomaly, and here's why, and then we would be able to update our models with that feedback. We also want to support non time series based anomaly detection and be able to monitor model performance within machine-learning flows.
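
The statistics-based contracts mentioned above (skew, kurtosis, standard deviation) map naturally onto Spark's built-in aggregations; the thresholds and column name below are illustrative, not actual requirements.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("service.property_details")

stats = df.agg(
    F.skewness("pageviews").alias("skewness"),
    F.kurtosis("pageviews").alias("kurtosis"),
    F.stddev("pageviews").alias("stddev"),
).first()

# Hypothetical thresholds a producer and consumer might agree on.
assert abs(stats["skewness"]) < 3.0, "pageviews distribution is unexpectedly skewed"
assert stats["stddev"] > 0.0, "pageviews column appears to be constant"
print(stats.asDict())
```
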
Thirdly, we want to be able to use the data lineage that Smit talked about earlier to expand our platform from just detecting issues to actually being able to diagnose them. If we have column level lineage, that would make it a lot easier for us to identify the root cause of a particular problem. Sometimes you can see the symptoms of a problem in a dataset, but the actual problem started three datasets back. And so, if we have lineage, it would be able to point us to where the validation should be added, at the most upstream possible step. Finally, we want to extend both of our platforms to support streaming data. Smit showed you this as part of the architecture diagram, and we're currently working on these functionalities.
What are some key takeaways that we've learned as we've built this platform, and also as our users have onboarded to it and used it for collaboration between producers and consumers? First of all, we always want to keep our five pillars in mind as we're growing the platform and making recommendations about its use. The five pillars that we're talking about are: standardization of the rules and of the mechanism by which the rules are validated; and visibility into the results, so that data health is democratized and available for both producers and consumers to monitor. We want to support evolution, because a dataset isn't created in one day and then stays the same for two years. It evolves over time, and so we want to be able to evolve the data rules over time.
We always want alerting top of mind, so that data engineers and scientists are able to resolve the issues as soon as possible, and we can incorporate it into our on-call rotations. Finally, we want to expand our data lineage, because lineage is so important for detecting all the different uses of a dataset, and for being able to identify the root causes of any issues that arise. One of the other takeaways that we've had is that it's so important to alert as soon as possible to allow a proactive response from whoever owns that dataset. A lot of our pipelines have many different pre-processing steps: there's the ingestion if the data comes from a different team, then there can be deduplication, or splitting it into multiple different datasets, or some other cleaning or joining.
And so, we've discovered that we want to add our validation at the earliest possible stage, and then at every stage after that where it makes sense. Because if your pipeline goes through those five different steps and then you discover that there's an issue, that adds time to the resolution, and it also wastes computational resources, because you're going to have to do those five steps over again after you fix the issue. We've also been really happy to see that producing quality data and surfacing these quality metrics increases people's trust in the data that they're using, and lets them feel more confident in the decisions that they're making. One of our user teams let us know that 80% of their data quality issues have gone away since onboarding onto our platform, and they feel a lot better sharing this data with all their consumers and using it for really important decisions.
Finally, this is probably the biggest takeaway, is that data quality is a shared responsibility. This was an interesting technical problem that we’re solving and we’re continuing to work on, but it’s also a pretty big social problem. If teams are producing data but they’re not performing any analytics on it, they’re less incentivized to care about it. And so, we really wanted to make sure that the producers and the consumers can be brought together so that they have this shared common language of how to maintain the data quality. And, understand that they can really collaborate together and they could be looking at the exact same results and the exact same rules and making sure that they’re both contributing to those in order to be successful.
Thank you everybody for attending our presentation about how we're democratizing data quality at Zillow. Please leave questions in the Q&A section and we'll address those as soon as possible. Something I want to call out is that Zillow is growing a lot and doing lots of hiring, and we also have some open positions on our Data Governance team. If you're interested in this work, I definitely encourage you to apply to those positions, and you can also reach out to Smit and me on LinkedIn if you have any questions about our team. Thank you.

Yuliana Havryshchuk

Yuliana is a software engineer on Zillow's Data Governance team. After experiencing continuous data quality challenges in critical business pipelines, she built a proof of concept that grew into a pla...

Smit Shah

Smit is a data and software engineering enthusiast. Currently working as a Senior Software Engineer, Big Data at Zillow where he is building centralized data products and democratizing data quality. H...