ChakraView – A 360° Approach to Data Quality

May 27, 2021 11:35 AM (PT)

Download Slides

Availability of high-quality data is central to the success of any organization in the current era. As every organization ramps up its collection and storage of data, the data's usefulness largely depends on confidence in its quality. In the Financial Data Engineering team at Flipkart, where the bar for data quality is 100% correctness and completeness, this problem takes on a wholly different dimension. Today, countless data analysts and engineers work to find issues in the financial data and keep it that way. We wanted to find a way that is less manual, more scalable and cost-effective.


As we evaluated various solutions available in the public domain, we found quite a few gaps. 

  1. Most frameworks are limited in the kinds of issues they detect. While many detect internal consistency issues at the schema and dataset level, none detect consistency issues across datasets or check for completeness.
  2. There is no common framework for cleaning and repairing data once an issue has been found.
  3. Fixing data quality issues requires the right categorization of the issues to drive accountability with the producer systems. Very few frameworks support categorization of issues and provide visibility to the producers.


In this presentation, we discuss how we developed a comprehensive data quality framework. The framework has been developed with the assumption that the people interested in and involved in fixing these issues are not necessarily data engineers, so it is largely config-driven, with pluggable logic for categorization and cleaning. We will then talk about how it helped us fix data quality issues at scale and reduce many of the recurring issues.

In this session watch:
Keerthika Thiyagarajan, Developer, Flipkart
Shankar Manian, Sr. Director of Engineering, Recko, Inc.


Transcript

Shankar Manian: Hello, everyone. Hope you’re all staying well and safe in this difficult environment. We are here to talk about ChakraView, our approach to data quality. Before we begin, let me give a brief background about ourselves. I’m Shankar, I lead engineering at Recko. Recko is focused on automating financial operations and providing data reconciliation as a service. And with me here today is Keerthika. She’s a tech lead at Flipkart. Flipkart is the dominant e-commerce player in India. We are going to be talking about the work we did together at Flipkart in the financial engineering team there. Between the two of us, we have many years of experience in big data. At least that’s what I would tell my friends and family. In reality, though, as many of you can relate, all we did was clean up data and fix data quality issues. This project started from our effort to stop being data janitors, and to see if we could do something better.
As we found in our research, we were not unique. Data cleaning and fixing data quality issues are the least fun activities in data science. Most people come into data science imagining automated cars and robots, but in reality, they spend 80% of their time fixing data quality issues.
Why does it take so much of our time? It’s because it’s mostly an afterthought. In our excitement to ship models or provide data, we don’t pay enough attention to data quality until after it has become a major issue. By that time, the problem is so big that it becomes like searching for a needle in a haystack. It costs a lot of time and resources to find the issue and its source, and even more time to fix it. What are we missing? How do we make this fun?
The problem begins with our ability to detect data quality issues. We don’t do that early enough. Most organizations do have some kind of data quality checks. However, if you take a closer look, they don’t go beyond some basic checks. Maybe a null check here, or a value check there, and that’s the extent of the data quality checks we typically have. Most of those don’t check for completeness. There is no confirmation that all the data from the source system has made it into the analytical system; it is simply assumed.
Another aspect of quality that is not paid enough attention is consistency. In the real world, data has many copies and sometimes even many sources. The consistency of the data across all of these sources is rarely, if ever, checked. And that is why you will sometimes see the business department giving one sales figure while the finance department gives another.
Beyond these, one aspect that is completely ignored is auditability. To ultimately build confidence in the data quality, we need to be able to prove the quality of the data at its most granular level. We need to be able to trace any issue to its source system. And that is typically missing.
Detecting data quality issues properly only tells you how much of a problem you have. However, as we saw earlier, cleansing the data is the most time-consuming part. As your detection improves, you are going to be spending even more time cleaning the data. Why is that? It requires a lot of time and analysis on a large amount of data to do the RCA and find the source of the issues. And even after finding the source, there is no clear SOP to fix it. It requires a lot of ad-hoc scripting to figure out a way to fix it. The individual steps are highly repetitive, which makes them good candidates for automation.
However, it is not that easy to automate, because the specific issue you are dealing with is often not repetitive. The individual steps are, but the overall issue is a new issue every time. We need a pluggable framework that can compose a custom solution for each issue out of individual steps that are automated. We will go more into this later, as that is precisely the approach we took.
Regardless of how much sophistication we put into detecting and cleaning data quality issues, ultimately it is a reactive process. How do we prevent data quality issues from happening in the first place? That is the key question we ultimately have to answer. The software systems that produce the data are often unaware that they are producing bad data and of the cost of that. And when they do come to know, there are so many issues that they don’t know where to start. A data quality framework also has to drive accountability with the producers of the data and push quality upstream by providing visibility into the cost and impact of the bad data and helping prioritize which issues are the most important ones to fix. With that context, let me call upon Keerthika now to talk about how we addressed these gaps at Flipkart.

Keerthika Thiyagarajan: Hello everyone. Thank you, Shankar, for the introduction.
How many of us here are spending a lot of time finding issues, doing RCAs and cleaning data? We in the financial engineering team at Flipkart were in the same situation. Let me start by explaining how data quality was measured previously. Our stakeholders, including the finance team, were tracking all the business metrics using the financial reports delivered by us, the financial engineering team. They used to find issues in the reports and highlight the gaps present in these financial reports. By the time the gaps were highlighted, the impact had generally grown to a few millions. Checking one issue generally led to uncovering more issues, and this became a cycle.
We started our journey by figuring out all the granular validations that were needed to track all these business metrics. Along with that, we also discovered multiple requirements on validating the data. First, the validations were very dynamic and changed with the business needs; hence self-serve onboarding became an important feature, allowing stakeholders to onboard validations directly onto the platform. The next requirement was to run these validations on the data as soon as it is made available, which is generally when the data gets refreshed in the analytical store. We wanted to provide confidence in the data we were serving our stakeholders, hence visibility into the gaps present in the system on a periodic basis became a must-have. A system health metrics dashboard became an important feature in providing that visibility.
Let me introduce the kind of data issues present in the system. All the bad records containing issues are highlighted in red. The sample report shown here is a combination of the bank statement and the ledger entries that are created. This report has multiple issues, like an amount mismatch and entries missing in one of the systems. These bad records were contributing to the gaps present in the business metrics.
With the understanding we gained from figuring out the business metrics and the data contributing to their impact, we were able to come up with our library of templates. These abstract templates included standard templates for null checks, data type checks and aggregated checks. In addition, there were range checks, where an amount must lie in a given range, and cross-comparison checks, which compare more than one dataset, like combining the bank statement and the ledger as shown in the previous example.
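(For illustration only: a minimal sketch of what such reusable check templates might look like in PySpark. The function names, signatures and column parameters below are our own assumptions, not the actual ChakraView code.)

```python
# Hypothetical sketch of abstract validation templates in PySpark.
# Each template returns a boolean Column that is True when a row FAILS the check.
from pyspark.sql import Column, DataFrame
from pyspark.sql import functions as F

def null_check(col: str) -> Column:
    # Fails when the column is null.
    return F.col(col).isNull()

def range_check(col: str, lo: float, hi: float) -> Column:
    # Fails when the amount lies outside the allowed range.
    return ~F.col(col).between(lo, hi)

def cross_comparison_check(left_col: str, right_col: str) -> Column:
    # Fails when two joined datasets (e.g. bank statement vs. ledger) disagree.
    return F.col(left_col) != F.col(right_col)

def aggregated_check(df: DataFrame, group_cols: list, agg_col: str, expected_total: float) -> DataFrame:
    # Returns the groups whose aggregated amount deviates from the expected total.
    return (df.groupBy(*group_cols)
              .agg(F.sum(agg_col).alias("actual_total"))
              .where(F.col("actual_total") != expected_total))
```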
Once we had the templates, we needed to add filtering logic and transformation logic. Filtering logic enables us to exclude records that are not valid for a particular validation. Transformation logic is added to convert data into a specific format, for example, rounding a column containing float values to two decimal places. The next feature was to build the target data frame to validate. This was done by joining one or more facts and applying group-by operations as needed. The next feature was to emit all the validation failures present per row in the target data frame.
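(For illustration: a hypothetical sketch of building such a target data frame and emitting per-row failures. The table names, columns, join key and filter used here are assumptions made for the example.)

```python
# Hypothetical sketch: filter, transform, join two facts, emit per-row failures.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("chakraview-demo").getOrCreate()

bank = spark.table("bank_statement")      # assumed fact table
ledger = spark.table("ledger_entries")    # assumed fact table

# Filtering logic: exclude records not valid for this validation.
bank = bank.where(F.col("status") != "CANCELLED")

# Transformation logic: normalise float amounts to two decimal places.
bank = bank.withColumn("amount", F.round("amount", 2))
ledger = ledger.withColumn("amount", F.round("amount", 2))

# Build the target data frame by joining the facts.
target = bank.alias("b").join(ledger.alias("l"), on="txn_id", how="full_outer")

# Emit one failure record per bad row, tagged with the check that failed.
failures = target.select(
    "txn_id",
    F.when(F.col("b.amount").isNull() | F.col("l.amount").isNull(), "entry_missing")
     .when(F.col("b.amount") != F.col("l.amount"), "amount_mismatch")
     .alias("validation_failure"),
).where(F.col("validation_failure").isNotNull())
```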
All these features were exposed in a UI to onboard validations by choosing the target dataset, template, filter and transformation logic. This UI made onboarding of validations self-serve. These validations get stored in the data store as shown here. This configuration stores how to build the target dataset and what kind of operations and checks are to be applied on each row of the target data frame.
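(A hypothetical example of what one such stored validation configuration might look like. The field names and values are illustrative assumptions, not the actual ChakraView schema.)

```python
# Illustrative validation configuration record as it might sit in the data store.
validation_config = {
    "validation_id": "bank_vs_ledger_amount",
    "target_dataset": {
        "facts": ["bank_statement", "ledger_entries"],  # facts to join
        "join_keys": ["txn_id"],
        "group_by": [],                                  # optional aggregation
    },
    "filters": ["status != 'CANCELLED'"],                # filtering logic
    "transformations": [{"column": "amount", "op": "round", "precision": 2}],
    "template": "cross_comparison_check",                # abstract template to apply
    "template_args": {"left_col": "b.amount", "right_col": "l.amount"},
    "schedule": "on_fact_refresh",                       # run as soon as data is refreshed
}
```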
Let me explain how the data flows in the pipeline. Once fact refresh is done indicating the data is fresh, we get it triggered from our scheduler, which is Azkaban. A spark job is triggered by combining the obstruct validations template and the configurations stored in the data store. The output to this spark job is the validation failures in a granular manual. The bad records along with the validation failures per row are stored into the data store. This data is used for powering the system health dashboard. Similar to this, the [inaudible] health dashboard is shown here. This dashboard provides an insight on the data quality at any given point of time. This enabled no surprises after the financial reports were generated. The stakeholders were very well aware of the quality of data at any given point of time. Tracking business metrics proactively became much easier. The impact of the validation failures and the number of affected rules that are available at any given point of time. Many important financial decisions were taken using the same. The corrected measures were taken by stakeholders using the data present in this dashboard.
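(For illustration: a hypothetical sketch of such a driver job, combining stored configurations with the abstract templates and persisting granular failures. Table names, config fields and the template registry are assumptions, not the actual implementation.)

```python
# Hypothetical validation driver: apply the configured template to each target
# dataset and persist per-row failures for the system health dashboard.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("chakraview-validations").getOrCreate()

# Illustrative template registry (see the earlier template sketch).
TEMPLATES = {
    "null_check": lambda col: F.col(col).isNull(),
    "range_check": lambda col, lo, hi: ~F.col(col).between(lo, hi),
}

# Stand-in for the configurations stored in the data store.
configs = [
    {"validation_id": "ledger_amount_not_null", "target_dataset": "ledger_entries",
     "template": "null_check", "template_args": {"col": "amount"}},
]

for cfg in configs:
    target = spark.table(cfg["target_dataset"])   # simplified: single fact, no joins
    failure_cond = TEMPLATES[cfg["template"]](**cfg["template_args"])
    failures = (target
                .where(failure_cond)
                .withColumn("validation_id", F.lit(cfg["validation_id"])))
    # Persist granular, per-row failures; this data powers the health dashboard.
    failures.write.mode("append").saveAsTable("validation_failures")
```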
This kind of visibility was great, but it led to our next problem: finding the cause of the validation failures and fixing them. This was taking a lot of our effort. The first step in fixing any issue is to categorize it. By categorization, I mean assigning a bucket to the bad records found in the system. As we can see in this example, each record is assigned a category, ranging from a wrong amount in one of the systems, to an event processing issue due to an upstream API failure, to issues like file upload failures. This RCA was done on the financial data pipeline, whose architecture resembles the one shown here. As we can see, there are multiple microservices, Kafka queues and multiple data stores. The data flows into the analytical store at the end of the pipeline, where all the financial reports are generated. Each of these components could be a failure point in the system.
Debugging this complex pipeline became very time consuming. A person has to track all the changes happening in the pipeline to successfully do an RCA. The complexity often led to a wrong RCA. Doing this entire RCA process manually was very costly. Let me put forth a question here. How many of us here hated doing RCAs manually and documenting them? I was one such person. How many managers here got fed up with following up with engineers to create runbooks and document the RCAs? Shankar was one such person. I hope everyone here will be able to relate to the situation.
I still remember the day when Shankar came to me saying, Keerthika, why can’t we make this operation cheaper? Why don’t we automate this entire RCA process? I began my search to find an ELK kind of tool for big data but, unfortunately, I was not able to find any such tool in the open community. This led to our journey in building an auto-RCA. Did we start building this platform as a standalone platform? No. The first step in building the auto-RCA tool was not building the platform from scratch, but making the financial data pipeline better. We started by adding structured logs to the components in the pipeline. These structured logs enabled us to automate the categorization using the logs produced in the data pipeline.
By a structured log, I mean a log similar to the one shown here. This log is produced by one of the microservices present in our financial pipeline. It shows that an event has failed while processing, and the reason for the failure is specified in the details. In this example, an API request had failed. The API details, along with the error response, are captured in the log.
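(The exact log from the slide is not reproduced here; below is a hypothetical sketch of the kind of structured failure log a pipeline microservice might emit. The field names are assumptions, not Flipkart's actual log schema.)

```python
# Hypothetical structured failure log emitted by a pipeline microservice.
import json
import logging

logger = logging.getLogger("ledger-service")

def log_event_failure(event_id: str, api_url: str, status_code: int, error_body: str) -> None:
    # One JSON document per failure keeps the log machine-parsable for auto-RCA.
    logger.error(json.dumps({
        "event_id": event_id,
        "stage": "EVENT_PROCESSING",
        "status": "FAILED",
        "failure_reason": "UPSTREAM_API_FAILURE",
        "api_request": {"url": api_url, "status_code": status_code},
        "error_response": error_body,
    }))

log_event_failure("evt-12345", "https://payments.internal/api/v1/settle", 503, "service unavailable")
```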
This structured logging enabled us to do a 5-why kind of RCA. We decided to do a hierarchical categorization, which produces multiple leaf categories. Each leaf category represents a unique issue present in the system. If two issues in the same category have some kind of difference, then we generally break that category into multiple buckets. The logic to do the categorization is pluggable and can be added with a simple API implementation and a config change.
A sample categorization for the event processing failure bucket is shown here. All the records start off classified under a root category. Let’s take the example of a failure in event processing. What was the issue present in the bad record? There was a missing entry. Where was the missing entry? It was in the ledger system. Why did the missing entry happen? It was due to an event processing failure. Thus, we were able to reach a leaf category, and this leaf category was assigned to the bad record. We were able to achieve a 5-why RCA using the structured logging and structured audit trails added to the financial pipeline.
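(For illustration: a hypothetical sketch of a pluggable, hierarchical categorizer in the 5-why spirit described above. The class names, fields and category path are assumptions, not the actual ChakraView API.)

```python
# Hypothetical pluggable categorizer producing a hierarchical category path.
from abc import ABC, abstractmethod
from typing import Optional

class Categorizer(ABC):
    @abstractmethod
    def categorize(self, bad_record: dict, logs: list[dict]) -> Optional[list[str]]:
        """Return the category path from root to leaf, or None if not applicable."""

class EventProcessingFailureCategorizer(Categorizer):
    def categorize(self, bad_record, logs):
        # What: a missing entry. Where: the ledger system. Why: event processing failed.
        if bad_record.get("failure") == "entry_missing" and bad_record.get("system") == "ledger":
            if any(log.get("failure_reason") == "UPSTREAM_API_FAILURE" for log in logs):
                return ["MissingEntry", "Ledger", "EventProcessingFailure", "UpstreamApiFailure"]
        return None

# Categorizers are registered via configuration; the first one returning a path wins.
CATEGORIZERS: list[Categorizer] = [EventProcessingFailureCategorizer()]
```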
Now, along with the visibility into the granular gaps present in the system, we were also able to provide an insight into why each gap occurred. We were able to add one more drill-down to our validations to show why the validation failure happened in the system. This dashboard enabled us to concentrate on the buckets that would provide better data quality.
In the financial world, even a few dollars cannot be ignored. This dashboard helped us identify even the small gaps present in the system in a structured way. Most data pipelines allow you to ignore bad records by eliminating them from the dataset, but in the financial world that is not possible. We are supposed to clean up all the data quality issues to be able to provide accurate financial reports.
The next question we asked ourselves was whether we could automate this cleaning process too. We started automating the cleaning process by assigning what we call a recipe to each of the leaf categories. Assigning a unique issue to each category became a must-have feature. If there was more than one way to fix the data in the same bucket, we ended up breaking the category into multiple buckets. This way we were able to achieve a one-to-one relationship between a leaf category and a recipe. We were able to map the two using a simple configuration, as shown here. We were able to create a library of recipes, which was the code written to fix the data. Each leaf category was mapped to one recipe. The recipe library provided a vast set of functionalities. What are some of the cleaning patterns we found and formed recipes for? We created a recipe to reverse a financial transaction, a recipe to retry the pipeline once a fix is done, a recipe to correct a financial transaction, and a recipe for the more common use case of restoring data from the cold store.
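(For illustration: a hypothetical sketch of the one-to-one mapping between leaf categories and recipes. Recipe names, category strings and signatures are assumptions made for the example, not the actual recipe library.)

```python
# Hypothetical recipe registry: each leaf category maps to exactly one recipe.
from typing import Callable

def reverse_transaction(bad_record: dict) -> None:
    print(f"reversing transaction {bad_record['txn_id']}")            # placeholder fix logic

def retry_pipeline(bad_record: dict) -> None:
    print(f"re-emitting event for {bad_record['txn_id']} through the pipeline")

def restore_from_cold_store(bad_record: dict) -> None:
    print(f"restoring {bad_record['txn_id']} from the cold store")

RECIPES: dict[str, Callable[[dict], None]] = {
    "MissingEntry/Ledger/EventProcessingFailure/UpstreamApiFailure": retry_pipeline,
    "AmountMismatch/Ledger/WrongAmount": reverse_transaction,
    "MissingEntry/Ledger/DataPurged": restore_from_cold_store,
}

def fix(bad_record: dict, leaf_category: str) -> None:
    # The one-to-one mapping means a leaf category fully determines the fix.
    RECIPES[leaf_category](bad_record)
```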
To summarize, this is the architecture of the pipeline we built. We had a validation job detecting bad records from the analytical store. These bad records were brought into our data store, on top of which log enrichment and categorization into multiple buckets were done. Recipes were added to fix the data by slicing and dicing the data present in the data store. We built multiple dashboards to provide visibility to our users and stakeholders. Using this approach, we were able to reduce the time taken for the entire process from a few days to a few hours, which is the time taken to run the pipeline. We were able to make the entire process a person-independent one. It became a developer-friendly process, since everything is automated. This also provided complete visibility into data quality at any point in time, thus gaining trust from our stakeholders.
Are we done in our journey for data quality? No, we still have some improvements to make. We want to open source the framework to extend it and seek feedback from the community. In terms of features, we plan on building a data observability platform. Given a record, we should be able to track whether an issue is present and why the issue happened. This is needed because an audit trail on a specific record is often required for compliance purposes. We want to do some performance improvements too. We currently handle hundreds of validations, and that number is going to grow to a thousand soon; we are expecting exponential growth in the number of validations. Also, the pipeline takes a few hours to run. We want to try and make it near real time by reducing that to a few minutes. That has been our journey in improving data quality so far. Thank you everyone for the opportunity. Shankar and I will be available to take any questions. Thank you.

Keerthika Thiyagarajan

Keerthika Thiyagarajan, currently a Software Development Engineer 3, has been working in the Flipkart Financial Data Engineering team for the past 5 years.

Shankar Manian

Shankar leads Engineering at Recko. Recko has built a Financial Operations Platform and provides data reconciliation as a service to modern internet companies. In a career spanning 20+ years, he ha...