ChakraView – A 360° Approach to Data Quality

The availability of high-quality data is central to the success of any organization today. As every organization ramps up its collection and storage of data, the usefulness of that data depends largely on confidence in its quality. In the Financial Data Engineering team at Flipkart, where the bar for data quality is 100% correctness and completeness, this problem takes on a wholly different dimension. Currently, a large number of data analysts and engineers search for issues in the financial data to keep it at that bar. We wanted an approach that is less manual, more scalable, and more cost-effective.

 

As we evaluated various solutions available in the public domain, we found quite a few gaps. 

  1. Most frameworks are limited in the kinds of issues they detect. While many detect internal consistency issues at the schema and dataset level, none detect consistency issues across datasets or check for completeness.
  2. There is no common framework for cleaning and repairing data once an issue has been found.
  3. Fixing data quality issues requires the right categorisation of the issues to drive accountability with the producer systems. Very few frameworks support categorisation of issues and give producers visibility into them.

 

In this presentation, we discuss how we developed a comprehensive data quality framework. We built it on the assumption that the people interested in, and involved in fixing, these issues are not necessarily data engineers. The framework is therefore largely config-driven, with pluggable logic for categorisation and cleaning. We will then talk about how it helped us fix data quality issues at scale and reduce the number of repeated issues.
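To make the idea of a config-driven framework with pluggable categorisation concrete, here is a minimal Python sketch. It is an illustration only, not ChakraView's actual design or API: the config keys, the `completeness_check` function, the `by_producer` categoriser, and the sample `orders` and `invoices` datasets are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Issue:
    check: str                      # name of the check that raised the issue
    key: object                     # record key the issue refers to
    category: str = "uncategorised"


# Registry of pluggable categorisers, looked up by the name used in the config.
CATEGORISERS: Dict[str, Callable[[Issue], str]] = {}


def categoriser(name: str):
    """Decorator that registers a categorisation function under `name`."""
    def register(fn: Callable[[Issue], str]) -> Callable[[Issue], str]:
        CATEGORISERS[name] = fn
        return fn
    return register


@categoriser("by_producer")
def by_producer(issue: Issue) -> str:
    # Hypothetical rule: a missing record is attributed to the producer system.
    return "producer_missing_record" if issue.check == "completeness" else "other"


def completeness_check(source: List[Row], target: List[Row], key: str) -> List[Issue]:
    """Cross-dataset check: every key present in `source` must appear in `target`."""
    target_keys = {row[key] for row in target}
    return [Issue("completeness", row[key]) for row in source if row[key] not in target_keys]


def run(config: dict, datasets: Dict[str, List[Row]]) -> List[Issue]:
    """Run every check listed in the config and categorise the issues found."""
    issues: List[Issue] = []
    for check in config["checks"]:
        if check["type"] != "completeness":
            raise ValueError(f"unknown check type: {check['type']}")
        found = completeness_check(
            datasets[check["source"]], datasets[check["target"]], check["key"]
        )
        categorise = CATEGORISERS[check["categoriser"]]
        for issue in found:
            issue.category = categorise(issue)
        issues.extend(found)
    return issues


# Hypothetical config and data: every order should have a matching invoice.
CONFIG = {
    "checks": [
        {
            "type": "completeness",
            "source": "orders",
            "target": "invoices",
            "key": "order_id",
            "categoriser": "by_producer",
        }
    ]
}
DATASETS = {
    "orders": [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}],
    "invoices": [{"order_id": 1}, {"order_id": 3}],
}

issues = run(CONFIG, DATASETS)
```

In this sketch, someone who is not a data engineer would edit only the config, while engineers contribute new check types or categorisers as registered functions.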

About Keerthika Thiyagarajan

Keerthika Thiyagarajan, currently a Software Development Engineer 3, has been working in the Flipkart Financial Data Engineering team for the past 5 years.

About Shankar Manian

Shankar leads Engineering at Recko. Recko has built a Financial Operations Platform and provides data reconciliation as a service to modern internet companies. In a career spanning 20+ years, he has built a variety of distributed systems. At LinkedIn, he led the optimization and productivity improvements of their Hadoop platform. Before that, he was with Microsoft, where he helped build the middle-tier platform for Bing Search and a highly successful distributed test automation system for Windows clusters. Much of his recent work has been presented at major industry conferences such as Kafka Summit, Spark Summit, and DataWorks Summit, and at many big data meetups in Bangalore and the Bay Area.