Bridging the Completeness of Big Data on Databricks

May 27, 2021 11:00 AM (PT)


Data completeness is key to building any machine learning or deep learning model. In reality, outliers and nulls are widespread in the data. Traditional methods that substitute fixed values or summary statistics (min, max, and mean) do not consider the relationships and patterns within the data. Most of the time they offer poor accuracy and can introduce additional outliers. Also, given our large data size, the computation is extremely time-consuming and is often constrained by the limited resources of a local computer.
To address these issues, we have developed a new approach that first leverages the similarity among our data points, based on the nature of the data source, and then uses a collaborative AI model to fill null values and correct outliers.
In this talk, we will walk through how we use a distributed framework to partition data by KDB tree for neighbor discovery, and a collaborative filtering AI technique to fill the missing values and correct outliers. In addition, we will demonstrate how we rely on Delta Lake and MLflow for data and model management.
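
The partition-then-fill idea can be illustrated with a small single-machine sketch. It is not the speakers' implementation. It assumes two-dimensional records of the hypothetical form (id, x, y, value), uses a plain k-d style median split as a stand-in for a KDB tree, and uses an inverse-distance weighted average of same-partition neighbors as a stand-in for the collaborative filtering model. The names kd_partition and fill_missing are made up for this example.

```python
import math

def kd_partition(records, leaf_size=3, depth=0):
    """Split records into leaf partitions by median, alternating x and y.

    A plain k-d style split, used here as a simplified stand-in for a KDB tree.
    """
    if len(records) <= max(leaf_size, 1):
        return [records]
    axis = 1 + depth % 2  # tuple index 1 is x, index 2 is y
    ordered = sorted(records, key=lambda r: r[axis])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], leaf_size, depth + 1)
            + kd_partition(ordered[mid:], leaf_size, depth + 1))

def fill_missing(records, leaf_size=3):
    """Return {id: value}, filling None values from same-partition neighbors.

    Each missing value becomes the inverse-distance weighted average of the
    known values in its partition (an assumed, simplified similarity model).
    """
    filled = {}
    for part in kd_partition(records, leaf_size):
        known = [r for r in part if r[3] is not None]
        for rid, x, y, value in part:
            if value is not None or not known:
                filled[rid] = value  # nothing to fill, or no neighbors to use
                continue
            weights = [1.0 / (math.hypot(x - kx, y - ky) + 1e-9)
                       for _, kx, ky, _ in known]
            total = sum(w * r[3] for w, r in zip(weights, known))
            filled[rid] = total / sum(weights)
    return filled

# Toy data: two spatial clusters, each with one missing value.
records = [
    ("A", 0, 0, 10.0), ("B", 0, 1, None), ("C", 1, 0, 12.0),
    ("D", 10, 10, 100.0), ("E", 10, 11, None), ("F", 11, 10, 104.0),
]
result = fill_missing(records)
print(result)
```

Because each missing value is estimated only from records in its own partition, the fill for B draws on A and C rather than on the distant cluster, which a global mean would not do. In a distributed setting each partition could be processed independently, which is the property the partitioning step is meant to provide.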

In this session watch:
Chao Yang, Data Scientist, Verisk
Yanyan Wu, VP, Data and Data Analytics, Verisk


Chao Yang

Chao Yang is an avid big data professional, focusing on big data engineering and applying machine learning/deep learning technologies in solving business and engineering problems. He started his caree...

Yanyan Wu

Yanyan Wu is an innovative technology leader who had years of engineering design, R&D management, product portfolio management, and software development experience before shifting to data science worl...