Chao Yang is an avid big data professional who focuses on big data engineering and on applying machine learning and deep learning to business and engineering problems. He started his career in cancer research, where he investigated tumor metabolism in lung cancer and immunotherapy across various cancer types. Prior to joining Verisk, he worked at Halliburton as a data scientist, where he implemented multiple big data solutions for processing and analyzing Oil & Gas data. He holds two master’s degrees: one in Cancer Biology from The University of Texas MD Anderson Cancer Center and one in Computer Science from the University of Houston.
May 27, 2021 11:00 AM PT
Data completeness is key to building any machine learning or deep learning model. In reality, outliers and nulls are widespread in the data. Traditional methods that substitute fixed values or summary statistics (min, max, or mean) do not consider the relationships and patterns within the data. Most of the time they offer poor accuracy and can introduce additional outliers. Also, given our large data size, the computation is extremely time-consuming and is often constrained by the limited resources of a local computer.
To address those issues, we have developed a new approach that first leverages the similarity among our data points, based on the nature of the data source, and then uses a collaborative AI model to fill null values and correct outliers.
In this talk, we will walk through how we use a distributed framework to partition data by KDB tree for neighbor discovery and a collaborative filtering AI technique to fill missing values and correct outliers. In addition, we will demonstrate how we rely on Delta Lake and MLflow for data and model management.
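To make the neighbor-based idea concrete, here is a minimal single-machine sketch of neighborhood-style collaborative filtering for missing values. It is an illustration under stated assumptions, not the implementation presented in the talk: it uses a brute-force neighbor search in place of KDB-tree partitioning on a distributed framework, and the names `distance`, `fill_missing`, and the parameter `k` are hypothetical.

```python
import math


def distance(a, b):
    """Root-mean-square difference over features observed in both rows.

    Returns None when the two rows share no observed feature.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def fill_missing(rows, k=2):
    """Fill None entries with a similarity-weighted average of the k nearest rows.

    Assumption: a brute-force scan stands in for the spatial partitioning
    a real pipeline would use to limit the neighbor search.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows that have column j observed.
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d is not None:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # no usable neighbor, leave the gap as-is
            # Closer neighbors get larger weights; epsilon avoids division by zero.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = total / sum(weights)
    return filled


# Illustrative data: the third row is far from the others in every feature.
rows = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [10.0, 20.0, 30.0],
    [1.0, 1.9, 3.2],
]
result = fill_missing(rows, k=2)
```

In this toy data, the filled value for the second row is drawn from its two similar rows and lands between their values, whereas a column mean would be pulled far upward by the distant third row. That difference is the motivation for using neighbors instead of global statistics.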