Yanyan Wu is an innovative technology leader with years of experience in engineering design, R&D management, product portfolio management, and software development before moving into data science. She holds a Ph.D. in mechanical engineering from Arizona State University and an MBA from the Kelley School of Business at Indiana University. She has authored or co-authored 12 US and international patents on industrial design and manufacturing technologies as well as on data analytics. She led teams at GE and Halliburton before joining the energy division of Verisk as VP of data and data system. She is passionate about big data analytics, 2D and 3D data visualization, VR/AR technologies, and CAD design. She enjoys working with top talent to advance big data analytics technologies and to apply them to impactful business problems, reducing cost and improving efficiency.
May 27, 2021 11:00 AM PT
Data completeness is key to building any machine learning or deep learning model. In reality, outliers and nulls are widespread in the data. The traditional methods of filling with fixed values or statistical metrics (min, max, and mean) do not consider the relationships and patterns within the data. Most of the time they offer poor accuracy and can introduce additional outliers. Also, given our large data size, the computation is extremely time-consuming and is often constrained by the limited resources of a local computer.
To address those issues, we have developed a new approach that first leverages the similarity among our data points, based on the nature of the data source, and then uses a collaborative AI model to fill null values and correct outliers.
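As a rough illustration of the idea of filling a null from similar data points rather than from a global statistic, here is a minimal single-machine sketch. It is not the collaborative AI model from the talk: it stands in a simple inverse-distance-weighted average over the k nearest records, and the function name, field names, and toy records are all hypothetical.

```python
import math

def fill_nulls_from_neighbors(records, coord_keys, target, k=2):
    """Fill missing `target` values with an inverse-distance-weighted
    mean of the k nearest records that do have a value.
    Hypothetical helper for illustration only."""
    known = [r for r in records if r.get(target) is not None]
    filled = []
    for r in records:
        if r.get(target) is not None or not known:
            # Keep records that already have a value (or cannot be estimated).
            filled.append(dict(r))
            continue
        here = [r[c] for c in coord_keys]
        # Brute-force search: distance from this record to every known record.
        nearest = sorted(
            (math.dist(here, [n[c] for c in coord_keys]), n[target])
            for n in known
        )[:k]
        # Closer neighbors get larger weights; epsilon avoids division by zero.
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]
        estimate = sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
        filled.append({**r, target: estimate})
    return filled

# Toy data (made up): three records with values, one with a null.
records = [
    {"x": 0.0, "y": 0.0, "value": 10.0},
    {"x": 0.0, "y": 2.0, "value": 20.0},
    {"x": 10.0, "y": 10.0, "value": 100.0},
    {"x": 0.0, "y": 1.0, "value": None},
]
filled = fill_nulls_from_neighbors(records, coord_keys=("x", "y"), target="value", k=2)
neighbor_estimate = filled[3]["value"]
global_mean = sum(r["value"] for r in records if r["value"] is not None) / 3
```

In this toy case the neighbor-based estimate is driven by the two nearby records, while the global mean is pulled toward the distant record, which is the kind of distortion the abstract attributes to fixed statistical fills.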
In this talk, we will walk through how we use a distributed framework to partition data with a KDB-tree for neighbor discovery, and a collaborative filtering AI technique to fill missing values and correct outliers. In addition, we will demonstrate how we rely on Delta Lake and MLflow for data and model management.
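To make the partitioning step more concrete, the sketch below splits points recursively at the median of alternating axes until each partition holds at most a fixed number of points. This is a simplified k-d-style split written for illustration, not a full KDB-tree and not the distributed implementation described in the talk; the function name, the capacity parameter, and the sample points are assumptions.

```python
def partition_points(points, capacity=2, depth=0):
    """Recursively split points at the median of alternating axes until
    every partition holds at most `capacity` points (capacity >= 1).
    Hypothetical helper for illustration only."""
    if len(points) <= capacity:
        return [points]
    axis = depth % len(points[0])          # cycle through the dimensions
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                # median split keeps halves balanced
    return (partition_points(ordered[:mid], capacity, depth + 1)
            + partition_points(ordered[mid:], capacity, depth + 1))

# Toy data (made up): two spatial clusters of four points each.
points = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 10), (10, 11), (11, 11), (1, 0)]
partitions = partition_points(points, capacity=2)
```

Because nearby points tend to land in the same partition, a neighbor search can run independently inside each partition, which is what makes the work easy to distribute. A point near a partition boundary may have its true nearest neighbors in an adjacent partition, so a practical system would likely need some overlap or buffering between partitions.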