Data is the key ingredient to building high-quality, production AI applications. It comes in during the training phase, where more and higher-quality training data enables better models, as well as during the production phase, where understanding the model’s behavior in production and detecting changes in the predictions and input data are critical to maintaining a production application. However, so far most data management and machine learning tools have been largely separate.
In this presentation, we’ll talk about several efforts from Databricks, in Apache Spark, as well as other open source projects, to unify data and AI in order to make it significantly simpler to build production AI applications.
Session hashtag: #SAISAI2
Tim Hunter is a senior AI specialist at the ABN AMRO Bank. He was an early software engineer at Databricks and has contributed to the Apache Spark MLlib project, and he has co-created the Koalas, GraphFrames, TensorFrames and Deep Learning Pipelines libraries. He holds a Ph.D. in Machine Learning from UC Berkeley and he has been building distributed Machine Learning systems with Spark since version 0.0.2, before Spark was an Apache Software Foundation project.
Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford is on randomized algorithms for large-scale linear regression problems.