Sean is a data scientist at Databricks. He is an Apache Spark committer and PMC member, and co-author of Advanced Analytics with Spark. Previously, he was director of Data Science at Cloudera and an engineer at Google.
Solving a data science problem is about more than making a model. It entails data cleaning, exploration, modeling and tuning, production deployment, and workflows governing each of these steps. In this simple example, we'll take a look at how health data can be used to predict life expectancy. The example starts with data engineering in Apache Spark and data exploration, followed by model tuning and logging with Hyperopt and MLflow. It continues with examples of how the model registry governs model promotion, and ends with simple deployment to production with MLflow as a job or dashboard.
Deep learning sometimes seems like sorcery. Its state-of-the-art applications are at times delightful and at times disturbing. It's no wonder that companies are eager to apply deep learning to more prosaic business problems like better churn prediction, image curation, chatbots, time series analysis, and more. This talk won't examine how to tune a deep learning architecture for accuracy. Instead, it will walk through basic steps to avoid common performance pitfalls in training, and then the right steps, in order, to scale up by applying more complex tooling and more hardware. Hopefully, you will find your modeling job can move along much faster without reaching immediately for a cluster of extra GPUs.
Careful with that modeling tool! Even the simplest data analysis problems can have surprising statistical subtleties, which can lead the aspiring data scientist to draw the wrong conclusions from data. This talk will examine three straightforward scenarios where many answers seem correct. It will examine how the notion of causality helps resolve all of them, and briefly explore the power of graphical models and Judea Pearl's do-calculus. By the end of this session, you will be more careful with the modeling tool, and will have learned that correlation is not always causation. Session hashtag: #SAISDS3