Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author of Advanced Analytics with Spark. Previously, he was director of Data Science at Cloudera and an engineer at Google.
May 27, 2021 11:00 AM PT
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access to and visibility into the data, the reports and dashboards built against it, reproducibility, and the insights uncovered within it. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
May 27, 2021 03:15 PM PT
In this session, you will learn how to scale your exploratory data analysis and data science workflows with Databricks. You will learn how to collaborate with team members writing code in different languages (Python, R, Scala, SQL) using Databricks Workspace, explore data with interactive visualizations to discover new insights, and securely share code with co-authoring, commenting, automatic versioning, Git integrations, and role-based access controls. You will also learn best practices for managing experiments, projects, and models using MLflow. Attendees will build a pipeline to log and deploy machine learning models to production.
This session will be "follow along": you are welcome to run the notebooks yourself live, but it is not required, and they can be re-run later as well. If you want to follow along, download the notebooks from https://files.training.databricks.com/classes/data-science-on-databricks/ . We recommend downloading the version with solutions.
For access to Databricks, sign up for free at https://community.cloud.databricks.com/ . Import the notebooks and provision a cluster using Databricks runtime 7.3 ML.
November 18, 2020 04:00 PM PT
Solving a data science problem is about more than making a model. It entails data cleaning, exploration, modeling and tuning, production deployment, and workflows governing each of these steps. In this simple example, we’ll take a look at how health data can be used to predict life expectancy. It will start with data engineering in Apache Spark, data exploration, model tuning and autologging with hyperopt and MLflow. It will continue with examples of how the model registry governs model promotion, and simple deployment to production with MLflow as a job or REST endpoint. This tutorial will cover the latest innovations from MLflow 1.12.
Speaker: Sean Owen
June 24, 2020 05:00 PM PT
Solving a data science problem is about more than making a model. It entails data cleaning, exploration, modeling and tuning, production deployment, and workflows governing each of these steps. In this simple example, we'll take a look at how health data can be used to predict life expectancy. It will start with data engineering in Apache Spark, data exploration, model tuning and logging with hyperopt and MLflow. It will continue with examples of how the model registry governs model promotion, and simple deployment to production with MLflow as a job or dashboard.
June 23, 2020 05:00 PM PT
Deep learning sometimes seems like sorcery. Its state-of-the-art applications are at times delightful and at times disturbing. It's no wonder that companies are eager to apply deep learning for more prosaic business problems like better churn prediction, image curation, chatbots, time series analysis and more. This talk won't examine how to tune a deep learning architecture for accuracy. This talk will instead walk through basic steps to avoid common performance pitfalls in training, and then the right steps, in order, to scale up by applying more complex tooling and more hardware. Hopefully, you will find your modeling job can move along much faster without reaching immediately for a cluster of extra GPUs.
January 19, 2022 08:40 PM PT
Careful with that modeling tool! Even the simplest data analysis problems can have surprising statistical subtleties, which can lead the aspiring data scientist to the wrong conclusions from data. This talk will examine three straightforward scenarios where many answers seem correct. It will examine how the notion of causality helps resolve all of them, and briefly explore the power of graphical models and Judea Pearl's do-calculus.
By the end of this session, you will be more careful with modeling tools and will understand why correlation does not always imply causation.
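As a rough illustration of the kind of subtlety the abstract describes (not necessarily one of the talk's three scenarios), the sketch below uses the widely quoted kidney-stone counts often used to demonstrate Simpson's paradox. It assumes, purely for illustration, that stone size is a confounder influencing both the choice of treatment and the outcome. Under that assumed causal graph, the back-door adjustment from do-calculus weights each stratum's success rate by the overall share of that stratum. The names `counts`, `naive_rate`, and `adjusted_rate` are hypothetical.

```python
# Counts are (successes, patients) per treatment and stone size.
# These are the commonly quoted kidney-stone figures, used only as an example.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def naive_rate(treatment):
    """Pooled success rate, ignoring stone size entirely."""
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return successes / patients


def stratum_weights():
    """Share of all patients in each stone-size stratum, P(z)."""
    totals = {}
    for strata in counts.values():
        for size, (_, n) in strata.items():
            totals[size] = totals.get(size, 0) + n
    grand_total = sum(totals.values())
    return {size: n / grand_total for size, n in totals.items()}


def adjusted_rate(treatment):
    """Back-door adjustment: sum over z of P(success | treatment, z) * P(z).

    Valid only under the ASSUMPTION that stone size is the sole confounder.
    """
    weights = stratum_weights()
    return sum(weights[size] * s / n for size, (s, n) in counts[treatment].items())


naive = {t: naive_rate(t) for t in counts}
adjusted = {t: adjusted_rate(t) for t in counts}

for t in counts:
    print(f"treatment {t}: naive={naive[t]:.3f} adjusted={adjusted[t]:.3f}")
```

The pooled rates favor treatment B, while the adjusted rates favor treatment A. Which number is the right one to act on depends on the assumed causal structure, not on the data alone.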
Session hashtag: #SAISDS3