Xuan Wang

Data Scientist, Databricks

Xuan Wang is a data scientist/engineer at Databricks. He builds data products and ETL pipelines on top of Databricks’ Unified Analytics Platform and Apache Spark. Prior to joining Databricks, he was a postdoctoral researcher working on probabilistic models in random graphs and random media. He received his Ph.D. in Statistics from The University of North Carolina at Chapel Hill in 2014.

Past sessions

Summit 2019 Databricks: What We Have Learned by Eating Our Dog Food

April 24, 2019 05:00 PM PT

Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place, from highly reliable and performant data pipelines to state-of-the-art machine learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams with ready-to-use, pre-packaged clusters with optimized Apache Spark and various ML frameworks, coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada... But in addition to being a vendor, Databricks is also a user of UAP.

So, what have we learned by eating our own dogfood? Attend a “from the trenches” report from Suraj Acharya, Director of Engineering responsible for Databricks’ in-house data engineering team, on how his team put Databricks technology to use, the lessons they learned along the way, and best practices for using Databricks for data engineering.

Summit 2018 Cloud Cost Management and Apache Spark

June 5, 2018 05:00 PM PT

The cloud computing market is growing faster than virtually any other IT market today, according to Gartner [1]. Providing a unified analytics platform in public clouds, Databricks invests heavily in cloud computing. As a result, cloud expense has become a major category of our cost of goods sold (COGS) and operating expense (OPEX). Many companies share our story, embracing the cloud while facing the rising challenge of managing its cost.

In this session, we will share our experience with cloud cost management: the mistakes we made, the data we gathered, the lessons we learned, and the solutions we built. We will discuss general principles for managing accounts and services, assigning budgets, and attributing cost to internal teams. Using AWS as a concrete example, with Databricks and Spark as part of our solution, we will show how we: 1) make AWS cost and usage data available to finance and budget owners, 2) build data products that help budget owners monitor cost and take action by buying reserved instances and setting retention policies, and 3) use data science techniques to detect changes and forecast cost. The general principles and solutions we built are applicable to other cloud providers too.

Session hashtag: #DSSAIS13