In this presentation, we’ll explain how to use the R programming language with Spark using a Databricks notebook and the SparkR package. We’ll discuss how to push data wrangling to the Spark nodes for massive scale and how to bring the results back to a single node so we can use open source R packages on the data. We’ll demonstrate how to convert SQL tables into distributed R data frames and how to convert R data frames into SQL tables. We’ll also look at how to train predictive models using data distributed over the Spark nodes. Bring your popcorn. This is a fun and interesting presentation.
Bryan Cafferky is a Microsoft Data Science and AI Technical Solutions Professional focused on helping healthcare customers understand and implement Data Analysis, Machine Learning, and AI solutions. He is a Microsoft 2017 Data Platform MVP and a 2016 Cloud and Data Center Management MVP. Bryan is the author of Pro PowerShell for Database Developers, published by Apress and available on Amazon. He leads The RI Microsoft BI User Group and The Greater Boston Area Data Science, Machine Learning, and AI Group. He has been working with the SQL Server stack since 1997 and has implemented projects in the banking, insurance, e-commerce, and utilities industries.