Scalable Bayesian Inference with Spark, SparkR, and Microsoft R Server

R has become the de facto language for statisticians, with nearly 10,000 packages to choose from for statistical inference, visualization, and machine learning. However, the base CRAN implementation of R is limited in scalability: it is single-threaded and bounded by the memory of a single node. In this talk, I will summarize recent advancements in the R APIs for Spark and show how they can be combined with Microsoft R Server on Spark to create a scalable machine learning platform. In particular, I will show how an R user can create functional pipelines for Spark DataFrames and RevoScaleR XDFs (external data frames) to conduct Bayesian inference at scale. Examples include estimating cluster membership in Gaussian mixture models using Variational Consensus Monte Carlo, large-scale topic modeling with stochastic variational inference, and Bayesian estimation of neural networks with Stochastic Gradient Hamiltonian Monte Carlo. All examples will be developed entirely in R, and I’ll describe best practices for performance and reproducibility.

About Ali Zaidi

Ali is a data scientist on the AI Research team at Microsoft. He spends his days making distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike.