Javier holds a double degree in Math and Software Engineering and has decades of industry experience with a focus on data analysis. He currently works at RStudio and previously worked at Microsoft Research and SAP.
In this talk you will learn how to easily configure Apache Arrow with R on Apache Spark, which will allow you to gain speed improvements and expand the scope of your data science workflows; for instance, by enabling data to be transferred efficiently between your local environment and Apache Spark. This talk will present use cases for running R at scale on Apache Spark. It will also introduce the Apache Arrow project and recent developments that enable running R with Apache Arrow on Apache Spark to significantly improve performance and efficiency. We will end this talk by discussing performance and recent developments in this space.
This session will start with a recap of what sparklyr is and how it can be used to analyze, visualize, and perform machine learning in Spark from R. We will walk through installation, configuration, data wrangling with SQL or dplyr, modeling in MLlib or H2O, and extending sparklyr by calling Scala functions from R or writing Scala modules accessible from R. You'll then get a detailed update on new sparklyr features. After sparklyr 0.4 was released to CRAN last year, RStudio released 0.5, which implements new connections, features, and architecture changes worth reviewing. We will wrap up with a discussion of use cases relevant in the R ecosystem. The use cases will demonstrate how to model data using popular frameworks in the R ecosystem, with seamless interactions between Spark and R using sparklyr. Session hashtag: #SFdd8