
Deep Natural Language Processing with R and Apache Spark

Neural embeddings (Bengio et al. (2003), Olah (2014)) aim to map words, tokens, and general compositions of text to vector spaces, which makes them amenable to modeling, visualization, and inference. In this talk, we describe how to build and use neural embeddings of natural and programming languages with R and Spark. In particular, we'll see how the combination of a distributed computing paradigm in Spark with the interactive programming and visualization capabilities of R can make exploration and inference of natural language processing models easy and efficient. Building upon the tidy data principles formalized in Wickham (2014), Silge and Robinson (2016) have provided the foundations for crafting natural language models with the tidytext package.
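
As a minimal sketch of the tidy text approach, the snippet below tokenizes a toy corpus of commit-style messages into a one-token-per-row table with tidytext. The commits data frame is hypothetical illustration data, not part of the talk's actual pipeline.

library(tidytext)
library(dplyr)

# Hypothetical toy corpus of commit-style messages
commits <- tibble(
  id = 1:2,
  message = c("fix null pointer in tokenizer",
              "add streaming support to summarization pipeline")
)

# One-token-per-row tidy representation (per the tidy data principles),
# with stop words removed and term frequencies counted
commits %>%
  unnest_tokens(word, message) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)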

In this talk, we'll describe how we can build scalable pipelines within this framework to prototype text mining and neural embedding models in R, and then deploy them on Spark clusters using the sparklyr and RevoScaleR packages. To demonstrate the utility of this framework, we'll provide an example where we train a sequence-to-sequence neural attention model for summarizing git commits, pull requests, and their associated messages (Zaidi (2017)), and then deploy it on Spark clusters, where we can perform efficient network analysis on the neural embeddings with a sparklyr extension to GraphFrames.
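
For flavor, here is a minimal sketch of the deployment side: connecting to Spark from R with sparklyr and running PageRank over a small graph with the graphframes extension. The edges and their weights, standing in for embedding similarities between commits, are hypothetical example data under assumed column names (src, dst, weight).

library(sparklyr)
library(graphframes)   # sparklyr extension to GraphFrames
library(dplyr)

# Connect to Spark (local mode here; on a cluster, point master at YARN, etc.)
sc <- spark_connect(master = "local")

# Hypothetical edges: pairs of commits whose neural embeddings are similar
edges <- copy_to(sc, tibble(
  src    = c("a", "a", "b"),
  dst    = c("b", "c", "c"),
  weight = c(0.9, 0.7, 0.8)
), "edges")

vertices <- copy_to(sc, tibble(id = c("a", "b", "c")), "vertices")

# Build a GraphFrame and rank commits by PageRank over the similarity graph
g <- gf_graphframe(vertices, edges)
gf_pagerank(g, max_iter = 10, reset_probability = 0.15) %>% gf_vertices()

spark_disconnect(sc)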


About Ali Zaidi

Ali is a data scientist on the language understanding team at Microsoft AI Research. He spends his days building tools that help researchers and engineers analyze large quantities of language data efficiently in the cloud and on clusters. Ali studied statistics and machine learning at the University of Toronto and Stanford University.