Deep Natural Language Processing with R and Apache Spark

Neural embeddings (Bengio et al. 2003; Olah 2014) aim to map words, tokens, and more general compositions of text to vector spaces, which makes them amenable to modeling, visualization, and inference. In this talk, we describe how to use neural embeddings of natural and programming languages with R and Spark. In particular, we’ll see how combining Spark’s distributed computing paradigm with R’s interactive programming and visualization capabilities makes exploration and inference of natural language processing models easy and efficient.

Building on the tidy data principles formalized in Wickham (2014), Silge and Robinson (2016) have provided the foundations for building natural language models with the tidytext package. We’ll describe how to build scalable pipelines within this framework to prototype text mining and neural embedding models in R, and then deploy them on Spark clusters using the sparklyr and RevoScaleR packages.

To illustrate the utility of this framework, we’ll walk through an example in which we train a sequence-to-sequence neural attention model for summarizing git commits, pull requests, and their associated messages (Zaidi 2017), and then deploy it on Spark clusters, where we can run efficient network analysis on the neural embeddings with a sparklyr extension to GraphFrames.

References

Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. “A Neural Probabilistic Language Model.” J. Mach. Learn. Res. 3 (March). JMLR.org: 1137-55. http://dl.acm.org/citation.cfm?id=944919.944966.

Olah, Christopher. 2014. “Deep Learning, NLP, and Representations.” https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. doi:10.21105/joss.00037.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1-23. doi:10.18637/jss.v059.i10.

Zaidi, Ali. 2017. “Summarizing Git Commits and Github Pull Requests Using Sequence to Sequence Neural Attention Models.” CS224N: Final Project,
Session hashtag: EUds2

About Ali Zaidi

Ali is a data scientist on the AI Research team at Microsoft. He spends his days trying to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike.