Ali is a data scientist in the language understanding team at Microsoft AI Research. He spends his days trying to make tools for researchers and engineers to analyze large quantities of language data efficiently in the cloud and on clusters. Ali studied statistics and machine learning at the University of Toronto and Stanford University.
What is the equivalent of the English phrase "say what?" in Japanese? In this talk, we provide an intuitive approach to learning distributed representations of phrases from multilingual data in a novel way using autoencoders and generative neural networks. Distributed representations of language are a very natural way to encode relationships between words and phrases. Such representations map discrete linguistic units to continuous vectors, and frequently encode useful semantics of the underlying language corpus, making them ubiquitous in NLP tasks. However, most machine translation tasks need large parallel corpora to learn semantic relationships between pairs of phrases across languages, which is problematic when aligned data is unavailable. This talk will examine distributed representations using neural embeddings, with particular focus on the use of generative models and autoencoders for learning shared word and phrase representations across languages. We show how we can speed up learning shared latent representations using Spark, and discuss techniques for optimizing phrase alignment using active learning. Session hashtag: #AISAIS16
R has become the de facto language for statisticians. There are nearly 10,000 packages to choose from for statistical inference, visualization, and machine learning. However, the base CRAN implementation of R is burdened by numerous scalability challenges: it is single-threaded and bounded by the memory of a single node. In this talk, I will summarize some recent advancements in the R APIs for Spark, and show how they can be combined with Microsoft R Server on Spark to create a scalable machine learning platform. In particular, I will show how an R user can create functional pipelines for Spark DataFrames and RevoScaleR XDFs (external data frames) to conduct Bayesian inference at scale, such as estimating cluster membership using Variational Consensus Monte Carlo in Gaussian mixture models, large-scale topic modeling with stochastic variational inference, and finally, Bayesian estimation of neural networks with Stochastic Gradient Hamiltonian Monte Carlo. All examples will be developed entirely in R, and I'll describe best practices for performance and reproducibility.
Apache Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, one of the most intriguing and performant families of algorithms, deep learning, remains difficult for many groups to deploy in production, both because of the need for tremendous compute resources and because of the inherent difficulty of tuning and configuring it. In this session, you'll discover how to deploy the Microsoft Cognitive Toolkit (CNTK) inside of Spark clusters on the Azure cloud platform. Learn about the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and techniques for distributed hyperparameter optimization. You'll also see a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing. Session hashtag: #SFds13
A growing number of data scientists use R as their primary language. While the SparkR API has made tremendous progress since release 1.6, with major advancements in Apache Spark 2.0 and 2.1, it can be difficult for traditional R programmers to embrace the Spark ecosystem. In this session, Zaidi will discuss the sparklyr package, which is a feature-rich and tidy interface for data science with Spark, and will show how it can be coupled with Microsoft R Server and extended with its lower-level API to become a full, first-class citizen of Spark. Learn how easy it is to go from single-threaded, memory-bound R functions to multi-threaded, multi-node, out-of-memory applications that can be deployed in a distributed cluster environment with minimal code changes. You'll also get best practices for reproducibility and performance by looking at a real-world case study of default risk classification and prediction entirely through R and Spark. Session hashtag: #SFeco1
Neural embeddings (Bengio et al. (2003), Olah (2014)) aim to map words, tokens, and general compositions of text to vector spaces, which makes them amenable to modeling, visualization, and inference. In this talk, we describe how to use neural embeddings of natural and programming languages with R and Spark. In particular, we'll see how the combination of a distributed computing paradigm in Spark with the interactive programming and visualization capabilities in R can make exploration and inference of natural language processing models easy and efficient. Building upon the tidy data principles formalized and efficiently crafted in Wickham (2014), Silge and Robinson (2016) have provided the foundations for modeling and crafting natural language models with the tidytext package. In this talk, we'll describe how we can build scalable pipelines within this framework to prototype text mining and neural embedding models in R, and then deploy them on Spark clusters using the sparklyr and RevoScaleR packages. To demonstrate the utility of this framework, we'll provide an example where we train a sequence-to-sequence neural attention model for summarizing git commits, pull requests, and their associated messages (Zaidi (2017)), and then deploy it on Spark clusters, where we can do efficient network analysis on the neural embeddings with a sparklyr extension to GraphFrames.