Ali Zaidi

Data Scientist, Microsoft

Ali is a data scientist in the AI Research team at Microsoft. He spends his day trying to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike.

SESSIONS

Nandeska? Say What? Learning, Visualizing and Understanding Multilingual Word Embeddings

What is the equivalent of the English phrase "say what?" in Japanese? In this talk, we provide an intuitive approach to learning distributed representations of such phrases from multilingual data, in a novel way, using autoencoders and generative neural networks. Distributed representations of language are a very natural way to encode relationships between words and phrases. Such representations map discrete linguistic units to continuous vectors, and frequently encode useful semantics of the underlying language corpus, making them ubiquitous in NLP tasks. However, most machine translation tasks need large parallel corpora to learn semantic relationships between phrase pairs across languages, which is problematic when aligned data is unavailable. This talk will examine distributed representations using neural embeddings, with particular focus on the use of generative models and autoencoders for learning shared word and phrase representations across languages. We show how to speed up learning shared latent representations using Spark, and discuss techniques for optimizing phrase alignment using active learning.

Scalable Bayesian Inference with Spark, SparkR, and Microsoft R Server

R has become the de facto language for statisticians. There are nearly 10,000 packages to choose from for statistical inference, visualization, and machine learning. However, the base CRAN implementation of R is burdened by numerous scalability challenges: it is single-threaded and bounded by the memory of a single node. In this talk, I will summarize some recent advancements in the R APIs for Spark, and show how they can be combined with Microsoft R Server on Spark to create a scalable machine learning platform. In particular, I will show how an R user can create functional pipelines for Spark DataFrames and RevoScaleR XDFs (external data frames) to conduct Bayesian inference at scale, such as estimating cluster membership in Gaussian mixture models using Variational Consensus Monte Carlo, large-scale topic modeling with stochastic variational inference, and finally, Bayesian estimation of neural networks with Stochastic Gradient Hamiltonian Monte Carlo. All examples will be developed entirely in R, and I'll describe best practices for performance and reproducibility.

Natural Language Processing with CNTK and Apache Spark

Apache Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, one of the most intriguing and performant families of algorithms, deep learning, remains difficult for many groups to deploy in production, both because of the need for tremendous compute resources and because of the inherent difficulty in tuning and configuring it. In this session, you'll discover how to deploy the Microsoft Cognitive Toolkit (CNTK) inside of Spark clusters on the Azure cloud platform. Learn about the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and techniques for distributed hyperparameter optimization. You'll also see a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing. Session hashtag: #SFds13

Extending the R API for Spark with sparklyr and Microsoft R Server

A growing number of data scientists use R as their primary language. While the SparkR API has made tremendous progress since release 1.6, with major advancements in Apache Spark 2.0 and 2.1, it can be difficult for traditional R programmers to embrace the Spark ecosystem. In this session, Zaidi will discuss the sparklyr package, which is a feature-rich and tidy interface for data science with Spark, and will show how it can be coupled with Microsoft R Server and extended with its lower-level API to become a full, first-class citizen of Spark. Learn how easy it is to go from single-threaded, memory-bound R functions to multi-threaded, multi-node, out-of-memory applications that can be deployed in a distributed cluster environment with minimal code changes. You'll also get best practices for reproducibility and performance by looking at a real-world case study of default risk classification and prediction entirely through R and Spark. Session hashtag: #SFeco1

Deep Natural Language Processing with R and Apache Spark

Neural embeddings (Bengio et al. (2003), Olah (2014)) aim to map words, tokens, and general compositions of text to vector spaces, which makes them amenable to modeling, visualization, and inference. In this talk, we describe how to apply neural embeddings of natural and programming languages with R and Spark. In particular, we'll see how the combination of a distributed computing paradigm in Spark with the interactive programming and visualization capabilities in R can make exploration and inference of natural language processing models easy and efficient. Building upon the tidy data principles formalized in Wickham (2014), Silge and Robinson (2016) have provided the foundations for modeling and crafting natural language models with the tidytext package. We'll describe how to build scalable pipelines within this framework to prototype text mining and neural embedding models in R, and then deploy them on Spark clusters using the sparklyr and RevoScaleR packages. To illustrate the utility of this framework, we'll provide an example in which we train a sequence-to-sequence neural attention model for summarizing git commits, pull requests, and their associated messages (Zaidi (2017)), and then deploy it on Spark clusters, where we can do efficient network analysis on the neural embeddings with a sparklyr extension to GraphFrames.

References

Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. "A Neural Probabilistic Language Model." J. Mach. Learn. Res. 3 (March). JMLR.org: 1137-55. http://dl.acm.org/citation.cfm?id=944919.944966.

Olah, Christopher. 2014. "Deep Learning, NLP, and Representations." https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.

Silge, Julia, and David Robinson. 2016. "Tidytext: Text Mining and Analysis Using Tidy Data Principles in R." JOSS 1 (3). The Open Journal. doi:10.21105/joss.00037.

Wickham, Hadley. 2014. "Tidy Data." Journal of Statistical Software 59 (10): 1-23. doi:10.18637/jss.v059.i10.

Zaidi, Ali. 2017. "Summarizing Git Commits and Github Pull Requests Using Sequence to Sequence Neural Attention Models." CS224N: Final Project.

Session hashtag: #EUds2