Kaarthik Sivashanmugam

Principal Engineering Manager, Microsoft

Kaarthik works in the AI Platform group at Microsoft and has expertise in building data and machine learning platforms. In his current role, he is building a distributed deep learning platform to unlock the full potential of the GPU cloud, data, and machine learning techniques in addressing complex AI challenges and enabling magical end-user experiences across various Microsoft services. Kaarthik is also involved in making the Azure Machine Learning service the best cloud platform for data scientists and ML engineers.

Past sessions

Summit Europe 2020 Accelerated Training of Transformer Models

November 18, 2020 04:00 PM PT

Language models help automate a wide range of natural language processing (NLP) tasks such as speech recognition, machine translation, text summarization and more. The Transformer architecture was introduced a few years ago and has significantly changed the NLP landscape since then. Transformer-based models are getting bigger and better to improve the state of the art on language understanding and generation tasks.

With that comes the demand not only to scale model training to billions of parameters but also to do it faster without compromising the accuracy of the trained model. We are part of the Azure AI Frameworks group, which has extended the open-source framework ONNX Runtime (ORT) to accelerate training of very large transformer models and partnered with many internal and external customers to use ORT as the backend for their PyTorch or TensorFlow implementations to pre-train and fine-tune their language models.

These models are used in production in a variety of Microsoft products, such as Office (for suggested replies, inside look, etc.), Visual Studio (for code completion) and Bing (for advertising). We have released public GitHub repos with pre-training and fine-tuning recipes that use ORT acceleration, covering models like BERT, GPT-2 and Microsoft Turing. In this talk, we will cover the memory usage and execution optimizations available in ORT and how they helped accelerate model training when ORT was used as the backend to PyTorch or TensorFlow, providing up to a 45% speedup. We will briefly look at the best practices and innovations in model training, such as mixed precision training, distributed training parallelism modes, gradient checkpointing, AdaSum and DeepSpeed ZeRO, that are natively supported in ORT.
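As a generic illustration of one of these techniques, the sketch below shows mixed precision training in plain PyTorch using torch.cuda.amp. It is not the ORT-native implementation covered in the talk; the model, batch shapes and hyperparameters are placeholders.

```python
# Generic sketch of mixed precision training in PyTorch (torch.cuda.amp).
# Not the ORT-native path discussed in the talk; model and settings are placeholders.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # autocast/GradScaler only take effect on GPU

model = torch.nn.Sequential(
    torch.nn.Linear(768, 768), torch.nn.ReLU(), torch.nn.Linear(768, 2)
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for step in range(10):
    inputs = torch.randn(8, 768, device=device)
    labels = torch.randint(0, 2, (8,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = loss_fn(model(inputs), labels)  # forward pass in reduced precision on GPU
    scaler.scale(loss).backward()              # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```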

We will also discuss the changes needed in the training code to enable ORT acceleration in PyTorch implementations, so that you can leverage ORT for your own large-scale Transformer model training.
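As a rough sketch of what such a change can look like, the example below wraps an existing PyTorch model with ORTModule from the torch-ort package; the model, data and hyperparameters are placeholders, and the exact interface covered in the session may differ from this wrapper-based API.

```python
# Minimal sketch: enabling ORT acceleration for an existing PyTorch model.
# Assumes the torch-ort package is installed (pip install torch-ort); the model
# and training step are placeholders, and the API shown in the talk may differ.
import torch
from torch_ort import ORTModule

class ToyModel(torch.nn.Module):
    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        self.encoder = torch.nn.Linear(hidden, hidden)
        self.classifier = torch.nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))

model = ORTModule(ToyModel())  # the key change: forward/backward now run through ORT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

# The rest of the training loop is unchanged PyTorch code.
inputs = torch.randn(8, 768)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
optimizer.step()
```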

Speakers: Kaarthik Sivashanmugam and Sherlock Huang

Summit 2019 Infrastructure for Deep Learning in Apache Spark

April 24, 2019 05:00 PM PT

In machine learning projects, the preparation of large datasets is a key phase that can be complex and expensive. It was traditionally done by data engineers before the handover to data scientists or ML engineers, with the two groups operating in different environments due to differences in the tools, frameworks and runtimes required in each phase. Spark's support for different types of workloads brought data engineering closer to the downstream activities, like machine learning, that depended on the data.

Unifying data acquisition, preprocessing, model training and batch inferencing under a single platform enabled by Spark not only provided a seamless experience between the different phases and helped accelerate the end-to-end ML lifecycle, but also lowered the TCO of building and managing the infrastructure to cover those phases. With that, the needs of a shared infrastructure expanded to include specialized hardware like GPUs and to support deep learning workloads as well. Spark can make effective use of such infrastructure, as it integrates with popular deep learning frameworks and supports accelerating deep learning jobs with GPUs.
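As a sketch of what this integration can look like, the example below runs batch inference with a deep learning model inside Spark via a pandas UDF, so scoring runs on the same cluster (and on GPUs, when the executors have them) as the data preparation. It assumes PySpark 3.x and PyTorch are available on the executors; the model and column names are illustrative placeholders.

```python
# Sketch: deep learning batch inference inside Spark using a pandas UDF,
# so data preparation and scoring share one cluster. Model and columns
# are illustrative placeholders.
import numpy as np
import pandas as pd
import torch
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

spark = SparkSession.builder.appName("dl-batch-inference").getOrCreate()

@pandas_udf(FloatType())
def score(features: pd.Series) -> pd.Series:
    # In a real job the trained model would be loaded once per executor
    # (e.g. from a broadcast variable); it is created inline here for brevity.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(4, 1).to(device).eval()
    x = torch.from_numpy(np.stack(features.to_list()).astype("float32")).to(device)
    with torch.no_grad():
        return pd.Series(model(x).squeeze(1).cpu().numpy())

df = spark.createDataFrame(
    [([0.1, 0.2, 0.3, 0.4],), ([0.5, 0.6, 0.7, 0.8],)],
    ["features"],
)
df.withColumn("score", score("features")).show()
```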

In this talk, we share learnings and experiences from supporting different types of workloads in shared clusters equipped for both deep learning and data engineering.

We will cover the following topics:

  • Considerations for sharing the infrastructure for big data and deep learning in Spark
  • Deep learning in Spark in clusters with and without GPUs
  • Differences between distributed data processing and distributed machine learning
  • Multitenancy and isolation in shared infrastructure