Sherlock Huang

Principal Software Engineer, Microsoft

Past sessions

Summit Europe 2020: Accelerated Training of Transformer Models

November 18, 2020 04:00 PM PT

Language models help automate a wide range of natural language processing (NLP) tasks, such as speech recognition, machine translation, and text summarization. The Transformer architecture was introduced a few years ago and has significantly changed the NLP landscape since then. Transformer-based models keep getting bigger and better, improving the state of the art on language understanding and generation tasks.

With that comes the demand not only to scale model training to billions of parameters but also to do it faster, without compromising the accuracy of the trained model. We are part of the Azure AI Frameworks group, which has extended the open-source framework ONNX Runtime (ORT) to accelerate training of very large Transformer models. We have partnered with many internal and external customers to use ORT as the backend for their PyTorch or TensorFlow implementations to pre-train and fine-tune their language models.

These models are used in production in a variety of Microsoft products, such as Office (suggested replies, Inside Look), Visual Studio (code completion), and Bing (advertising). We have released public GitHub repos with pre-training and fine-tuning recipes that use ORT acceleration, covering models like BERT, GPT-2, and Microsoft Turing. In this talk we will cover the memory-usage and execution optimizations available in ORT, and how they accelerated model training when ORT was used as the backend for PyTorch or TensorFlow, providing up to a 45% speedup. We will also briefly look at the best practices and innovations in model training that are natively supported in ORT, such as mixed-precision training, distributed-training parallelism modes, gradient checkpointing, AdaSum, and DeepSpeed ZeRO.
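To make the first of those techniques concrete, below is a minimal sketch of mixed-precision training in plain PyTorch, the same idea the talk describes as natively supported in ORT. The toy model, data, and hyperparameters are illustrative assumptions, not from the talk; the sketch uses CPU bfloat16 autocast so it runs without a GPU.

```python
import torch
import torch.nn as nn

# Toy model and synthetic batch; purely illustrative.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 16), torch.randn(8, 1)

for _ in range(3):
    optimizer.zero_grad()
    # Selected ops in the forward pass run in lower precision (bfloat16),
    # while the master weights and the optimizer update stay in float32.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```

On CUDA devices the same pattern typically uses float16 autocast together with gradient scaling to avoid underflow; frameworks like ORT apply the equivalent casts and loss scaling automatically.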

We will also walk through the changes needed in the training code to enable ORT acceleration in PyTorch implementations, so that you can leverage ORT for your own large-scale Transformer model training.

Speakers: Kaarthik Sivashanmugam and Sherlock Huang