The advent of pre-trained language models such as Google’s BERT promises a high performance transfer learning (HPTL) paradigm for many natural language understanding tasks. One such task is email classification. Given the complexity of content and context of sales engagement, lack of standardized large corpus and benchmarks, limited labeled examples and heterogenous context of intent, this real-world use case poses both a challenge and an opportunity for adopting an HPTL approach. This talk presents an experimental investigation to evaluate transfer learning with pre-trained language models and embeddings for classifying sales engagement emails arising from digital sales engagement platforms (e.g., Outreach.io).
We will present our findings on evaluating BERT, ELMo, Flair and GloVe embeddings with both feature-based and fine-tuning based transfer learning implementation strategies and their scalability on a GPU cluster with progressively increasing number of labeled samples. Databricks’ MLFlow was used to track hundreds of experiments with different parameters, metrics and models (tensorflow, pytorch etc.). While in this talk we focus on email classification task, the approach described is generic and can be used to evaluate applicability of HPTL to other machine learnings tasks. We hope our findings will help practitioners better understand capabilities and limitations of transfer learning and how to implement transfer learning at scale with Databricks for their scenarios.
Yong Liu is a Principal Data Scientist at Outreach.io, working on machine learning, NLP and data science solution to solve problems arising from the sales engagement platform. Previously, he was with Maana Inc. and Microsoft. Prior to joining Microsoft, he was a Principal Investigator and Senior Research Scientist at the National Center for Supercomputing Applications (NCSA), where he led R&D projects funded by National Science Foundation and Microsoft Research. Yong holds a PhD from the University of Illinois at Urbana-Champaign.
Corey Zumar is a software engineer at Databricks, where he’s working on machine learning infrastructure and APIs for model management and production deployment. Corey is also an active contributor to MLflow. He holds a master’s degree in computer science from UC Berkeley. At UC Berkeley’s RISELab, he was one of the lead developers of Clipper, an open source project and research effort focused on high-performance model serving.