Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS Sagemaker for Enterprise AI Scenarios


Transformer-based pretrained language models such as BERT, XLNet, RoBERTa and ALBERT have significantly advanced the state of the art of NLP and opened doors for solving practical business problems with high-performance transfer learning. However, operationalizing these models with production-quality continuous integration/delivery (CI/CD) end-to-end pipelines that cover the full machine learning life cycle stages of train, test, deploy and serve, while managing associated data and code repositories, is still a challenging task. In this presentation, we will demonstrate how we use MLflow and AWS Sagemaker to productionize deep transformer-based NLP models for guided sales engagement scenarios at Outreach, the leading sales engagement platform.

We will share our experiences and lessons learned in the following areas:

  1. A publishing/consuming framework to effectively manage and coordinate data, models and artifacts (e.g., vocabulary file) at different machine learning stages
  2. A new MLflow model flavor that supports deep transformer models for logging and loading the models at different stages
  3. A design pattern to decouple model logic from deployment configurations and model customizations for a production scenario using MLProject entry points: train, test, wrap, deploy.
  4. A CI/CD pipeline that provides continuous integration and delivery of models into a Sagemaker endpoint to serve the production usage

We hope our experiences will be of great interest to the broad business community actively working on enterprise AI scenarios and digital transformation.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Welcome to our session. My name is Yong Liu. Together with my colleague Andrew Brooks, we are very excited to share our experience developing continuous delivery of deep, transformer-based NLP models using MLflow and AWS Sagemaker for enterprise AI scenarios.

So here is our presentation outline with four sections. Let’s get started with the first section with some introduction and background.

So you may or may not have heard of Outreach, the company we work for. Outreach is the number one sales engagement platform, serving more than 4,000 customers and growing. These customers include many well-known startups and multinational companies. So what is a sales engagement platform?

Sales Engagement Platform (SEP)

A sales engagement platform encodes and automates sales activities, such as emails, phone calls, and meetings, into workflows. For example, the diagram on the left shows a workflow where you send an email and place a phone call on day one, schedule a LinkedIn message on day three, and then send another email and make another phone call on day five. With such automation, sales reps' performance and efficiency can improve dramatically, up to 10 times, by doing more effective one-on-one personalized outreach.

In addition to automation, we are also adding intelligence into the sales engagement platform. That’s where machine learning NLP and AI come to play.

ML/NLP/AI Roles in Enterprise Sales Scenarios

So how does machine learning and NLP help? By using our product, sales reps generate a lot of data, such as emails, call scripts, and engagement logs. We then leverage machine learning and NLP to perform continuous learning from this data and combine it with knowledge to provide prediction, recommendation, and guidance for the continuous success of reps. This becomes a flywheel, shown on the left. The reason for continuous learning is that the sales process changes for various reasons. We have already observed that COVID-19 changed how sales reps use content, and that's why it is important to enable continuous learning and guidance. One particular use case we will highlight in this talk is guided engagement, which we will discuss later. However, before we can really enjoy the benefits of machine learning, NLP, and AI, we have some barriers to overcome. Now I'll hand off to Andrew to continue the discussion.

– So now, to set the stage and motivate what we've implemented and why, we'll discuss some of the implementation challenges we've faced here at Outreach.

Implementation Challenges: the Digital Divide

While we’re on this discussion within the context of our experience, we expect many of these challenges to be shared across other enterprise machine learning teams.

Challenge 1: Dev-Prod Divide

So first, challenge one: the dev-prod divide. If you've ever felt like a model developer throwing a model over the wall to an ops team, never to see it again, or an ops team catching a black-box model with unclear requirements and interfaces, this cartoon might speak to you. When model developers are isolated, when they can't see or use code from the prod environment, they can't test on live data, the data streams that actually feed the model in the product or application. Why is this a problem? It often leads to mis-specified model pipelines, which produce bad predictions, which in turn produce complaints from users or even paying customers. And that pain is compounded even more when model developers can't reproduce those reported bugs or issues. Remediation can involve manually diffing a dev and a prod pipeline to understand root cause. We've been there. It's costly, it's inefficient, it's not fun, and it's not necessary. Lastly, it's also wasteful when production-grade machine learning tooling developed in prod cannot be used for future model development. This is the scenario where a V1 model has been trained, productionized, and shipped, but V2 model training is restricted to the early iterations of notebook and ad hoc code used to develop the V1 model.

Challenge 2: Dev-Prod Differences

Challenge two: dev-prod differences. These are the scenarios where differences between the dev and prod pipelines are inevitable or sometimes desirable, the good differences. One common difference is that the data sources for model training and for model scoring in production are often different. Data used for training typically comes from a persistent data store, analytics data used internally within an organization to build models and reports without the danger of directly modifying customer-facing data or putting load on production applications. Prod data for scoring is often streamed, not static; it is customer facing and might have been off limits to model developers during training. Often these data sources require different preprocessing pipelines. A second difference is the inclusion of product-specific or business logic that is desired for prod model scoring but not during training. For example, the scoring pipeline in prod might want to suppress predictions where the model is not confident, but no such filter is desirable, or even exists, for the training pipeline.

Challenge 3: Arbitrary Uniqueness

So challenge three: arbitrary uniqueness. Without a framework codifying common design patterns, components have a tendency to be individually great and powerful but collectively suboptimal, and even counterproductive when connected to other components in a pipeline or system. This is the scenario where the whole is not greater than the sum of its parts, despite the individual uniqueness and greatness of those parts. This is probably occurring when deploying each new model feels like a special case, reinventing the wheel for components that mostly already exist.

Not only does this involve a lot of extra development, but it often produces pipelines that are not self-documenting. For example, if the gates and deploy mechanisms of a pipeline are not consistently defined, it's unclear how to even run the pipeline. The ability to reuse across projects and models, and to integrate within a bigger system, is limited. Naturally, pipeline maintenance and extension become painful and inefficient, even more so when onboarding new engineers or developers.

Challenge 4: Provenance

So the last challenge, challenge four: provenance, specifically tying models back to their source code and source data. Why do we need this? If we don't know what's running in prod, we can't reproduce issues and bugs reported by users, as we discussed in challenge one.

A second negative effect is that model pipeline changes might make teams grimace with fear rather than excitement about shipping improvements. This is often the case when we're not confident that a mechanism exists to consistently determine exactly what's running in prod, how it got there, how to reproduce it, or how to promote a new model to replace it. Lack of provenance can also compromise historical and temporal analyses that use model predictions. If released models aren't versioned, undocumented model changes can be mistaken for real-world behavior changes, compromising benchmarking and historical analyses.

Full Life Cycle Implementation

So given these pain points and challenges, we’ll discuss how we overcome some of them in the context of a real use case at Outreach. Afterwards, we’ll also share some of the challenges we continue to face and thoughts for addressing those in future work.

A Use Case: Guided Engagement

So the use case we'll walk through is the Outreach guided engagement feature. It's an inbox-based intelligent experience powered by an intent classification model under the hood. When sales reps receive replies from their prospects, existing or future customers, Outreach predicts and displays the intent of that prospect's email: perhaps positive, the prospect is willing to meet, or, in this case, objection, the prospect already has a solution. Based on the predicted intent, relevant content is recommended to the sales rep. For simplicity, our talk will focus on just the intent prediction component, text classification, not the content recommendation component.

Six Stages

While we discuss our use case and pain points, we'll reference where we are in the full machine learning model life cycle.

In this talk, we'll focus on the middle four stages, starting with model dev. This is where we run many experiments offline to quickly iterate and develop the model object of the winning model that we want to ship. In pre-prod, we mature and package that winning model logic into software that gets published for use in production. For our use case, this publishes a Docker image and trained model artifacts. The last two stages we'll talk about, model staging and model prod, are where the model is hosted, exposing an endpoint for Outreach, our product application, to call.

Model Development and Offline Experimentation: MLflow Tracking Server to Log All Offline Experiments

So starting with the model dev phase for our use case: most of our development was in Databricks or Jupyter notebooks, in code repositories used only for developing model logic offline and running offline experiments. Even though we didn't intend to ship this code, we did leverage MLflow tracking to tie experiments to results. This provenance prevented unnecessarily repeating experiments and provided context and baselines for the winning model. Our model development often includes many modeling frameworks and techniques, each with different APIs. For this particular model, we explored SVMs using Scikit-Learn, fastText, and Flair. Ultimately we chose the Hugging Face Transformers library for its unified API to state-of-the-art deep transformer architectures and pretrained language models. State of the art is a quickly moving target in this domain, so a project with an active community quickly closing the gap between published research and implementation was important to us.

Creating a transformer flavor model

While strong in momentum, the Hugging Face Transformers library is a relatively young project and not yet a native MLflow flavor. We avoided arbitrary uniqueness by extending MLflow and writing our own MLflow flavor that lets us plug into the rest of the MLflow framework. So what does that mean? It means we wrote a tiny wrapper class, shown on the left, that maps the Hugging Face Transformers library, which itself wraps a multitude of powerful architectures and models, to the standard MLflow Models API.
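To make the idea concrete, here is a minimal, simplified sketch of the shape such a flavor takes: a module exposing `save_model`/`load_model` with the standard MLflow flavor conventions, plus a small wrapper class. All names are illustrative, and pickle stands in for the `save_pretrained`/`from_pretrained` calls a real Transformers flavor would make:

```python
import os
import pickle

# Hypothetical sketch of a custom MLflow "transformer" flavor: a flavor
# module exposes save_model / load_model following MLflow's conventions.
# A real implementation would call the Transformers library's
# save_pretrained / from_pretrained here instead of pickle.

def save_model(model, path):
    """Serialize a wrapped transformer classifier to a directory."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "model.pkl"), "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a previously saved model back from a directory."""
    with open(os.path.join(path, "model.pkl"), "rb") as f:
        return pickle.load(f)

class TransformerClassifier:
    """Wrapper exposing a transformer model through a predict() interface."""

    def __init__(self, labels):
        self.labels = labels

    def predict(self, texts):
        # Placeholder logic; the real class runs tokenizer + model forward
        # pass and maps logits to intent labels.
        return [self.labels[0] for _ in texts]
```

The payoff is that every consumer saves and loads the model the same way, regardless of the framework hidden behind the wrapper.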

And so what does this get us? Among other things, it buys us a standard mechanism for model serialization, both saving and loading. We also wrote a transformer classifier class that's Scikit-Learn-pipeline compatible, so we can chain our transformer model with pre- and post-processing steps. And why do we need that? We need this for scenarios like those in challenge two, where the train and prod pipelines need to be different because the data sources are different, or there's business logic desired in production but not in training. One example in our scenario is filtering email auto-replies from production scoring but not during model training.
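As an illustration of that chaining idea, here is a dependency-free sketch; the real code composes a `sklearn.pipeline.Pipeline`, and the filter, classifier, and marker strings below are invented for the example:

```python
# Sketch of chaining a pre-processing filter with the classifier, as done
# for prod scoring but not training. A tiny stand-in replaces
# sklearn.pipeline.Pipeline to keep the example dependency-free.

AUTO_REPLY_MARKERS = ("out of office", "automatic reply")

def filter_auto_replies(texts):
    """Pre-processing step: drop email auto-replies before scoring."""
    return [t for t in texts if not t.lower().startswith(AUTO_REPLY_MARKERS)]

class KeywordIntentClassifier:
    """Stand-in for the transformer classifier's predict() interface."""

    def predict(self, texts):
        return ["positive" if "meet" in t.lower() else "objection"
                for t in texts]

class ScoringPipeline:
    """Minimal two-step pipeline: pre-processing, then classification."""

    def __init__(self, steps):
        self.pre, self.model = steps

    def predict(self, texts):
        return self.model.predict(self.pre(texts))

pipeline = ScoringPipeline([filter_auto_replies, KeywordIntentClassifier()])
```

The training pipeline would simply omit the filter step, leaving the classifier's logic untouched.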

Saving and Loading Transformer Artifacts

And here we have an example of a saved transformer model in the MLflow tracking server and its associated artifacts, shown on the left. The code snippet on the right shows the couple of lines needed to log or load the model. This bypasses the arbitrary uniqueness involved in manually dealing with Python's pickle or other serialization protocols, which can be finicky and pass the burden to the consumers of the model.

Productionizing Code and Git Repos

In pre-prod, we intentionally rewrite and refactor the code from the dev notebooks into software that will actually run in production. We adopted MLflow's MLproject pattern. MLproject is a fairly lightweight layer that centralizes entry points and standardizes their configuration and environment definition management. Again, a cheap way to avoid the pain associated with arbitrary uniqueness by providing a self-documenting framework for the pipeline.
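A hypothetical MLproject file with the four entry points named in this talk might look like the following; the script names and parameters are our own illustration, not Outreach's actual project:

```yaml
name: intent-classifier

conda_env: conda.yaml

entry_points:
  train:
    parameters:
      data_path: {type: string}
      epochs: {type: float, default: 3}
    command: "python train.py --data-path {data_path} --epochs {epochs}"
  test:
    parameters:
      run_id: {type: string}
    command: "python test.py --run-id {run_id}"
  wrap:
    parameters:
      run_id: {type: string}
    command: "python wrap.py --run-id {run_id}"
  deploy:
    parameters:
      model_uri: {type: string}
      stage: {type: string, default: staging}
    command: "python deploy.py --model-uri {model_uri} --stage {stage}"
```

Each entry point can then be invoked uniformly, locally or against a remote repository and cluster, for example `mlflow run <git-repo-uri> -e train -P data_path=<path>`.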

From a workflow perspective, we found that MLproject's flexibility to run remote code and execute on remote clusters also accelerated our development and tightened up some of our testing and code reviews, such as reproducing model results.

By referencing the code to run via GitHub release tags, shown in red, we're able to buy ourselves improved provenance, tying the source code directly to the model artifacts and results in MLflow tracking. From a workflow perspective, this also avoids the hassle of manually cloning and running local code.

Referencing the remote execution environments, shown in green, allowed us to develop code outside of Databricks notebooks, in our IDEs of choice, while also leveraging the power of the Databricks runtime for execution on powerful GPU-based clusters.

Models: Trained, Wrapped, Private-Wheeled

To support deployment-specific logic and environments, we create three progressively evolved models.

– Now, suppose we have a production-grade trained NLP model for intent classification of emails. Does that mean we can deploy it to production? Not so soon, actually. That's because there are unavoidable differences in logic in the deployment environment, which we discussed in the challenges section. That's why we create three progressively evolved models for final deployment in a hosting environment, in our case Sagemaker. First, we create a fine-tuned trained transformer classifier. Then we wrap that same classifier with pre-processing and post-processing steps, which we call pre-score and post-score filters, and this entire wrapped pipeline becomes a Scikit-Learn pipeline. This is the pipeline shown in the middle of the diagram.

The reason we want a pre-score filter is that we want additional logic, such as whether to parse the email to get only the current reply message body or the full email thread. Or, if the email is too big, we may decide not to score it at all. Or we may want a caching mechanism: when the email content is exactly the same, we can just return the cached prediction.
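A toy sketch of such pre-score filters, with invented names, size threshold, and reply-splitting heuristic:

```python
import hashlib

# Illustrative sketch of the pre-score filters described above: extract the
# current reply, skip oversized emails, and cache predictions by content
# hash. Names and thresholds are assumptions, not Outreach's actual values.

MAX_CHARS = 10_000
_prediction_cache = {}

def extract_reply(email_body):
    """Keep only the current reply, not the quoted thread below it."""
    return email_body.split("\nOn ", 1)[0].strip()

def pre_score(email_body, score_fn):
    """Apply pre-score filters, then delegate to the model's score_fn."""
    text = extract_reply(email_body)
    if len(text) > MAX_CHARS:
        return None                      # too big: skip scoring entirely
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _prediction_cache:     # identical content: reuse result
        _prediction_cache[key] = score_fn(text)
    return _prediction_cache[key]
```
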

For the post-processing part, the post-score filters, we can also add model metadata to the response so that we can track provenance from the caller side. Note that there is no model logic change in the classifier itself, but having this second model pipeline is much more flexible. Lastly, in the production environment, we don't want the model to access our private GitHub, because accessing a private GitHub requires either a GitHub token or an access key, which are security concerns in an enterprise production environment. So we create a third model, which packages all private Python dependencies into wheels and bakes them into the Docker image so that at deployment time the model can reference them without accessing the private GitHub.
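A minimal sketch of a post-score filter that attaches provenance metadata, combined with the low-confidence suppression mentioned in challenge two; all field names and values here are illustrative:

```python
# Sketch of a post-score filter: attach model provenance metadata to each
# response and suppress low-confidence predictions. Field names, version
# strings, and the threshold are invented for illustration.

MODEL_METADATA = {
    "model_name": "intent-classifier",
    "model_version": "2",      # e.g. the MLflow Model Registry version
    "git_ref": "v1.2.0",       # e.g. the release tag the model was built from
}

def post_score(prediction, confidence, threshold=0.5):
    """Build the response payload returned to the calling application."""
    response = {
        "intent": prediction if confidence >= threshold else None,
        "confidence": confidence,
    }
    response.update(MODEL_METADATA)   # lets the caller track provenance
    return response
```
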

Continuous Integration through CircleCI

So now we are ready to fully automate the deployment through a CI/CD tool. For the CI, or continuous integration, part we use CircleCI. The CircleCI pipeline not only does unit testing and style enforcement but also runs the entire chain, train, test, wrap, and deploy, all the way to a Sagemaker endpoint in staging, using a subset of the training data. This safeguards every code check-in. Note that several steps of the CircleCI pipeline reuse the same MLproject entry points we discussed earlier.

This also allows us to further bridge the dev-prod divide, because we can use the same CI pipeline to run experimental code or model changes without reinventing wheels.

Now for CD, continuous delivery and rollback, we use Concourse at Outreach. This pipeline has two well-defined human gates. First, for model building, a designated person needs to kick off the entire pipeline. Once it passes a regression test, ensuring that we are not getting a worse model than the previous one, a second human gate requires a person to promote the model to a production endpoint. In the last step shown here, we can deploy to Sagemaker in the US East and US West regions.
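The regression check behind the first gate can be as simple as comparing the candidate's offline metric against the production model's; the metric choice and tolerance below are assumptions:

```python
# Sketch of the regression test gating promotion: only allow a candidate
# model that is not worse than the current production model. Metric (F1)
# and tolerance are illustrative assumptions.

def passes_regression_test(candidate_f1, production_f1, tolerance=0.01):
    """Allow promotion when the candidate is within tolerance of prod."""
    return candidate_f1 >= production_f1 - tolerance
```

Only after this check passes does the second human gate decide whether to promote the model to the production endpoint.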

Model Registry to Track Deployed Model Provenance

So in the CI/CD automation, we not only log the model but also register it using the MLflow Model Registry. From the Model Registry, we can clearly see which version of the model is in production and which is in staging. And if you are curious about more details of a model, you can just click the version link and find the model's provenance information.

So now we have covered the full life cycle implementation. How well did we address the four challenges we talked about at the beginning? We feel we did pretty well across all four stages in terms of provenance tracking.

Outside of the model dev stage, we also did well in overcoming the other three challenges. In particular, we feel we did very well in bridging the dev-prod divide and avoiding arbitrary uniqueness during model pre-prod, where we wrote the production code and published the transformer flavor model, making the model code and the deployment process reusable and repeatable. However, one area we are not fully satisfied with is model staging: we did not test with production real-time streaming traffic, which could have been done through some A/B testing mechanism before we promoted the model to production. That's something we will address in the future. In conclusion, we highlighted four typical enterprise AI implementation challenges and how we solved them with MLflow, Sagemaker, and CI/CD tools. Our intent classification model for guided engagement has been deployed to production and operates using this framework.

For next steps, in addition to the staging A/B testing we mentioned, we are also addressing the following. First, incorporating an in-production model feedback loop into the annotation and model development cycle. Second, further improving the annotation pipeline to enable seamless human-in-the-loop active learning and model validation. Finally, we would like to thank everyone in the data science group at Outreach who contributed to and supported this project. If you are interested in more details about our experience or about the Outreach platform, please contact us at the email address shown on the screen.

Thank you very much.

About Yong Liu

Yong Liu is a Principal Data Scientist at Outreach, working on machine learning, NLP and data science solutions to solve problems arising from the sales engagement platform. Previously, he was with Maana Inc. and Microsoft. Prior to joining Microsoft, he was a Principal Investigator and Senior Research Scientist at the National Center for Supercomputing Applications (NCSA), where he led R&D projects funded by the National Science Foundation and Microsoft Research. Yong holds a PhD from the University of Illinois at Urbana-Champaign.

Andrew Brooks
About Andrew Brooks

Andrew is a Senior Data Scientist at Outreach, where he focuses on developing and deploying NLP systems to provide intelligence and automation to sales workflows. Previously, Andrew was a Data Scientist at Capital One, working on speech recognition and NLP, and at Elder Research, consulting in domains spanning government, fraud, housing, tech and film. Before discovering machine learning, Andrew was an aspiring economist at the Federal Reserve Board, forecasting macro trends in emerging markets. Andrew holds an MS in Mathematics and Statistics from Georgetown University and a BS & BA in Economics and International Studies from American University.