This talk will cover how we built and productionized automated machine learning pipelines at Salesforce, moving from heuristics to automated retraining using technologies including but not limited to Scala, Python, Apache Spark, Docker, and SageMaker for training and serving. We will walk through the generally applicable steps of data prep, feature engineering, training, evaluation and comparison, and continuous model training, including data feedback loops, in containerized environments with SageMaker. We will talk about our deployment and validation approach. Finally, we’ll draw lessons from iteratively building an enterprise ML product. Attendees will learn about the mental models for building end-to-end production ML pipelines and GA-ready products.
Speaker: Aditya Sakhuja
– Hi everyone. My name is Aditya. I’m an Engineering Lead at Salesforce, working with an awesome team of scientists and engineers. Today, we’re going to talk about our journey to automated training at scale for a recommender system.
Let’s look at the agenda. We are going to talk about the goal and then the scenario, which was the motivation for building the system. We’ll talk about our approach and the metrics we kept in mind. Then we will deep dive into the system architecture, covering the feature engineering, model training, and serving aspects of the system. Next we’ll talk about the evolution: how we transitioned from a simplistic system to a more advanced, robust system. Towards the end, we will touch on our deployment strategy and how we do rollbacks for machine learning models. And finally, we will talk about the challenges and our takeaways.
All right. So what’s the goal of the system? The business goal is to assist agents in providing solutions to customer problems. And what’s the current scenario? Well, agents rely on traditional search results to find relevant answers to the questions customers have. And these questions are generally very long and time sensitive.
So what’s the approach we are taking to tackle the scenario we just talked about? Well, we have a recommender system with two layers: the candidate generation layer and the ranking layer. The candidate generation layer takes in a large corpus of knowledge articles, which are provided by the customer organization, and generates a smaller set of articles, which the ranking layer can use to provide the most relevant ranked results back to the agent. That in turn assists the agent in providing an answer back to the customer.
The business metrics are important here as well, along with the approach. The agent time to resolution is a key business metric. We want the case to be resolved as quickly as possible by the agent.
And that is where the knowledge article recommendations are helping the customer. So we want to track this business metric. At the same time, we also want to ensure that if there is no resolution, the agent does not spend too much time on the case. A scenario could be where the case is delegated to the next tier. And that would be the target for the current agent. Well, in that case also, we want to make sure the time spent there is low. Thirdly, we want to track the attach rate. The case and the article are kind of like the question and the answer, and if the recommendation is relevant, then the attach rate should also go up. Well, that’s not always true, but it’s still a very good measure for us to track.
Besides these business metrics, we also want to track the count of recommendations served. That is more of a signal of the scale of the system. Then there are the Monthly Active Orgs and the Monthly Active Users, which are other key metrics. And finally, the serving latency gives a sense of how quickly we are responding as a system, as an ML system specifically.
Let’s deep dive into the system architecture. So let’s go one level deeper inside the two layers we just talked about: the candidate generation and the ranking model. In the candidate generation, the first step is to convert the long, complex user question into a more meaningful formulated query, which we can send to our candidate generator. Right? For us, the candidate generator is a search system. It’s a very complex system in itself, and how we generate the candidates there could be a very long talk, so we won’t deep dive into that for now. For now it’s good to understand that we extract the key terms, the nouns, the parts of speech, which give a good sense of what the intention of the question is. And then we formulate multiple queries for the IR or search system and get back a capped number of results.
And that is what we call the generated candidates. Once that happens, we go on to the ranking layer. The ranking layer has precomputed features, which we have offline jobs for. So we have precomputed features like the document frequency: a lot of features around the documents and a lot of features around the incoming queries from the past. We have those kinds of features already sitting in a data store. We use them, and then we do feature generation for every candidate with the question, so we have a pairwise feature generation. Once that happens, we pass the features through our trained model. The model gives back a score, which is used for ranking the results. And once we rank those results, we can pick the top tier by score.
All right. So now let’s look at the serving workflow. In the bottom right corner you can see the service agent, which is the entry point to our workflow system. The case is created by the service agent on behalf of the customer. The case represents the user question. Once that case is created it is fanned out, and we have a message queue to scale out the whole system, because there are multiple cases getting created [murmur] by multiple agents. We want to make sure the system is scalable, so we have a queue here. The handler on the other side of the queue picks up every case and then invokes our recommendation system. As we discussed earlier, there are two parts to it: the candidate generation and the ranking. In the candidate generation, you can see arrow number four here picking up the results from the index, which is the search index. It’s a huge system in itself and a different topic altogether; we can deep dive some other time. So we get the candidate generation happening in step four. After that, the shortlisted candidates are sent to the ranking layer. And as we mentioned earlier, the ranking layer will rank the candidates based on the features which were computed earlier.
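As a rough illustration of the ranking step described above (pairwise feature generation, model scoring, then picking the top results), here is a minimal Python sketch. It is not the production system: the two features, the linear scoring, and the names `Article`, `pairwise_features`, `rank`, `doc_freq`, and `top_k` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Article:
    article_id: str
    text: str


def pairwise_features(question: str, article: Article,
                      doc_freq: Dict[str, int]) -> List[float]:
    """Build features for one (question, article) pair.

    doc_freq is a stand-in for precomputed statistics read from a data store.
    Both features here are hypothetical examples.
    """
    q_terms = set(question.lower().split())
    a_terms = set(article.text.lower().split())
    shared = q_terms & a_terms
    # Feature 1: fraction of question terms that appear in the article.
    overlap = len(shared) / len(q_terms) if q_terms else 0.0
    # Feature 2: shared terms weighted so that rarer terms count more.
    rarity = sum(1.0 / (1 + doc_freq.get(term, 0)) for term in shared)
    return [overlap, rarity]


def rank(question: str, candidates: List[Article], doc_freq: Dict[str, int],
         weights: List[float], top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every candidate with a linear model and keep the best top_k."""
    scored = []
    for article in candidates:
        feats = pairwise_features(question, article, doc_freq)
        score = sum(w * f for w, f in zip(weights, feats))
        scored.append((article.article_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A linear scorer is used here only because it keeps the sketch short; the trained ranking model in the talk would take its place.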
And finally in step seven, the recommendations are sent back to the service agent.
Well, there are two more personas here. As you can see, we have the org admin and the knowledge base admin, who are also part of the customer organization. The knowledge base admin is responsible for maintaining the knowledge base: creating it, updating it, and creating knowledge articles in different languages. Those kinds of things are triggered by the knowledge base admin. That all happens asynchronously, and the content gets indexed into the search index offline.
Coming to the org admin. The org admin is the one who controls the setup flow of our recommendation system. They get to select the fields which are important in the customer data. Customer data for Salesforce is very complex in that the schema is very flexible and customizable. So it’s important for the org admin to specify which columns, which projections, are important to the customer. That is what this data setup UI is for. And then there are a bunch of other UIs. I’ve highlighted the metrics UI here, because that is what shows the customer how well their data and the model are doing. If their data shape is not right for training the model, they will get a sense of that here; at the same time, if their model performance is not up to the mark, they will also get a sense of that here.
And then finally, the Salesforce internal persona. Those may be support engineers internally, or even engineers on the team, who can jump in and help troubleshoot customer investigations.
So now let’s talk about the other side of the system, which is the offline processing system. It includes data prep, feature engineering, and the actual model training.
In data prep and feature engineering, the first step is to ingest the data from the system of record into a data lake, where the data will be processed to bring it to a shape where it can be used for model training. Right? So we do a bunch of cleansing and sanity checks on the data set once it’s ingested, and we precompute certain statistics around the data corpus. And then the feature engineering aspect generates the feature vectors, which will be used by the model training. We have 100-plus NLP features, which are generated using a cross of features from the article and the incoming question, or case. We do feature crossing, and the 100-plus features generated are categorized under six to seven statistical feature categories. Ultimately, the feature categories signal to the model how relevant the article is to the incoming case or question. We’ll deep dive into that a bit more.
And then finally, the serving and training drift. I put it here to highlight that it’s something which happens quite a bit if one is not careful to make sure the libraries are shared between the serving and the training stack, or if the data is drifting over time. So I’m just calling that out here in the feature engineering stage.
Now the model training itself. We have a ranking model, which is auto-tuned. The hyperparameters are auto-tuned using cross-validation and grid search. And we also have auto model comparison logic, which helps us pick the best model between the currently serving model and the newly trained model for a particular customer or organization. And this happens for every organization automatically.
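The automatic comparison between the currently serving model and the newly trained model could look something like the following minimal Python sketch. The talk does not say which metric or threshold drives the decision, so the use of ROC AUC on a shared held-out set, the `min_gain` margin, and the function names are assumptions.

```python
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC computed directly from its definition: the probability that a
    random positive example outscores a random negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pick_model(labels: Sequence[int],
               serving_scores: Sequence[float],
               candidate_scores: Sequence[float],
               min_gain: float = 0.01) -> Tuple[str, float, float]:
    """Promote the newly trained model only if it beats the serving model
    by at least min_gain (a hypothetical margin) on the same held-out data."""
    serving_auc = roc_auc(labels, serving_scores)
    candidate_auc = roc_auc(labels, candidate_scores)
    if candidate_auc >= serving_auc + min_gain:
        return "candidate", serving_auc, candidate_auc
    return "serving", serving_auc, candidate_auc
```

Requiring a margin rather than a strict improvement is one way to avoid swapping models on noise; running the same check per organization matches the per-org automation described in the talk.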
So that is where our strength lies: we have built a system which is now automatic in terms of training, model comparison, and auto-tuning. The interaction of the data scientists when the system is running is minimal. They are looking at the next best thing now, and there is minimal hand-holding required.
Some of the key metrics are, of course, the area under the curve; we do both the ROC and the PR curves. And then we have the F-measure, precision, recall, and finally the hit rate at K. So we look at the accuracy at the top K for all the cases.
All right. So here we’re talking about the training pipeline itself. To the left you can see the Salesforce app, which represents orgs of all shapes and sizes. The data from the Salesforce app is ingested into the data lake, at which point we can start the actual feature engineering and the model serving, and of course the model training before that. The entities we care about here are the cases, the articles, and the attachments themselves. Once the data is there in the data lake, it goes through the typical stages of preparation and precomputation. Then we have the feature engineering, where we do the actual transformation of the data columns into attributes the model can understand and train on. We then do feature crossing and generate the features, as we talked about earlier, in different statistical feature categories. Finally, we do some feature selection to make sure we are not including features which do not add value to the training itself. In the training stage, you can see the feature weights are learned, and we do the validation and comparison we were hinting at earlier. Finally, we get a winning model, and that is pushed back to the app. The agents and the admin interact with the app and get the results, which they can send back to the customer. Another key part here is the model retraining.
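Of the evaluation metrics listed above, hit rate at K is the easiest to make concrete. The sketch below is one plausible reading, assuming that an article attached to a case counts as the relevant answer for that case; the function name and data layout are hypothetical.

```python
from typing import Dict, List, Set


def hit_rate_at_k(ranked: Dict[str, List[str]],
                  attached: Dict[str, Set[str]],
                  k: int) -> float:
    """Fraction of cases with at least one attached article in the top k.

    ranked maps a case id to its recommended article ids, best first.
    attached maps a case id to the article ids actually attached to it.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not ranked:
        return 0.0
    hits = 0
    for case_id, articles in ranked.items():
        relevant = attached.get(case_id, set())
        if any(article in relevant for article in articles[:k]):
            hits += 1
    return hits / len(ranked)
```

For example, with two cases where one has its attached article at rank 2 and the other at rank 3, the hit rate is 0.5 at K=2 and 1.0 at K=3.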
You can see here that retraining is drawn as an arrow. The arrow is very simplistic, but a lot is happening here. We have an automatic retraining cycle, which happens periodically, and it can also be triggered by the customer if they change the data [murmur].
All right. So now we will cover the system evolution. Now that we know the system architecture, on both the training and the serving side, let’s look at how we evolved from a very simple system to what we have right now. In version zero, we started with a rule-based system. We didn’t really think about training a model on the first day; we wanted to first showcase that the business goals could be met by what we were trying to build. So we went with a heuristic rule-based system and integrated it with our Salesforce app. That is how it started. We signed up the first pilot, and the pilot was kind of our partner in a way, right? We learned from the pilots what their use cases were and how those could be generalized. Our first use case, however, was targeted more towards a collaborative user space, the communities specifically, but the questions coming in to the community are no different from questions which will come in a service setup. The length and the nature of the questions could be different, but at a technical level the problem has a lot of common aspects. We used bestAnswer as a positive label back in the communities days; eventually we went on to train a generalized model based on an open dataset.
And then in version one, we had the first glimpses of the model which we are building upon now. We had a ranking model, as we talked about earlier. It was trained using offline notebooks, and it was done on demand; it wasn’t automated in the first cut. Also, we had a static data set. Right? Customer data for Salesforce is very dynamic and configurable, but we didn’t start that way. We let the customer specify only the entities.
We didn’t have the facility for them to specify which fields, what projections, what selections, and what filter criteria would be applicable. We didn’t have that in the beginning. Eventually we had to do it, because customers have very specific requirements and the way each of them manages their data is unique. So we had to incorporate a setup flow where they can specify the selection and the projection on the dataset.
The next few bullet points are actually key milestones. We invested heavily in retraining so that we can keep improving the model quality as new data comes in. That is to tackle data drift: the data keeps evolving over time, and we want the model quality to keep up with the data. Then we added multilingual support to expand to our European customers and beyond English. And of course the auto-training pipeline, which I’ve been referencing quite a bit. That is a key highlight of our system as well: we now need minimal involvement from engineers and scientists to run in production. Finally, observability, deployment, and rollbacks were key to having a robust system. We invested heavily in those, and they are in production now as well.
Okay. So looking a bit more into our model deployment, continuous integration, and rollback strategy. You can see in the top flow chart that the developer or the scientist is the one continuously upgrading the training code and adding new features to it. Once that happens, it goes through a typical continuous integration build cycle. The code is built and bundled inside a training image, which gets pushed into the container registry, right? So there are no surprises there. That is something we invested in, and now we have a stable system where this happens in a very consistent fashion.
Once that happens, the deployment cycle has a rollback mechanism. The DevOps person kicks off the deployment by updating the image tags for the staging and the deployment targets. Once that happens, the images are pulled from the container registry, which was updated earlier. They get deployed to the test environment, and then to the production environment if the test environment is successful. If it is not successful, we can easily roll back. That is an option in our workflow system: they can pretty much say rollback, and the previous image is redeployed on the target tag.
Okay. So looking a bit into the container itself. This is more in line with making the container independent of the cloud itself, so that we can host it anywhere. We host it in a managed training service. It takes in some [murmurs] parameters which are needed for training: the hyperparameters themselves, the training dataset for model training, and then some static configs which are used for housekeeping inside the container. Once the training happens, we get the model. It is pushed to a storage bucket on the cloud, and then it is exposed through a model API, and the serving system picks it up from there.
All right. So before we wrap up, let’s talk about the challenges and the takeaways. The first category of challenges is around data: namely privacy and sharing compliance, handling encrypted data at rest and in motion, and data freshness, for which we had to build a hydration pipeline to make sure the data is always up to date and the models, as a result, are also up to date. Then there is tackling data that is too sparse or too dense. If the data is too sparse, it might not meet the requirements for the model to be trained successfully. So we have a backup of using a global model, which lets the customer get started. That is also in line with the cold start problem I mention here at the bottom.
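The global-model backup for sparse data can be pictured as a simple selection rule at serving time. This is a minimal sketch under stated assumptions: the talk does not give the actual sparsity criterion, so the `min_labeled_pairs` threshold, the `OrgStats` type, and the model identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class OrgStats:
    # Number of labeled (case, article) pairs available for this org,
    # for example from article attachments. Hypothetical measure of sparsity.
    labeled_pairs: int


def select_model(org_id: str,
                 stats: OrgStats,
                 org_models: Dict[str, str],
                 global_model_id: str = "global-model",
                 min_labeled_pairs: int = 1000) -> str:
    """Return the id of the model to serve for an org.

    Falls back to the global model when the org has no trained model yet
    (cold start) or too little labeled data (sparse data).
    """
    org_model = org_models.get(org_id)
    if org_model is None or stats.labeled_pairs < min_labeled_pairs:
        return global_model_id
    return org_model
```

Handling cold start and sparse data through the same fallback keeps the serving path uniform: every org always resolves to some model.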
Another key highlight I want to make is the custom and non-standard fields, right? That is something unique to Salesforce, which is also a very complex platform that lets customers customize their fields. We have to take that into account and make sure that the training pipeline can handle all the corner cases around custom fields. Building the ML infrastructure along the way was, of course, a challenge. We had to invest in that and also learn along the way. And of course training-serving skew, which is a common problem. This is an easy trap to fall into. If the libraries are not shared, or if there is a feedback loop, for example, from the model back to the training algorithm, this can easily happen. So I would say this is a good learning to watch out for as well.
And then along the same lines, the takeaways. I would say start small, ship, and iterate. Prioritize your ML infrastructure from day one so that you can train models in a very consistent fashion and not just as one-offs. And then start with a simple, interpretable model. That will help you debug, and the time to resolution for customer issues is very critical in the beginning, as you’re onboarding pilots. So I would say start with a simple, interpretable model, and also take into account the size of your data, because that will influence the type of model architecture you choose. Finally, I would say prioritize your observability and data privacy. That’s something we talked about earlier, and it is very important. Always prioritize your data privacy over model quality. And yeah, invest in your infrastructure overall. Perfect. Thank you everyone.
Aditya Sakhuja is an Engineering Lead at Salesforce Einstein building ML products. He built the early prototype of a question answering system in Salesforce's ML journey and helped ship multiple ML products over the next few years in the service and collaboration space, including article recommendation systems and case classification, and worked with external partners like Google. Earlier on, he focused on pre-production performance and scalability analysis for distributed systems like message queues and search, gaining expertise in building and measuring low-latency, highly scalable systems. He has extensive knowledge of web and enterprise search systems and production-deployed ML products, along with offline and streaming data and serving systems. He received his Master's in CS from the Georgia Institute of Technology and his BS in Computer Engineering from the University of Pune, India.