Equipment maintenance logs for the global fleet are traditionally maintained using legacy infrastructure and data models, which limits the ability to extract insights at scale. However, to impact the bottom line, it is critical to ingest and enrich global fleet data to generate data-driven guidance for operations. The impact of such insights is projected to be millions of dollars per annum.
To this end, we leverage Databricks to perform machine learning at scale, including ingesting structured and unstructured data from legacy systems, and then sifting through millions of nonlinearly growing records to extract insights using NLP. The insights enable outlier identification, capacity planning, prioritization of cost reduction opportunities, and the discovery process for cross-functional teams.
Sumeet Trehan: Good afternoon, everyone. Welcome to my talk. In the next few minutes, I’m going to discuss how we do applied ML at scale to provide world-class data-driven guidance that impacts our day-to-day operations. I’m Sumeet Trehan. I lead and manage the engineering team behind this product. So what is this product in a nutshell? This product sifts through text-heavy unstructured data and aims to answer three fundamental questions for us: what happened, why it happened, and when it happened. And by the word “it,” I’m alluding to an event such as a piece of equipment failing during operations. So we want to understand what component of a piece of equipment failed, what was the underlying reason why it failed, and last, what is the frequency of such a failure. Here’s my engineering team, which comprises applied ML scientists and ML engineers. Here’s the agenda for our talk. I’ll first talk about the business driver and then go into the technical details of how our system looks during model development, how our system looks during runtime, and what the architecture of our system is.
And then I’ll also spend a few minutes talking about the data science, the NLP algorithm that we use under the hood to answer the three questions: what happened, when it happened, and why it happened. So let’s spend a few minutes to talk about the business driver. At ExxonMobil we operate a wide array of equipment across the globe. For example, on the left is a floating vessel that we operate in Latin America. And on the right, we have complex machinery that we operate in Asia. For every piece of equipment that you observe here, for example, the antenna on the floating vessel or the helipad, we maintain a detailed service, or maintenance, log. The business driver for our product is: “Can we sift through the text-heavy maintenance and service logs and answer the questions: what happened, when it happened, and why it happened?” And by the word “it,” I’m alluding to an event such as the failure of a piece of equipment or an equipment component.
By answering those three fundamental questions, we can then put together a story that can help our end user derive insights such as what’s an outlier, or whether a specific piece of equipment fails faster than others. It can help our end user with capacity planning. It could help our end user with prioritization of maintenance tasks. So using these insights we are able to impact the bottom line of our company. And it is projected that this product can help the corporation save millions of dollars on an annual basis. Now, before I talk about the architecture of the product, let me take a minute to talk about the challenges that we faced as we were developing this product. I classified those challenges into three broad categories. First is the infrastructure. We have legacy systems and infrastructure, which is where we store our data. So as we were developing the architecture for our product, we had to be cognizant of the fact that our data sits in our legacy system.
Second, as we were scaling up our prototype, and we were thinking about putting our machine learning models into production, we had to carefully think through how we can scale our models, how we can rescale our machine learning algorithms to operate at ExxonMobil scale.
Third is data quality, which feeds directly into the complexity of our NLP algorithms. When we were doing exploratory data analysis, or EDA, we observed a big variability in our input data. For example, a service log for a given piece of equipment was expressed very differently by two engineers. In other words, the same kind of servicing on the same equipment was expressed very differently in our service logs. Which means when we designed our NLP algorithms, we had to take that variability into account.
Next, let’s talk about the solution. At a high level, we ingest our data from our legacy system into a cloud data warehouse. Once the data has landed in our data warehouse, we then do model serving and model training using Azure Databricks. And here is the architecture diagram. Now for the purpose of this slide I’m going to focus on the batch data. So our batch data, which sits in our legacy system, is first ingested into our cloud data warehouse, which is Snowflake. That’s in the second column under storage. Now, once the data has landed in Snowflake, then we shift into model training or model serving mode. For example, during model training, we first read the data from Snowflake using SQL. Or if it’s supervised learning, we load the data, which is labeled data, from Azure Blob Storage. We spin up a Spark cluster. We train our model. Let’s say it’s a deep learning model.
We spin up a GPU cluster, we train a machine learning model, and then we use MLflow for model registry. So, in the next slide, as we talk about model development, what you will observe is that we use MLflow to compare different models or do the experimentation. Now, when it comes to model serving and the inference phase, we load the model using the REST endpoint that we have, and then do an inference. And we write our findings, or the output, back to Snowflake, which is shown on the very right of the slide deck. Now, when we talk about output, we are alluding to answers to those three questions that we posed earlier: what happened, when it happened, and why it happened. So we use NLP algorithms to help us answer those questions.
Next, let’s dive a little bit deeper into the architecture and see what our system looks like when it comes to model development, ML pipeline development, and runtime. So here’s what the system looks like in the model development phase. When an applied ML scientist wants to experiment or develop a new model, they use the Azure Databricks workspace. Now within this workspace we use a Jupyter notebook to help us experiment with different models. Now to develop a model or experiment with different models, the first thing we do is we load the data, let’s say from Snowflake if it’s unsupervised learning. Now in the context of supervised learning, we’ll load the labeled data which sits in Azure Blob Storage. Once we have the data, the relevant input data, then we also pull the Common Utils. Now what we’ve done in this product is bundled all the different models, the common utilities, into a Python package.
So we load this Common Utils package, along with our labeled data, and then spin up a Spark cluster to do our model training. Now, the Spark cluster might be simple commodity machines, let’s say 20 machines with 14 gigs of RAM. Now, once we have done the training, or we are in the process of experimenting with different models, we leverage MLflow to help with model versioning, to help compare different models, to see which model is the best, and then save the model that we want to use for production. Before I move on to the next slide, let me take a minute to talk about those Common Utils. We bundled all our Common Utils in a Python package. And what we observed is that it helps us enforce a schema. It helps us introduce standardization. It helps an applied ML scientist avoid boilerplate copy-paste.
It also abstracts away the IO of data, models, and other assets, whether by type, location, or format. It abstracts away all those pain points from an applied ML scientist’s perspective. And last, during the model development phase, this Common Utils package can be configured with a file, allowing us to swap in a local backend during the initial development or unit testing phase. Next, let’s talk about how our system looks from an ML pipeline development perspective. We break down our product into a number of small, independent units of work, or building blocks. We call these building blocks steps. So we can run these building blocks, or steps, in parallel, or we can run them in series, or we can run them independently, depending on what output we desire. We then use an ADO pipeline to help us build these building blocks or steps.
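The file-configured backend swap described for the Common Utils package can be illustrated with a small sketch. This is not the team’s actual package: the names `Backend`, `LocalBackend`, `InMemoryBackend`, and `backend_from_config`, as well as the config keys `backend` and `root`, are all hypothetical, and the in-memory class merely stands in for whatever cloud storage client a real package would wrap.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Protocol


class Backend(Protocol):
    """Minimal IO interface that calling code depends on (hypothetical)."""

    def read_text(self, name: str) -> str: ...

    def write_text(self, name: str, data: str) -> None: ...


class LocalBackend:
    """Stores assets as files under a local directory (development, unit tests)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def read_text(self, name: str) -> str:
        return (self.root / name).read_text(encoding="utf-8")

    def write_text(self, name: str, data: str) -> None:
        (self.root / name).write_text(data, encoding="utf-8")


class InMemoryBackend:
    """Stand-in for a remote store; a real package would wrap a cloud client."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def read_text(self, name: str) -> str:
        return self._blobs[name]

    def write_text(self, name: str, data: str) -> None:
        self._blobs[name] = data


def backend_from_config(config: dict) -> Backend:
    """Pick a backend from a parsed config file (keys are assumptions)."""
    kind = config["backend"]
    if kind == "local":
        return LocalBackend(config["root"])
    if kind == "memory":
        return InMemoryBackend()
    raise ValueError(f"unknown backend: {kind!r}")


# Usage: the calling code is identical regardless of which backend is configured.
with tempfile.TemporaryDirectory() as tmp:
    store = backend_from_config({"backend": "local", "root": tmp})
    store.write_text("labels.json", json.dumps({"pump": 1}))
    loaded = json.loads(store.read_text("labels.json"))
```

Because the scientist’s code only ever talks to the `Backend` interface, switching from local files to a remote store becomes a config change rather than a code change, which is the kind of abstraction the talk attributes to the package.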
So the ADO pipeline helps us build and copy the binary distribution into the Databricks File System, or DBFS. Now in the next slide we will see how we use this at runtime. During runtime, all the individual steps or building blocks are wrapped in ADF, or Azure Data Factory, activities and composed into nodes in a DAG, which we trigger daily. Activity nodes, or the building blocks, are Databricks jobs that are sent to a cluster; in our case it’s 20 commodity machines with 14 gigs of RAM. So at runtime, each job will pull data from various resources. For example, in this slide, if you focus on step two, that’s building block two, it pulls the relevant input data from the ingestion zone of Snowflake, which is in the third column from the very right.
It might also pull in the necessary labeled data from Azure Blob Storage, and then it would spin up a Spark cluster to train a machine learning model, and then use that model to make a prediction. Or, at inference time, it will just load that model using the REST endpoint, make the prediction, and write the prediction back to the enrichment zone of Snowflake. So if you have N different steps or N different building blocks, then each building block, or each step, or each node, as it runs, creates its own table in the enrichment zone of Snowflake. At the end, when we want to feed this data into the dashboard so that our end customer can look at the final results, we do a SQL join of all the tables that we have populated in the enrichment zone of Snowflake, and then feed that resulting table into our dashboard, which powers the insights that our engineers, the folks who are the boots on the ground, use to make their decisions on a daily basis.
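The final join of per-step tables can be sketched with Python’s built-in `sqlite3` module standing in for Snowflake. Everything specific here is an assumption made for illustration: the table names (`step_what`, `step_why`, `step_when`), the `log_id` join key, the column names, and the sample rows are invented, and a `LEFT JOIN` is only one reasonable choice for keeping logs that a given step could not enrich.

```python
import sqlite3


def load_example_tables(conn: sqlite3.Connection) -> None:
    """Create one table per step, as if each step had written its own output."""
    conn.executescript(
        """
        CREATE TABLE step_what (log_id INTEGER PRIMARY KEY, component TEXT);
        CREATE TABLE step_why  (log_id INTEGER PRIMARY KEY, cause TEXT);
        CREATE TABLE step_when (log_id INTEGER PRIMARY KEY, failure_date TEXT);
        -- Made-up rows; log 2 has no entry in step_why.
        INSERT INTO step_what VALUES (1, 'pump'), (2, 'transmitter');
        INSERT INTO step_why  VALUES (1, 'seal wear');
        INSERT INTO step_when VALUES (1, '2021-01-05'), (2, '2021-02-10');
        """
    )


def build_dashboard_rows(conn: sqlite3.Connection) -> list:
    """Join the per-step tables into one row per maintenance log entry."""
    query = """
        SELECT w.log_id, w.component, y.cause, n.failure_date
        FROM step_what AS w
        LEFT JOIN step_why  AS y ON y.log_id = w.log_id
        LEFT JOIN step_when AS n ON n.log_id = w.log_id
        ORDER BY w.log_id
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_example_tables(conn)
dashboard_rows = build_dashboard_rows(conn)
for row in dashboard_rows:
    print(row)
```

Keeping one table per step means a step can be rerun or added without touching the others; only the join query feeding the dashboard needs to know about all of them.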
Next, let’s switch gears and talk more about the data science. What does our NLP workflow, which helps us answer the questions what, when, and why, look like? So let’s take a look at our NLP workflow. Here’s an example of our workflow at a 50,000-foot view. We ingest the raw data from Snowflake. That’s the cloud data warehouse we saw in our architecture slides. Then, we do the cleanup using regular expressions. We do tokenization and feed those tokens into a fastText model. And once we have the embeddings from the fastText model, we take those embeddings and feed them into a classifier, which helps us answer the questions: what, when, and why. Now, if you look at the first sample on the very left, which says “the XYZ pump has failed” (we anonymized it, by the way), from that sentence we can clearly see that we are talking about a pump.
So the expected output in this case should be that the XYZ pump is the component, or the equipment, that has failed. In the next slide and the slide after that, we’ll go a little bit more into how we actually do this. This is what our workflow looks like during the training and the inference stages. For simplicity let’s start with the inference stage. During inference, we load our relevant input data and go directly to step two. In step two, we load the fastText model, and then we generate embeddings. Now, of course, I assume that we have done the cleanup and tokenization before we run the data through our fastText model. So once we have the embeddings in step two, we then directly jump to step four, where we load our classifier, our supervised machine learning model, to help us understand what failed given the input data. Depending on the prediction quality, if we have enough confidence, then from step five we can directly go to step six, where we write that prediction back to Snowflake.
However, if we did not have enough confidence in our prediction, then we go directly from step five to step seven, where we use an unsupervised linguistic model to help us answer the question: what failed. Now, this is what our workflow looks like during the inference phase. Let’s see what the workflow looks like during the training phase. So in the training phase, we first go to step one, where we load the training data, the labeled data, which sits in Azure Blob Storage by the way, into memory. So once we have the labeled data, we clean up, we do tokenization, and we go directly to step two, where we have the fastText model generate the embeddings. From embeddings we go to step three.
In step three, as we saw when we looked at our system from the model development perspective, we’re going to spin up a Spark cluster, train a model, register the model using MLflow, and then leverage that model during the inference period. So this is what our workflow looks like during the training and the inference phases. The key takeaway is we use a hybrid model, a supervised and an unsupervised model, for our inference. And as I mentioned before, if we do not have enough confidence in our predictions, then we use the unsupervised model. So our unsupervised linguistic model tends to act like a human. So when it looks at raw input data, for example, the one shown here, which says “the TX on the P-1234 has failed,” it starts to look at this data and says, “Aha, I have seen something similar in the past and I know TX is a shorthand for transmitter. At the same time, I’ve seen the pattern that P- is a shorthand for a pump.”
Next, our linguistic model looks at the sentence and says, “Where are my nouns, verbs, prepositions, adjectives?” So once it has done all that, it is able to predict with reasonable confidence what failed based on this raw input data. So for example here, our linguistic model tells us that the transmitter on that pump has actually failed, along with the motor, because it’s a noun. So given the input text, our predictions will be pump, transmitter, and motor. Again, the key thing I want to highlight here is that we have both a supervised and an unsupervised model. And the idea is that over time, as our supervised model becomes better and better, a lot of our sample predictions will actually come from the supervised model.
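The hybrid inference path (regex cleanup, tokenization, a supervised prediction, and a fallback when confidence is low) can be sketched as follows. This is a simplified illustration under stated assumptions, not the production model: the 0.8 threshold, the shorthand tables, and the component vocabulary are invented, the classifier is any callable returning a label and a confidence, and a plain vocabulary lookup is swapped in for the part-of-speech analysis that the talk’s linguistic model performs.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical lookup tables; a real system would learn or curate these.
ABBREVIATIONS = {"tx": "transmitter"}
TAG_PREFIXES = {"p": "pump"}  # e.g. "P-1234" denotes a pump
KNOWN_COMPONENTS = {"pump", "transmitter", "motor"}
TAG_PATTERN = re.compile(r"^([a-z]+)-\d+$")

Classifier = Callable[[List[str]], Tuple[str, float]]


def tokenize(text: str) -> List[str]:
    """Lowercase, strip punctuation except hyphens, split on whitespace."""
    cleaned = re.sub(r"[^a-z0-9\- ]+", " ", text.lower())
    return cleaned.split()


def linguistic_fallback(tokens: List[str]) -> List[str]:
    """Expand shorthand and equipment tags, then keep known component words."""
    found: List[str] = []
    for token in tokens:
        word = ABBREVIATIONS.get(token, token)
        tag = TAG_PATTERN.match(token)
        if tag:
            word = TAG_PREFIXES.get(tag.group(1), word)
        if word in KNOWN_COMPONENTS and word not in found:
            found.append(word)
    return found


def predict_what_failed(
    text: str, classifier: Classifier, threshold: float = 0.8
) -> Tuple[List[str], str]:
    """Use the supervised prediction when confident, else fall back."""
    tokens = tokenize(text)
    label, confidence = classifier(tokens)
    if confidence >= threshold:
        return [label], "supervised"
    return linguistic_fallback(tokens), "unsupervised"


def unsure_classifier(tokens: List[str]) -> Tuple[str, float]:
    """Toy stand-in for a trained model that is not confident."""
    return "unknown", 0.2


sample = "The TX on the P-1234 has failed along with the motor."
components, source = predict_what_failed(sample, unsure_classifier)
print(components, source)
```

Returning the source of each prediction alongside the prediction itself makes it straightforward to track how often the fallback path is used.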
Now during the inference phase, if we observe that a lot of our predictions are made by the unsupervised model, then at some point, in an offline stage, we go back to this entire population where the inference was done using the unsupervised model. We randomly pick some samples from this population, have a human look at them and do the labeling, and then use that labeled data to help us improve our supervised model again. So we have this kind of human-in-the-loop process, which we operate along with the unsupervised model. To summarize, at ExxonMobil we operate a wide array of equipment. And for each piece of equipment we maintain a service log, which is recorded as text in our legacy system. So in this product, we were able to come up with an architecture that can ingest that data from our legacy system, use the Databricks workspace to prototype and develop machine learning algorithms, and then scale those algorithms to ExxonMobil operation scale.
We also leveraged MLflow to help us with experimentation, development, and comparison of different models. This product provides us insights that can help us with outlier identification, capacity planning, and prioritization of maintenance tasks. So with this product what we have observed is a clear line of sight from our maintenance, or service, logs to how we use those logs to answer those three fundamental questions: what happened, when it happened, and why it happened. And by answering those three questions using NLP, we were able to extract insights, and those insights impact the decisions that we take on a day-to-day basis. And those decisions are projected to have an impact on our bottom line on the order of millions of dollars on an annual basis. Thank you very much for your time. Happy to take questions.
Sumeet has held various staff and leadership roles focusing on Applied ML at ExxonMobil. Previously, he was at Stanford University where his PhD focused on using Applied ML to solve real world energy ...