The ever-increasing interest around deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g. GPUs).Traditional machine learning (ML) such as linear regressions and decision trees in scikit-learn cannot currently be run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.

In this talk, we’ll show how you can use Hummingbird to achieve 1000x speedup in inferencing on GPUs by converting your traditional ML models to tensor-based models (PyTorch andTVM). https://github.com/microsoft/hummingbird

This talk is for intermediate audiences that use traditional machine learning and want to speedup the time it takes to perform inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models to be able to try them out on GPUs.

Outline:

- Introduction of what ML inference is (and why it’s different than training)
- Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
- Why “traditional” ML methods are important
- Introduction of what Hummingbirddoes and main benefits
- Deep dive on how traditional ML models are built
- Brief intro onhow Hummingbird converter works
- Example of how Hummingbird can convert a tree model into a tensor-based model
- Other models
- Demo
- Status
- Q&A

Karla Saur, Senior Research Software Development Engineer, Microsoft

Matteo Interlandi, Scientist, Microsoft

Karla Saur: Hi, I’m Karla. Today, along with Matteo, I’m going to talk to you about Hummingbird, an open source library from Gray Systems Lab at Microsoft. This was developed along with several collaborators from the open source communities and PyTorch and PBM. Hummingbird is a library for Machine Learning Prediction Serving, but first, I’ll go a little bit into an overview of what that means.

When you’re getting started with machine learning, you have a bunch of data. You want to perform some learning from training process, which will then output a model. Once you have the model, you’re ready to deploy that model to some sort of server and serve that to a bunch of users. The second part, where this prediction serving is happening, that’s the focus of Hummingbird. So this prediction serving and modeling serving will be what I’ll be referring to today, versus the initial phase, the training phase. When we got started looking at this model serving, we did a survey of what systems already exist in practice today.

There’s a lot of specialized systems that have been developed such as ONNX Runtime, TensorRT and TVM. And we noticed that a lot of these systems focus on Deep Learning or DL, and there wasn’t nearly as much in the space of traditional machine learning methods. It’s largely overlooked, so for example, scikit-learn, ML.NET and H2O.ai.

To talk a minute about what traditional machine learning is and why it’s important, I have this graph here from a Kaggel survey, and this survey asks a bunch of machine learning professionals and data scientists, what sort of models they use in their work these days, and a survey overview of the broader space of the model. The graph in green, you can see linear logistic regression, decision trees, random forest, gradient boost, and Bayesian models. These are all examples of traditional ML models.

And so they’re very prevalent everywhere, and very important that we make sure to optimize as much as possible. Another data point and Microsoft survey, we gathered a bunch of open source Python notebooks on GitHub and compared across two years, which libraries were the most imported. So the first few winners there are mostly data manipulation libraries and in fourth is scikit-learn, which is one of the main traditional ML model frameworks. It grew a lot between these two years and it continues to grow a lot today versus say TensorFlow, which is seventh or PyTorch, which is 12th, not pictured in the neural networks space. They’re still growing and still up and coming, but it’s very important that we are able to optimally support scikit-learn and traditional ML models.

So the problem is the lack of optimizations for traditional ML Serving. These systems for training ML models are not necessarily optimized for the serving part and this is because the traditional ML models are expressed using imperative code in an ad hoc way. So it’s very difficult to do it, even generalization in a way to optimize them using a shared logical abstraction and because of this, these traditional ML models cannot easily natively exploit hardware acceleration. So across a lot of these traditional ML frameworks pictured here, if I wanted them to run on a particular piece of hardware, I’d have to write specialized code for each of the operators, across each of the frameworks, for each hardware system. So it doesn’t really generalize well, or it doesn’t really extend or easily scale. To understand a little bit more about why, we’re going to take a quick look about how a traditional ML models look on the inside.

So on the left here, I have a data frame. I have two categorical features, A and C and then two numerical features, B and D. I’m going to feed them into my traditional ML model, which currently is a black box. The example here is of a binary classification and you can see on the right, I have a zero versus one. I expect that to be my output. Looking inside, we have a DAG directed acyclic graph of operators. You may have heard this called as a pipeline. On the left, there’s a bunch of featurizers which are going to do some processing of the data and on the right, we have the predictor, which is going to actually output the prediction.

Stepping through this for a second. First, what we’re going to do is split the incoming data into categorical features and numerical features, where the numerical features will go to the top node and be normalized with the scaler and the categorical features will be one hot encoded in a way that they can be easily processed in hardware. Then, we’ll concatenate these two vectors together, merge them back, feed that data into the final score to compute the final score, so logistic regression.

So this is just one very simple example of what a model might look like. Across all of the traditional ML frameworks, there’s hundreds and hundreds of these featurizers and predictors so it’s hard to represent them in a concrete and a way that makes sense across all of the different frameworks and all of the different options. We contrast this with deep learning. In deep learning, we primarily rely on the abstraction of tensors and tensors is basically just a multi-dimensional matrix. So if you look inside a deep learning model, deep learning models are also expressed as a DAG, but specifically focused on tensor operators. So here you can see, I have this very generic matrix multiplication, addition, those sort of mathematical operations that are easily represented across a wide variety of systems.

The people that built these systems for Deep Learning Prediction Serving are able to capitalize on this, the fact that we’re using tensor operations and exploit that abstraction to work across lots of different target environments. So at the top here, I have, for example, PyTorch, TensorFlow, just to name a few, and I’m able to take that and compile them down into this prediction serving framework.

From there, easily convert that to my target hardware, such as GPU, SPGA, TPU, that sort of framework. The benefit of this is tensors and the tensor operations are efficiently implemented so we have that as a huge plus, and it’s due to the fact that these tensor operations are declarative. It’s powerful enough to detach how you express the neural network from how you execute it, which allows the compiler to do some additional optimization.

This gives us seamless hardware acceleration. So I can just write my model in PyTorch and then target that to run on any target hardware. From this, we claim we reduce engineering efforts because we very simply are able to write once and run anywhere.

So putting this back into the broader picture, where I see at the top, the traditional ML frameworks that I talked about earlier, we’re not able to immediately run those. So for example, if I have my second learn model, I’m not able to run that natively for prediction serving on GPU. I have to do something additional and this is where Hummingbird comes in. Hummingbird is able to translate or compile these models into a tensor based framework, which allows us to leverage these existing tools in the deep learning space. Next, Matteo is going to tell you a little bit more detail about exactly how that works.

Matteo Interlan…: Thank you, Karla, and let me explain to you how Hummingbird [inaudible] works. So our observation is that pipelines are actually composed by two classes of operators. One, is algebraic operations. Think about like linear regression, where you basically are doing a bunch of metrics and multiplications and the second class of operation are the algorithmic. For this operations, think about RandomForest or OneHotEncoder, where you basically exploit, where you basically have to implement an algorithm for doing, for instance, three throwers. So as you can imagine, translating algebraic operations into Tensor is actually straightforward because we just used the Tensor Linear, a Linear Tensor section, to implement the metrics better operations. Regarding that grid, your operation, [inaudible] instead that you started more complex. This is because Algorithmic Operator, they basically implement arbitrary control-flow patterns and data access patterns.

So our solution to solve this specific problem is to try to introduce redundancies, but at the computational and storage level, such that we can easily target the attempts for obstruction. While we do this, we basically try to have different implementations for each of these single operations, such that we can target a specific implementation based on the statistic of the trained algorithm. So I imagine that this looks a little bit too eye level so let me go over an example with you. Here on the left, you can see a simple decision tree. In this decision tree, we have the four internal nodes and five leaf nodes, and we are basically using five input features to do inference over this decision tree. So let me explain how Hummingbird differs to the inference over this decision tree. So hummingbird uses four [inaudible 00:09:52] not our structure.

So the first two are the Metrics A and Vector B. So the Metrics A basically have one column for each of the internal nodes and one row for each of the features. Basically Metrics A is used it to include which feature is used by which decision not. So that means that if you look at the first row, basically means that the first decision note we’re using for feature number three, the second decision note we’re using for feature number two, so on and so forth. Then, the effect of B is just basically storing the three shows, one for each of the decision nodes. You have other two that are structured. Metrics C is used to basically encode the given note that the path in order to reach a leaf node. So we in C we basically have one column for each of the five leaf nodes in this example, and one row for each of the decision nodes.

If you look at the first column, basically, this is telling me that in order to reach leaf node one I have to go left. If the decision is not what I want and left the for decision number two. So one here basically says, go left. One minus one means go right. If you see the column number two instead of this is for leaf node two, basically it’s telling me that I have to go left one and right once. For the zero, basically means that we don’t use that specific node in order to do this traversal. Finally, the Vector D is basically incoding the path in order to reach a specific leaf node. So we have one column for each of the leaf nodes and the way in which we are coding the path here is summing up the number of one.

So in order to reach the leaf number one, the leaf node number one, we have to sum up the two ones. For leaf node number two, we only have one so on and so forth. So now that I introduce this data structure, let me go over the actual algorithm. So we start with metrics, vector multiplication between the input features and the metrics A that was encoding the way in which the decision to not use the input feature. The output is a vector that basically contains the specific position which the features are used. Then, we basically do a simple comparison between this vector with the feature with the [inaudible] three should that we store in the vector B. This is basically telling us we started the decision, not the directory trigger, given this input features.

In this case, it would be decision not three and four. Then we use this vector and we multiply it with the metric c to evaluate the path that reach the leaf nodes. We do a comparison between the path of that we generated by doing this metrics multiplication with the path that we computed to store into the vector number D. In this case, in this comparison, the only leaf note that you want is leaf note number four. That means that for the input feature that we use, the leaf number four is the final result. Of course, I explained this example with the simple decision tree, but you can assume that for the same algorithm it also works if you have a tree assembled and basically we do this by batching all the trees together.

One interesting thing about this approach, that we call GEMM, because despite the generalized metrics multiplication implemented into tenser run times, is that we’re basically evaluating all the decisions and all the leaf nodes, all the possible parts, [inaudible] religion, if they live not altogether. That is what I meant when I say the computation redundancy. If you think about this, this is kind of strange because we are basically saying that we are doing exponential more work than you will do when you use or implement your regular decision tree traversal algorithm, but the interesting part is that since we can map this to a metrics multiplication, we can explain the domestic parallelism that is coming that we have in the modern armor. This allows us to achieve actually really good performance. Of course, this kind of algorithm won’t work in all the cases specifically, if you have like a really deep three, this won’t work because you’re doing a lot more comparison than in the optimal case.

In fact, in [inaudible] non-member, we actually have two different tree implementations. So, one that we use, that’s called the tree traversal that we use when the trees are more deep and there is another one that is called a perfect tree traversal which instead we use when the trees are more like a bushy shape. Let me go over this algorithm really quickly. So for tree traversal method, we store the information of the tree using five different vectors. The first vector basically stores all the left Ids. The right Ids are stored in the second one, then we restored the feature Ids in the third one, the thresholds Ids in the fourth vector, and the labels Ids in the last vector.

This vector basically has one column for each node that you have in the decision tree. So in this decision tree, that we use as example on the left, we have eight nodes. That means that this vector basically store it, have eight columns. How you read it is if you’re starting with the column zero, we basically say that if for the number zero, if you evaluated the condition and the condition is true, you go left. If you go left, the next node that you need to follow is not one. If instead you go right the next node that you need to follow is node two. In order to evaluate node zero, you basically have to take the feature ID with number three and the [inaudible 00:15:57] picture that you will be using is 0.5.

In this [inaudible] traversal, we use minus one, if that is for specific values or ques. For instance, since node number zero, this is a decision that we have minus one for the label, because there is no label attached to this node. Now, let me go over the algorithm again in this case. So we started with the root node, which is the node that we look at the feature ID at the vector, continue the feature ID, and we gather the feature Id at the column zero. Then we are using this feature Id we got from the feature vector. Then, we do another gather operation for picking up thresholds [inaudible] for node zero, then we basically have ,at this point, all the values that we need.

So we do evaluation between the feature and the tree traversal that we [inaudible] start distorting the tree. Then we use the wire, a tensor operation to basically say, okay, if the condition is evaluated to [inaudible] through do a lookup over the left vector, otherwise, we’ll do a lookup on the right vector. This is basically telling us which is the next node that we have to look at. We basically repeated this algorithm until we reach the final, the max of the tree. The perfect tree traversal method, the algorithm is really similar to the tree traversal method. The only things that we do is we introduce [inaudible] competitive redundancies because we make the tree perfect. This allows us to be a little bit more efficient on the algorithm side, but to introduce, of course, at the storage level.

So far I only explained like three algorithms. Hummingbird actually supports around 50 different algorithms, and we don’t support just three, but support for instance, linear model. We support some of featurization such as [inaudible 00:17:50] coding. We support PCH, etc. Let me briefly introduce the high-level design for Hummingbird. Hummingbird gets its input at training model implemented either on scikit-learn, extra boost, light GBM library, as well on extra time. We recently also added a few predators on Sparkman. So once you get past the train pipelines to Hummingbird, Hummingbird translates these pipelines into tensor programs, and then you can run these tensor programs using one of the deep learning predictions serving systems. For the moment, I remember support exporting tensor programs into Pytorch and TorchScript into ONNX RUNTIME or into tvm.

We’re working on trying to add support for other input libraries, such as Amelda net in each school and we also welcome contributions if someone wants to help us to extend the support that library or support the deep learning real-time such as TensorFlow and Telstra team. Let me give you a summary on the numbers that you can reach by using Hummingbird. For this experiment that I’m going to introduce, I’m going to use an Azure and NC6 machines with a XML processor with six car. One [inaudible] under under the GPO. For this experiment we basically crawl the OpenML-CC18 benchmark, and we got to the 72 data set and we trained around 300 secretarial pipelines. Among all the pipelines that are inside this open Seattle benchmark, I remember it was able to translate around 80% of them.

For the pamphlets that I’m going to show, we basically took the data set, do the 80/20 split. The training, we’ll use an 80% and do the testing using the 20%. There are many 20%. For the numbers I’m going to show, we use the touch screen backend. We export the model using Torch, and then we converted the Torch model using the Gita framework into a TorchScript problem. So let’s start with the CPU numbers. For the CPO numbers, we were able to facilitate around 60% of this OpenML pipelines. In the worst case, we are around 60 X lower than the [inaudible] regional pipeline, but in the best case, we were around two or three order of magnitude faster. If then we look at the GPU numbers, we were able to accelerate 73% of the pipelines.

In this case, we were in the worst case, we are almost [inaudible] two lower. In the best case, we are still more than three order of magnitude faster. So the slow down is mostly due to the fact that we just kind of run all day OpenML pipelines and some of them, they contain really sparse input data that doesn’t get any benefit to my GPS [inaudible 00:21:07] generations. Some of the models were really small like a decision tree with three nodes and in that case acceleration provided by non member, destination provider member it doesn’t help. Now it’s time for a real demo for Hummingbird.

Karla Saur: So, as you can see here, we have a notebook set up. The first thing that you need to do is you need to do PIP install, hummingbird ML, if you haven’t already installed it. Here we have so we’ll skip it. For this demo, I have everything, my model pre-trained so I’m going to need to import pickle to unpickle that. Highlighted is from hummingbird.ml import convert. This is the main function that we’ll be using. For this demo we’re going to be using this year data set, which predicts the year that a particular song was written in and we’re going to open the pickle data, load it up, and we’re going to just print the shape of it. We see around 50 K rows or so, and just a quick look at the data so we can make sure everything looks good.

So now we have our data. We need to load the model. We have our pre-trained model. It’s of, as I mentioned, light GBM. We have 500 estimators and we’re going to go ahead and read that model in and unpickle it. Load it up. Okay, the next thing we’re going to do is we’re going to time this model with light GBM. This will be CPU only because it only works for CPU. In this demo, I have pre-typed all of the timeit instructions. I have the field that we’re going to save stored and then the timeit instructions, but we’re going to type the commands that are used particularly to hummingbird in prediction. We’re going to type lgbm_model.predict predict over the data. We’re going to go ahead and let that run. The timeit function is going to run this a number of times and we see that it’s 800 milliseconds.

The mean plus standard deviation of seven runs is what it shows. Now to demonstrate hummingbird, we’re going to convert this to Pytorch. So we type model = convert (lgbm_model .) ‘torch’ in this case, and we can see that that gets loaded. Now we have our models. So we’re going to first run this on Pytorch with just the CPU and notice above we have the model predicted instructions. We’re going to copy and paste that exact thing. The syntax is identical so model dot predicts and let it run.

So we have 1.9 seconds. A lot of times Pytorch CPU will be longer. Sometimes it will be faster. It just kind of depends on the data set and the model, and also what underlying hardware you have. Now, time we play with the GPU’s Pytorch provides this very simple instruction to switch from CPU to GPU, just model.to Kuda. This is a PI torch commands, and that will switch the model so now you’re ready to run with GPU. Again, we can copy and paste from above the model dot predict of the data, hit enter.

We have 48 or so milliseconds, so significantly faster than before. That’s Pytorch. Now I’m going to show you TVM, which is a compiled framework so we’re going to be able to get even more additional speedups. We’re going to start again with the light GBM like we had before, and we’re going to say, modeled TVM equals convert. This time we’re going to go to TVM. I’d give the data. All right. Now again, very similar, exact same syntax, but we’ve called our model model TVM in this case. So model TVM dot predict with the data.

At the end, I will show you a graph with all of these values, but this CPU time is faster than the Pytorch CPU time. Going to go ahead and try this out with TVM and GPU so we can reuse the convert command. This time we’re going to tell it to go to Kuda and it will recompile everything, the model to have the TBM model with GPO. Now we can reuse the same instruction. Around 23 milliseconds. Alright now to plot them all together, I have a plotting function and you can see here the original time. The pie chart CPU time is about twice as long for this particular model. Your mileage may vary, sometimes faster, sometimes a bit slower. Then we see Pytorch. GPU is 0.0484 seconds and comparing that with TVM GPU, TPM is faster for both GPU and significantly so with CPO.

I encourage you to try out some of our notebooks and try them with your own models to see what results you get. Here’s our get hub, it’s get hub.com/microsoft/hummingbird. If you go there, you can check out some of our notebook, similar to the examples that I just ran. Now, we also welcome users to submit their new notebooks and what models they find interesting. Also on our website, there’s a little bit of an introduction, explanation of how hummingbird works, a lot of the stuff that Matteo went over, an installation guide, working on Linux back end windows, several examples, and we have documentation of our API, and you can also join our community on Gitter. We’re happy to have you provide feedback about features that you might want. We also have, if you go to our Wiki, we have a list of the operators that we currently support across scikit-learn and LightGBM, XGBoost, ONNX.ML, and SparkML. I encourage you to file an issue if you have some feature requests or run into some sort of a bug, we’re happy to hear your feedback.

That’s our demo. Thanks for watching. Here’s a few more additional updates that I want to leave you with. So far, we’ve reached around 21,000 poles on PyPI and 2,400 stars on GitHub. Please hit us up. We’ve demoed this at Microsoft Ignite and we’ve integrated part of Hummingbird with the ONNX converter tools. We also have an OSDI paper with a bunch of verified benchmarks and evaluations. Some of our newest features involve Pandas Dataframes, PySparkML support, and TVM support. I want to stress that we are very thrilled how much an open source interaction we’ve gone from community members. We have, of our contributors, all but two of them are outside of Microsoft so we would love for you to file an issue, a bug report, a feature request, any sort of that thing. Please reach out. I’ll leave you with our email address and thank you very much for listening.

Karla Saur is a Senior Research Software Development Engineer in the Gray Systems Lab (GSL) atMicrosoft. She finished her PhD in Computer Science at the University of Maryland, College Park in2015. Af...

Read more

Matteo Interlandi is a Senior Scientist in the Gray Systems Lab (GSL) at Microsoft, working on scalable Machine Learning systems. Before Microsoft, he was a Postdoctoral Scholar in the CS Department a...

Read more