This session, by the co-creator of the PyTorch framework, Soumith Chintala, will explore the evolution of machine learning frameworks through the eyes of three personas, defined as prod, modeler, and compiler.
Soumith will explain the software stack for machine learning before and after deep learning frameworks became popular. He also makes some predictions about the future generation of machine learning frameworks.
The session discusses how recent framework evolutions have heavily favored the modeler at the expense of the prod and compiler personas, and why the modeler has had the leverage necessary to make this happen.
Speaker 1: Hi, I’m Soumith. I work on machine learning frameworks. I started the PyTorch project at Facebook, among other things. I’m going to talk today about machine learning frameworks and how they’ve evolved along certain dimensions of interest. And within this framework that you can think about, I’m going to talk about how they will continue evolving going forward. I talk about the future as a distribution of things. This talk is 30 minutes, and this is a field you could talk about for days, so obviously the talk is going to be simplified in various ways. Please bear with me on that note. But I still think just talking about a few dimensions here, and talking through machine learning frameworks and their evolution within these dimensions, is going to be pretty useful. Okay. So let’s start. I’m going to introduce three people, three personas.
The first one is MODELER. A MODELER is someone whose job it is to look at the data and assess the quality of the data. They ask, “Hey, do I need more labels?” And then they start doing pre-processing or feature engineering, and then they pick some way to do machine learning. They build an architecture, and then they encode enough priors into the learning, either by some tweak of the architecture or some regularization scheme. And then they build a training pipeline. They do machine learning to solve some tasks, either of research interest or business interest. And then there is the second person I call PROD. PROD is typically the person who MODELER goes to when they actually want to reliably ship something into some critical part of some task. So reliably ship it to what we generally call production.
So PROD usually tries to make sure you’re able to version your model so that, in case something feels wrong, they can roll back, and that you’re able to version the data that comes in and goes into the models when they’re trained. They also generally make sure that all of the metrics that they monitor are within acceptable ranges, and they make sure that new models that MODELER has given them are within acceptable ranges of performance, to keep costs or power down. And they make sure to do that in coordination with the third person I call COMPILER. And what does COMPILER do? COMPILER’s job is to take the models that MODELER has given them, either while they’re still training the models or when they enter production, and map those models as efficiently as possible onto hardware.
That could be server hardware, that could be accelerators, that could be phones, that could be some embedded systems, that could be the Mars Rover, anything. So their job is to squeeze the best performance out of the models, say performance per watt, or performance per second, or performance per dollar. That’s pretty much it. And even though the term is COMPILER, they can even be a hardware implementer. They just build new hardware, someone like Nvidia. Okay. So let’s talk about how the software stack… Don’t forget the personas, but I’m going to just quickly talk a little bit about how the software stack has evolved over time, and that’s kind of important. And then we will actually tie this to the personas. So before deep learning got popular, before 2012, you typically had a software stack that looked somewhat like this, where a lot of focus was on pre-processing, feature engineering, post-processing.
And so you had domain-specific libraries for that. And for the machine learning models themselves, you had a very narrow way to interact with the software packages or libraries that built and trained those machine learning models for you. So if you ever used XGBoost or scikit-learn or Vowpal Wabbit, you give some kind of configuration of what model you’re building, what learning rate or regularizer, or how many trees in the forest, and so on. And once you build that config, you give that to a factory, and along with that you give your data in some preprocessed or clean form. And then the engine, the software engine that implemented a particular machine learning algorithm, just handles the entire stack of the training loop and all the implementation details of the model. And pre-2012, they were mostly mapped to CPUs.
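As an illustration of the config-plus-data workflow described above, here is a minimal Python sketch. It is not the API of any real library: the `fit` function, the `ENGINES` registry, and the config keys are hypothetical names chosen for this example, and the "engine" is a toy linear model trained by gradient descent. The only point is that the user supplies a small readable config and clean data, while the engine owns the entire training loop.

```python
import json

# Hypothetical config: small, human-readable, and the only thing the user writes.
CONFIG_JSON = '{"model": "linear", "learning_rate": 0.1, "l2": 0.0, "epochs": 500}'


def _fit_linear(config, xs, ys):
    """Engine-owned training loop: gradient descent on y = w * x + b (toy example)."""
    w, b = 0.0, 0.0
    lr, l2 = config["learning_rate"], config["l2"]
    n = len(xs)
    for _ in range(config["epochs"]):
        # Gradients of mean squared error, plus an optional L2 penalty on w.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n + 2 * l2 * w
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}


# Hypothetical "factory": maps the model name in the config to an engine.
ENGINES = {"linear": _fit_linear}


def fit(config_json, xs, ys):
    """Build the model named in the config and train it on already-clean data."""
    config = json.loads(config_json)
    return ENGINES[config["model"]](config, xs, ys)


# Clean, preprocessed data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
weights = fit(CONFIG_JSON, xs, ys)
```

In this sketch the trained model is fully described by `CONFIG_JSON` plus the small `weights` dictionary, which is the "config plus numbers" shape the talk attributes to pre-deep-learning tooling.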
So something like XGBoost would specialize a lot for gradient boosted trees to always have the best performance on CPUs, and do all kinds of tricks that are very specialized to boosted trees to make it go faster. And then one thing to recognize here is that the model, in this context, is typically a configuration that is generally small and usually readable by humans, and then a blob of weights that are stored in some blob format, maybe on disk or in memory. So enter deep learning. Late 2012, deep learning got popular. Deep learning is nothing but neural networks, or differentiable learning. And it got popular, and hence came the frameworks that enabled MODELERs and COMPILERs and PROD to practice deep learning. So in the post-deep-learning world, this is how the stack looks.
The stack has a very large API surface in the middle. Mainstream deep learning frameworks like PyTorch or TensorFlow have thousands of functions in their API, and these thousands of functions are strung together by MODELERs to build models. And they can come in all shapes and sizes. Below that you have data structures, typically tensors, say dense tensors or sparse tensors, and within dense tensors you can have layouts of memory that might make computation more or less efficient. And then you have a bunch of hand-optimized functions, typically written by high-performance computing experts, that map these APIs efficiently onto accelerator hardware. In the last few years you have also been seeing COMPILERs pop up. So XLA or TorchScript or TVM are examples of COMPILERs that take whole models described in the APIs of these frameworks.
And they map them more efficiently to hardware than stringing together hand-optimized functions. And lastly, you typically have a distributed transport layer that enables these models to run on multiple devices at once or multiple machines at once. And on top of this API, you have domain-specific libraries that make it easy to train your models within particular domains. For example, you might have computer vision specific pre-processing or functionality that all computer vision people can use together. NLP, audio, they generally come in all flavors and sizes. But you also have high-level frameworks such as fast.ai or Keras or PyTorch Lightning, which try to bring back that pre-deep-learning convenience of quickly describing what you want to do, or quickly fitting your data to your model, instead of implementing everything manually. And then on top, you have PROD tooling such as TFX or TorchServe or various other kinds; SageMaker or Spark AI are starting to have some tooling.
So the general mainstream deep learning frameworks do a full vertical integration across the stack to make things pretty efficient. There are also particular solutions by various parties that only focus on particular parts of the stack, and they interface cleanly with the rest of the stack.
So, one thing to recognize here is that in these post-deep-learning mainstream machine learning frameworks, PyTorch and TensorFlow, a model is described as code, typically code in some language. It’s not a configuration file or a JSON blob anymore. It’s actually complicated code, which can have loops and various structures that you typically find associated with a programming language. And then weights. Weights are the same, they are just blobs, so numbers that are stored somewhere. It wasn’t always like this. That picture didn’t always look like this. Just after deep learning got popular, you had various frameworks: [inaudible], which is the framework that started the revolution, and then Caffe. And then I used to use this framework called EBLearn. They had a much smaller API surface, and they had fewer data structures. They had only hand-optimized functions.
They didn’t have COMPILERs. They typically didn’t have distributed support. They didn’t have much going on, and they didn’t have an ecosystem of domain-specific libraries or utilities on top of them. And in their regime, a model was still described as a config, such as JSON or custom-defined configuration files, and weights. So it was still basically transitioning from that pre-deep-learning world. And that was what was most convenient, and then slowly the… But you actually had contrary examples as well. So Theano, which was actually very much ahead of its time, had models being described as symbolic graphs. And that large API surface basically makes writing the framework really hard. So Theano had a COMPILER, and the COMPILER was really slow, or wasn’t very efficient. And that largely made things very difficult.
And eventually things evolved. There were, I think, tens of frameworks, and they have all evolved down to only two surviving as mainstream frameworks. Those two are PyTorch and TensorFlow, and they both treat a model as code and weights. And one thing to ask ourselves is, why did we enter this model-equals-code regime? Why didn’t we just stay with config files? One of the reasons is basically that MODELERs were pushing the limits of frameworks, and they were implementing ideas that looked more and more like real programs. They had dynamic control flow and dynamic shapes, where basically the shape of the input tensor changes from one iteration to the other, typically seen in object detection or NLP. Or if you looked at, say, GAN training, it was very different from standard image classification or any kind of classification, where typically you just did forward, backward, and then update, and then went to the next iteration: forward, backward, update.
But GAN training changed that loop, which means some internal details of these ML frameworks were no longer compatible with what MODELERs wanted. Self-supervised learning takes that to an even greater extreme. Schemes like BYOL or SimCLR, which recently became popular, have a very complex training regime, and the training loop itself is very involved. So again, the whole field evolved towards the convenience of the MODELER, or the convenience of expressing MODELERs’ ideas. And it did come at a big cost. Both COMPILER and PROD were generally unhappy because their lives got worse. It became harder to write a COMPILER or map efficiently to accelerator hardware if you’re talking about more general programs. And the same with PROD. If a model is a config plus weights, PROD could easily version models and such, but that wasn’t the case anymore with the model becoming code. PROD then had to figure out how to debug models in production and all kinds of nasty issues, and PROD wasn’t happy with this regime either, and still isn’t.
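The change in loop shape described above can be sketched schematically in Python. This is an illustration only, not code from any framework: the function names, the `trace` list, and the `d_steps` parameter are hypothetical, and the steps are recorded as labels rather than computed. It contrasts a fixed forward, backward, update loop with a GAN-style loop that alternates between two networks, which is the kind of structure a fixed engine-owned loop could not express.

```python
def standard_loop(batches, trace):
    """Classic supervised loop: one forward, backward, update per batch."""
    for i, _batch in enumerate(batches):
        trace.append((i, "forward"))
        trace.append((i, "backward"))
        trace.append((i, "update"))


def gan_style_loop(batches, trace, d_steps=2):
    """GAN-style loop (schematic): several discriminator steps, then one generator step.

    The number of discriminator steps per batch (d_steps) is an assumed
    hyperparameter for this sketch; real training recipes vary.
    """
    for i, _batch in enumerate(batches):
        for _ in range(d_steps):
            trace.append((i, "discriminator: forward, backward, update"))
        trace.append((i, "generator: forward, backward, update"))


batches = ["batch0", "batch1"]  # placeholder data

standard_trace = []
standard_loop(batches, standard_trace)

gan_trace = []
gan_style_loop(batches, gan_trace, d_steps=2)
```

Because the second loop is ordinary code, a MODELER can change the alternation schedule freely, which is convenient for them and is also why the loop can no longer be captured by a small config.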
So you can say, “Oh, there are three people,” and yet somehow model equals code stuck. And then the second question you can ask is, why do we have such a large API surface? That’s not where we started, right? Caffe or cuda-convnet typically had a very small API surface. And again, it has to do with the fact that every few months people publish some disruptive new result that involves some new building block or some new training regime that has to be expressed in different terms than a previous mid-level building block. So for the most part, we had these ML frameworks evolve towards very low-level or mid-level building blocks, and a lot of them, to express all the mathematical functions and ideas that MODELERs had. Again, it was because of the convenience of the MODELER.
And it came at a cost: COMPILERs and PROD were even more unhappy. So why did MODELER get so much leverage? If there are three people in this ecosystem, why is MODELER getting so much importance? Why do they have so much leverage? That’s a fairly important question to ask. The reason is that MODELER is largely credited with making progress in the field. AI after 2012 slowly increased in hype to a point where everyone wants AI to do everything in the world, and MODELERs have been credited with trying to keep up with, and move towards, that hyped-up world and making progress. They’ve been the ones who are creating all the new value. So the whole AI, ML, COMPILER, software, whatever stack has been evolving to take care of them.
Soumith Chintala is a Researcher at Facebook AI Research, where he works on high-performance deep learning. Soumith created PyTorch, a deep learning framework that has traction among researchers. Prio...