Scaling and Unifying SciKit Learn and Apache Spark Pipelines

May 27, 2021 12:10 PM (PT)

Download Slides

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.

Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.

Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

In this session watch:
Raghu Ganti, Principal Research Staff, IBM Research

 

Transcript

Raghu Ganti: Hi, today we’ll be talking about scaling pipelines, specifically scikit-learn style ML pipelines, and combining Spark ML pipelines with scikit-learn pipelines, and using Ray as a platform for scaling and unifying ML pipelines. My name is Raghu Ganti, I’m a principal research staff member at IBM’s TJ Watson Research Center, and there’s work that is done in a broader team across IBM and Red Hat. And here are some of the team members that are working on this project.
Why pipelines? I mean, there is just so much information on pipelines and so many variations of pipelines out there. I’m just listing a very few set of pipelines here, starting with one of the famous things as Apache Airflow, where it has been adopted and has been leveraged in many use-cases. And Kubeflow is more recent, which is targeting ML kind of pipelines.
There is of course, the classic scikit-learn pipelines, there’s is Luigi, Spark has its own version of pipelines, which sort of mirrors scikit-learn pipelines and uses the same kind of terminology, and many such other things.
What is our motivation? Why do we want to do pipelines? Our motivation is really, can we do pipelines on Ray as a platform. And people may be very familiar with Ray in the past, especially given that there were a lot of talks around Ray in the Data and AI summit. And the key innovation from Ray is in terms of the ability to distribute genetic purpose programs and Python functions.
Other question was, given that something like scikit-learn is a pipeline concept that has been adopted by a lot of the ML community, especially data science community. We wanted to explore these ideas of how do we take scikit-learn pipelines and scale it out with Ray as a platform. And there are all sorts of advantages that Ray brings then, especially with that distributed object storage, which many of the other pipelines lack.
When there is a close relationship between Spark and Ray, in terms of sharing data across multiple nodes. And that’s also why we believe that you will find Spark and Ray pipelines is the key in making pipeline scaling successful. If you look up, what is today’s pipeline, especially the scikit-learn pipeline, and it’s a mirror in terms of Spark?
Basically there are two primary components that it brings in, one is this concept called Transform, where you take in an Ray X and then transform into X prime. And then the other is a concept called Fit, where you have X and Y and then you create a fitted model that comes out of it. Given these two constructs, the key question is, how do we sort of bake these constructs? How do we introduce scaling parameters, which fit into the Ray platform and how do we start integrating these two components in Ray and Spark together?
We introduce this notion of what we call us a list of objects. I do want to call out right now that this is a proposal that is in work in progress. We have recently submitted this proposal to Ray community, and there is a lot of discussion around it. I’m happy to share the link at the end of this talk. But I do want to call that out early on so that I can set the right expectations to the audience.
And in terms of scaling, we believe that concept of scaling comes from being able to do a transform or a fit in a chain on multiple data items. You’re not just looking at a single large array data, but you’re able to do it on multiple data items. And this is very common when you are in an enterprise setting, where the same pipeline is being leveraged for multiple data sets.
It could be one specific data set from a specific day, or it could be a data set from a subset of sensors, or subset of selection from a table and so on. But this notion of a list of objects as an input and a list of objects as an output, as well as the list of objects to whether it be the fit or the transform function, is one key parameter that we introduced, which will enable us to scale.
The next key perimeter that we believe is extremely essential. And in fact, also missing from the scikit-learn pipelines is this notion of an and or graph, or an or node and an and node. What does and or graphs introduce to us and how will they help us in scaling? Right? A lot of people may be familiar with this concept of and or graph, it’s a fairly simple concept, and or node is taking a single input and is doing a fan out into multiple steps.
And these are very common in when you’re exploding in an auto ML style pipeline. You want to explore multiple models. Do note that this is different from hyper-parameter exploration, hyper-parameter exploration is more around, “Hey, what are the different parameters that I need to explore in the parameter space, so that I can find for a single model, what is the right fit?”
What we are trying to do is more about, how do we look at multiple models? So you might want to say, “Hey, I want to try out and XGBoost stream. I want to try out a Random Forest.” And all of these are valid options, and the data that feeds into these, each of these individual models is the same dataset. So you want to explore all of these in [inaudible], and that’s the concept of an or node.
The second is a concept of and node, and node add generality. The basic idea behind and node, is that it’s taking input from multiple or nodes and multiple other nodes, and then sort of combining these inputs. You might think of it as like, “Hey, I want to apply a PCA transform across inputs coming from multiple different nodes, or I want to apply something like a Gaussian mixture model or DGM kind of model, where data is coming from multiple nodes and they will combine them and then send back a multiple set of data items.
And note here that each of these Xs that I’m pointing out, In fact, the general list of objects and not just [inaudible] objects. What are the key features here and how do we sort of think about these concepts in Ray? I think that’s a key question. From a Ray standpoint, there’s basically tasks and actors. So far, all of our nodes can be achieved tasks, and we’re basically a unit of compute as a Python function.
We assumed that a Python function, as long as it inherits from the basic transformer, which is part of scikit-learn, we are good to use that function. And because we’re inheriting from a transformer, data scientists would find this extremely intuitive. And we do of course, use the same transformer APIs, and you can look at the base classes and ensure that we are basically inheriting it the right way. Now, the key is this list of objects and the object references.
In the last couple of slides, what we have seen is all these X and Y, basically the key for scaling out is X and Y being part of object references. If you look at Ray as a concept and a platform, Ray’s core parameter are in terms of these tasks and actors, and then being also to be able to explore this distributed object store. All the four objects are passed through the object references, and which also means that sharing of objects becomes the trivial, which in turn leads us to do zero copy object sharing.
What does this mean? If you start digging deeper and this would be very useful for someone who reads the scikit-learn pipeline implementation. The scikit-learn pipeline implementation does a clone of some of these objects, make sure that everything is taken care of, conscious all of these objects and so on. All of these parameters are naturally taken care of, because of the Ray’s platform and the distributed object store.
We leverage that in order to be able to share data in a distributor manner, which is the reason why we can scale any number of tasks that are there. So if I have a pipeline, I want to explore a million different kinds of models. I will be able to just leverage the Ray platform and then be able to scale out on a large cluster, because underlying compute is fully part of the scale.
And the other thing is, we are working with RayDP, which is an open source project coming from part of the Ray community, where RayDP is looking at taking Spark, and using Ray as a backend for Spark. And the ability to take all the Spark data frames, and then put it into object store so that it is shareable. We want to be able to combine the power of Spark, which is in terms of rich data parameters in applying the dataset parameters, RDD parameters, and then being able to do multiple iterations on top of this data parameters.
We can either cross environment, where Spark is running with Ray as a backend, where scikit-learn is running with Ray as a backend, we able to have an efficient data exchange using the distributed objects store and task management and active management is done efficiently using the underlying Ray platform.
What we are focused on is this enrichment of the pipeline parameters themselves, so that the ability to scale on these platforms becomes easy and becomes convenient for the data scientists to adopt and leverage. And that’s that our enrich DAGs are coming in. It’s not a vanilla pipeline that you’re introducing, but we’re introducing the concept of DAGs through the and or graphs.
In the future, we do plan to expand on top of these and or graphs, and start introducing more interesting primitives, like finding semantics, which come from the actor world and exploration that has gone into the spectrum concepts and so on. So we do want to bring some of those key concepts so that we can do scaling of these and or graphs. So far, what is the current state and where are we today?
If you look at it, we have a basic implementation that is in terms of using Ray as a platform, to do scaling in terms of list of objects, as well as some of our or node parameters. And this is an example that we have taken a very well-known scikit-learn example that comes from a [inaudible] competition, where we look at a preprocess pipeline, that explodes three different types of models, which includes Random Forest, Gradient Boost, and decision tree.
If you take a vanilla scikit-learn pipeline, which is shown on the top right corner, we are able to with minimal changes to the cord, improve the speed, 2X. And this is primarily because we’re not getting the full linear scale out in this particular context, because the data is much smaller and we are not looking at large datasets or multiple small datasets.
But it is primarily because of the overhead that’s introduced in serializing these serializing, which is showing up, but given the smaller dataset, the overhead is small enough that we are able to get it to speed up. We do expect that these kinds of things will dramatically increase in their speed. Why are we really interested in this?
There are several scenarios which can benefit and gain dramatic speed-ups. And these could be in terms of say, auto ML kind of pipelines, where a lot of exploration on even smaller datasets tends to happen to be able to identify what is the right set of automatic transforms you want to apply. And how do we do that in a scalable manner, without limiting ourselves to say, traditional mechanisms of scaling or trying to replicate a Python pipeline, or the scikit-learn pipeline using multiple containers, which tend to add a lot of overhead when you have multiple containers running.
That is one aspect, the other aspect is if you start thinking about, in terms of all the language models that are coming up, how do you start doing exploration of these language models at scale? Becomes a very important question. Today, if you look at it, language model exploration is happening only in small pockets. I mean, it takes a very long time to train a new bird model or even adapt the new bird model to a specific task.
And when you look at these unsupervised language models, training the language model is not the end game, the end game is in being able to leverage it in the context of a specific task. You want to adopt these language models for a specific tasks. How do you start thinking about these language model training and scaling out that pipeline in transfer learning, and then the ability to be able to scale this out in the cluster more essentially.
Those are some of our targets of where we want to go ahead and impact. And that is to… Of course this is sort of, comparing where all of these pipelines are and where we are today. If you look at what are the key parameters that we are bringing in, there is task parallelism, which of course everybody targets, whether it be Airflow, scikit-learn and so on. There is the ability to scale that out.
Of course, scikit-learn is one player, one which does not target task parallelism, which is where things like Airflow and Spark superior. When you start looking at data parallelism, I believe it’s only spark that does data parallelism by introducing this notion of RDD and datasets. Where as I think having the notion of Ray, having a distributed object store is something where we are introducing data patterns from.
The other aspect is and or graphs, traditionally, if you look at Airflow and Kubeflow, they do have the notion of and or graphs, because they are very generic pipelines, they’re able to combine and do a lot more energetic tasks. And we are targeting this and or graphs, because we believe that if you provide the genetic component inside something like a transform and fit, which is basically preempt over scikit-learn and Spark pipelines, we will be able to achieve better scale out. And that’s where the and or graphs come in.
From a computational unit standpoint, you can start thinking about Airflow, Kubeflow, and these kinds of pipelines as containers as computational units, and the scaling component is coming from the container scale-out. Where as if you look at say, scikit-learn or Spark pipelines, we are looking at the computational unit as a Python or Java function.
And this is in combination with the and or graphs, is the key difference of what other pipelines are bringing when you compare to the existing pipeline. We had the ML pipelines because like scikit-learn and Spark pipelines, and then you have this more genetic pipeline concepts like Airflow and Kubeflow, so we are basically bringing in sort of the best of both worlds using Ray as a platform for doing that.
The other aspect is given the structure of how we are dealing with pipelines, we also support mutability of that. You would be able to at runtime, add new nodes, and be able to continue with the execution. And again, this goes back to the co parameters that I was talking about, where you have this distributed object store, you have your objects which are catched and kept in memory, and essentially ready and hot to be used for the future.
Essentially so far, whatever has been executed cannot be changed, but you can basically say, “Hey, you know what? XGBoost is performing very well, so let me try to do a variant of XG boots, and add that as a new node and a new edge in the DAG itself.” And that’s something we do intend to support and the design itself supports that particular concept.
Where do we go from here? There are several other ideas that we have been exploring. And specifically, as I mentioned earlier, the proposal is out there in the Ray community. And we are in very close discussions with Ray community to be able to, in fact, even potentially opensource this project and get the current implementation and keep adding into that.
There are other concepts that we want to explore specifically around the execution strategies, because ultimately once you have a DAG, execution strategy of that DAG, in the context of identifying the right pipeline for your ML job, becomes very important. I mean, whether it be a Breadth-First Search execution strategy, or Depth-First Search execution strategy, or maybe you might say, “You know what? Let me explore the those options, which are faster at first and then go with the slower options later on, or some mix of all of these combinations.”
And providing those execution strategies is going to be a key for scaling these pipelines as well. And that’s something which we are exploring right now. All of these topping criteria is extremely essential. If you look at pipelines in general, you might be able to come up with a good measure of saying that, “Hey, by the way, this particular pipeline is only giving me 80% accuracy, and I don’t see it increasing in terms of its accuracy in the future, so let me expose the pipeline and not explore it further.”
Or perhaps you find the another pipeline, let’s say it’s at 95% accuracy. I’m good to go because this is the model that I’m looking for, that’s a baseline accuracy. I don’t need anything beyond that. And then of course, I’ve talked about mutability of execution pipelines. All of these are concepts that we are planning on building on the base parameters that we have so far, and which are all fit into our design.
Happy to have people come and talk about what they think and their concepts are and what they would want from this specific proposal that we are working on at the moment, and look forward to the feedback from the community. With that, I’m ready to take a Q&A, and there’s a broader team that is cutting across IBM and Red Hat, that we are jointly collaborating on this effort.
And we are looking at talking to the Ray community and open sourcing other work as well. I think at this point, your feedback is really important. And we’re looking to any contributions from the open source community, especially on how do we integrate Spark and Ray, and in terms of pipelines? And any feedback on that would be extremely useful to direct the project and further it, so that we can achieve a single unified pipeline across Spark and Ray and scale it out using Ray as a backend.

Raghu Ganti

Raghu Ganti is a Principal Research Staff Member at IBM's T J Watson Research Center. His primary research interests are in the area of AI and ML at scale. He has also received numerous outstanding te...
Read more