Notebooks are a great tool for Big Data. They have drastically changed the way scientists and engineers develop and share ideas. However, most world-class Spark products cannot be easily engineered, tested and deployed just by modifying or combining notebooks. Taking a prototype to production with high quality typically involves proper software engineering. The code we develop on such larger-scale projects must be modular, robust, readable, testable, reusable and performant. At Montevideo Labs we have many years of experience helping our clients architect large Spark systems capable of processing data at petabyte scale. In previous Spark Summits, we described how we productionalized an unattended machine learning system in Spark that trains thousands of ML models daily, which are deployed for real-time serving at extremely low latency. In this session, we will share lessons learned taking other Spark products to production at top US tech companies.
Throughout the session we will address the following questions, along with the relevant best practices: How do you make your Spark code readable, debuggable, reusable and testable? How do you architect Spark components for different processing schemes, like batch ETL, low-latency services and model serving? How do you package and deploy Spark applications to the cloud? In particular, we will do a deep dive into how to take advantage of Spark's laziness (and DAG generation) to structure our code according to best software engineering practices without worrying about efficiency penalties. Instead of focusing only on efficiency when structuring our Spark code, we can leverage this 'laziness' to follow the best software patterns and principles and write elegant, testable and highly maintainable code. Moreover, we can encapsulate Spark-specific code in classes and utilities and keep our business rules cleaner. We will support this presentation with live demos that illustrate the concepts introduced.
– Thank you, everyone, for listening to us today. At Montevideo Labs, we work with top US tech companies, helping them build smart data products with a quality focus always in mind. Today we'll talk about how we like to think about production-ready software. When it comes to building such products, notebooks are a really good tool for creating truly reproducible research, and they have drastically changed the way data scientists and engineers collaborate, share ideas and create proofs of concept. However, if your company or product is large enough, or complex enough, chances are notebooks are not enough. We've seen many cases in which a notebook trains a model and deploys it straight to production. At that point, the notebook becomes a software artifact and is subject to all the qualities we expect from software engineering: can we ensure testability, maintainability and reliability? If we want to tweak the data pipeline that creates a model, can we guarantee that integration is smooth? We'll show that through a demo, which we hope you find useful, even though our example is very simplistic.
It's always easier to reason about these concepts with a concrete example at hand. Finally, we'll share some takeaways from this process.
Building smart data products takes collaboration between data scientists and data engineers. However, a success story can typically mean different things to data scientists and to engineers, and oftentimes some of these objectives are in conflict. For example, it's hard to perform quick iteration for a production A/B test if we need to ensure that the models being served are truly thread safe, don't have memory leaks, don't cause garbage collection issues, and so on. At the same time, chances are that the models that perform better in terms of accuracy may require a large memory footprint and are prone to creating other performance problems, typically latency. We strongly believe that culture plays a very important role here: great products can only be built with a true spirit of collaboration in mind.
Many times we're tempted to have data engineers and data scientists do their own thing independently. For example, let's consider the case we mentioned of deploying a model straight to production from a notebook. Even though this is common practice at many companies, it does come with some risks. We need to make sure that the assumptions that hold at training time are stable and continue to be true in the future. Typically, models need to be refreshed, and in many cases, to be successful, we need automated pipelines. When we inject a model as a live black box, we break part of the CI/CD chain, and if dependencies are not explicit and integration-testable before release time, things can go wrong, unfortunately. The advantage of integration tests is lost, and the investment dev organizations have made in continuous integration testing and development is undermined. In the past, we've faced a number of issues with very good models that don't play well as part of a UI. For example, a sales prediction model that shows a drop in sales when expanding to a greater region might be accurate based on the training and testing data, but it might not be correct, or at least intuitive, to the user. Adding a unit test that, for example, checks for isotonic behavior might help catch this issue and also prevent regressions.
Working independently makes it easy to assign blame: the prediction is off, the system is too slow. However, this way we're not really working together towards a common goal; we don't collaborate to explore the right trade-offs between the conflicting aspects I mentioned before. So a fluid dialogue and a collaborative culture help here. Engineers know how the data in our data lake was constructed, which can help data scientists avoid common pitfalls in the interpretation of that data. Likewise, data scientists, typically by doing analysis, find issues with the data that might shed some light on possible upstream bugs.
I had a professor a long time ago who reminded us that in the beginning, bridges, that is, physical bridges, fell down quite frequently. It took time for robust bridges and construction methods to evolve the right way. Likewise, 50 years ago, software had lots of bugs and products were really not great. But the industry evolved: we learned new ways to write code, new abstractions, new processes, rights and wrongs. We simply can't expect that, magically, now that we incorporate data science into our software efforts, things will be different. Let's not make the same mistakes again. In an ideal world, a data scientist is free to choose any library and any programming language, you know, Julia, R, Python, MATLAB, scikit-learn, Deep Learning for Java, whatever; data scientists can choose the exact library that solves the exact problem at hand.
In an ideal world, engineers get models that are thread safe, that don't have any issues with data pipelining, that are fast to predict, don't cause garbage collection problems, and are stateless. These are the ideals for engineers. And Spark is a really good compromise, as it provides solutions to a vast array of data science problems, it takes care of scale, and it allows data scientists and engineers to speak the same language. Plus it comes with many guarantees when it comes to integrating it into our software products. So let's focus now on a demo.
We want to show how to take a notebook and break it apart into modules that allow for proper software qualities. We want to show an example from a book I co-wrote with Saket Mengle. In a section of this book, we present a problem, which will be our notebook, and it is a recommendation problem.
And we provide a solution in Apache Spark
as well as AWS SageMaker. The problem consists of finding theme park attractions to recommend to users based on the photos taken at each attraction. If we assume that someone taking a photo of an attraction is a proxy for interest in that attraction,
Then we can use the data to find other attractions people similar to you might be interested in.
The following slide shows a notebook created from that problem. The notebook massages the data, runs some SQL, creates an ALS (alternating least squares) model, and inspects recommendations for two users. Javier, how can we break apart a notebook like this and build proper software modules? – So, as Maximo said, notebooks are a great tool for research, though they are probably not enough for large products. At Montevideo Labs we believe that high-quality software products should be built using proper software engineering tools, such as Git for version control, or an IDE, and these are not very compatible with notebooks. In this case, we'll create a Scala project in IntelliJ and use Maven to add the Spark dependency to it, which is very easy. We do prefer Scala for our production code, but the concepts that we will present in this demo should apply perfectly well to any other language of your choice, such as Python. The number one key to developing highly maintainable software is to break down your solution into smaller and more cohesive components that can be composed and reused. Each of these components should take care of a very specific task. We suggest that you break up your pipeline into smaller sub-pipelines that can run independently, and then figure out how to synchronize their inputs and their outputs.
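Adding Spark through Maven really is a single dependency block; the fragment below is only a sketch, and the Scala suffix and Spark version shown are assumptions, so pick the ones matching your cluster:

```xml
<!-- Hypothetical pom.xml fragment: artifact suffix and version are examples only -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.12</artifactId>
  <version>3.0.0</version>
  <scope>provided</scope>
</dependency>
```

The `provided` scope is a common choice because it keeps the Spark jars out of your deployable artifact, since the cluster supplies them at runtime.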
For this system, our first high-level component is an ETL, which basically takes care of loading the system's data in a form that is useful for the other components. In this case, we are using Spark DataFrames to hold our loaded data, which we will then use for populating some of the feeds in the system. The good thing about DataFrames is that they are lazy, so we don't actually pay the cost of materializing the intermediate results between decoupled classes and functions. These feeds are a very simple framework we developed, which provides a way for completely decoupled components to communicate very simply. This is done by allowing a producer to put a new update into the feed and a consumer to retrieve the latest update in the feed, just by complying with an implicit contract. We will show you a handy and simple implementation of these feeds that leverages the file system for durable storage. The second component in our solution is the trainer, which uses these data feeds to train any kind of recommendation model that can be used for serving recommendations. We will show you how we train a Spark ALS model like this.
The last, but obviously not least important, component in our system is dedicated to serving recommendations using previously trained models. It can be seen as a service which receives a user ID and returns a collection of recommended attractions for that user, using an underlying model to predict these recommendations. In our case, we decided that our different components should be implemented in different Scala packages, but they could as well be different modules or different repositories, and they could even be implemented in different languages, as long as they comply with the feeds' contracts. We also added a common package which contains definitions and utilities that are shared by many components. The first very important definition that we encapsulated in our common package is everything that has to do with the Spark session; we suggest that you do this so that you have one single place where you are configuring your session. We also included the definition of our environment configurations. Allowing these properties to be configured differently in each environment lets us have one single artifact for our pipeline that defines the behavior of the system, which can then be adapted to run in different contexts by providing a different set of configurations. We included one implementation of this trait that retrieves the values from environment variables, which is very convenient for a staging or a production environment. We also added support for programmatically forcing a configuration, which is very convenient in the case of automated testing. Another very important thing included in our common package is the definition of our feeds framework, which of course will be shared by the different components for their inputs and their outputs. We include a very simple implementation of the feed trait, which relies on the file system for writing and reading the updates that we put into the feeds.
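The environment-configuration idea just described can be sketched roughly as follows; the trait and names here are our own illustration, not the project's actual API:

```scala
// Sketch of an environment-configuration trait, assuming one trait that the
// whole system reads its properties through (names are hypothetical).
trait EnvironmentConfig {
  def get(key: String): Option[String]
  def getOrElse(key: String, default: String): String = get(key).getOrElse(default)
}

// Implementation that retrieves values from environment variables,
// convenient for staging or production environments.
object SystemEnvConfig extends EnvironmentConfig {
  def get(key: String): Option[String] = sys.env.get(key)
}

// Implementation that forces values programmatically, convenient for
// automated tests that must be isolated from the machine they run on.
final class ForcedConfig(values: Map[String, String]) extends EnvironmentConfig {
  def get(key: String): Option[String] = values.get(key)
}
```

A test can then run the whole pipeline with `new ForcedConfig(Map("FEEDS_ROOT" -> "target/feeds", "DEBUG_MODE" -> "true"))`, while production reads the same keys from the environment.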
This way, every time a new update is put into a feed, it is serialized and written to the file system in a way that it can then be read back. The implementation can easily be extended with any kind of serialization we can implement, but we added out-of-the-box support for Parquet and for CSV. Ideally, our implementations of the feeds should support keeping a history of all the updates that were added, even if we will always be reading the latest one; this, of course, improves the auditability of our system. In the case of the file feed, we achieve that by writing each update in a folder inside the feed root, named after the timestamp at the moment of writing. This is how easily we can configure a new feed, in this case the visits feed. The feed is stored in a given path, and we recommend that you include some version identifier in that path to have more flexibility for introducing breaking changes in its contract, without breaking active consumers until they're ready for the new version. Now, let's do a deep dive into our first component, the ETL. In this case, we have defined an ETL pipeline for two different feeds, the attractions and the visits feeds, which are both transformations of raw input data. Of course, for any ETL, we must define how we are reading or extracting the raw data, as well as how we are transforming it. We recommend that you encapsulate each of those two in different independent classes, which will not only make your code much easier to read and understand, but will also improve its extensibility and its testability, which are major concerns as systems grow. The driver of the ETL component is a very simple program. For now, it just delegates to a reader implementation to load the visits and puts them in the visits feed, and, as you can see, then does the exact same thing for the attractions.
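As a rough illustration of this feed idea, here is our own minimal stand-in, storing plain strings where the real implementation serializes DataFrames as Parquet or CSV; the trait and class names are ours, not the real API:

```scala
import java.nio.file.{Files, Path}

// Minimal sketch of the feed contract: producers put updates,
// consumers retrieve the latest one.
trait Feed[T] {
  def put(update: T): Unit
  def latest: Option[T]
}

// File-backed feed that keeps every update in a folder named after a
// timestamp, preserving the full history for auditability.
final class FileFeed(root: Path) extends Feed[String] {
  def put(update: String): Unit = {
    val dir = Files.createDirectories(root.resolve(System.nanoTime().toString))
    Files.write(dir.resolve("data.txt"), update.getBytes("UTF-8"))
  }
  def latest: Option[String] = {
    if (!Files.exists(root)) None
    else {
      // The folder with the highest timestamp holds the latest update.
      var newest: Option[Path] = None
      val it = Files.list(root).iterator()
      while (it.hasNext) {
        val d = it.next()
        if (newest.forall(n => d.getFileName.toString.toLong > n.getFileName.toString.toLong))
          newest = Some(d)
      }
      newest.map(d => new String(Files.readAllBytes(d.resolve("data.txt")), "UTF-8"))
    }
  }
}
```

Including a version in the path, e.g. `new FileFeed(Paths.get("target/feeds/visits/v1"))`, leaves room for breaking contract changes later.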
Another thing that you might notice is that we are using what we call a DataFrame descriptor to describe both loaded DataFrames before storing them. This can be very helpful for debugging and, again, for auditability. The fact that we are encapsulating such descriptions in a special trait is not only for the sake of reusability, but also because it allows us to leverage the strategy design pattern, so as to avoid expensive operations that are not always needed; describing a DataFrame can, of course, be a very expensive operation, while not being very useful in a non-debugging context. As we said, the outputs of the different drivers are stored in the feeds so that they can be retrieved by any independent component. In this case, these feeds will be consumed by the trainer. In our next component, the trainer, we provide one implementation of an attractions recommender trainer for Spark's alternating least squares algorithm. It should be very easy to extend this component to include other algorithms as well, such as the SageMaker version introduced in the notebook. In the case of Spark algorithms, training the model is very easy to implement: we receive the visits data, do some very minor preparation to put it in the format that the algorithm expects, and call fit on a previously configured ALS instance. We recommend that you keep any data processing that is tightly coupled to the algorithm inside the same scope, and when possible, you should also consider using Spark's ML pipelines for this. We also suggest that you keep the configuration of the algorithm in a place that is very easy to find for people who are reading your code.
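The strategy idea above can be sketched with plain Scala collections standing in for DataFrames; all names here are ours, for illustration only:

```scala
// Strategy pattern for dataset description: the expensive describe work sits
// behind an interface, so production code pays nothing when debugging is off.
trait DatasetDescriber {
  def describe(name: String, rows: Seq[Map[String, Any]]): Unit
}

// Debug strategy: prints a summary (stand-in for an expensive DataFrame describe).
object VerboseDescriber extends DatasetDescriber {
  def describe(name: String, rows: Seq[Map[String, Any]]): Unit = {
    val cols = rows.headOption.map(_.keys.mkString(", ")).getOrElse("-")
    println(s"[$name] rows=${rows.size} columns=$cols")
  }
}

// Production strategy: a no-op, so the cost is never incurred.
object SilentDescriber extends DatasetDescriber {
  def describe(name: String, rows: Seq[Map[String, Any]]): Unit = ()
}

object Describers {
  // The strategy is chosen once, from configuration (e.g. a debug flag).
  def forMode(debug: Boolean): DatasetDescriber =
    if (debug) VerboseDescriber else SilentDescriber
}
```

Callers never branch on the debug flag themselves; they just call `describe` on whatever strategy the configuration handed them.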
Another important thing that should be considered for machine learning algorithms, but also for any other kind of processing that you are doing, is to use deterministic seeds for any kind of randomness that you are generating. This way, you will achieve results that are reproducible at any other point in time.
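For example, a fixed seed makes a shuffle, or any sampling step, produce the same result run after run; Spark's ML algorithms, ALS included, similarly expose a seed parameter. A tiny sketch:

```scala
import scala.util.Random

object Reproducible {
  // An arbitrary but fixed seed, checked into the code rather than left to chance.
  val Seed = 42L

  // Any run, on any day, produces the same "random" sample of 3 elements.
  def sample(n: Int): List[Int] =
    new Random(Seed).shuffle((1 to n).toList).take(3)
}
```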
The trainer driver program is very simple as well. It first gets the latest update of the visits feed, which is fed to the algorithm to be used as training data. Here we are also adding a DataFrame descriptor in case we are in debug mode. We then put the model produced by the fitting algorithm into the output model feed, so it can be picked up by any other component in the system. The consumer of this feed, in this case, will be the recommender service, but we could as well have included an intermediate component that validated the model in a controlled environment before shipping it to the service. Another important hint here is that this model is obviously not a DataFrame, but we can still leverage the feeds framework by providing support for writing and reading these models to and from the file system.
We include an implementation of a recommender that uses a previously trained Spark ALS model to perform predictions. As you might know, Spark ML models require that we create a DataFrame, and then Spark will create a prediction for each record in it. In our case, this is not very convenient, as we only want to produce a prediction for one single user, and creating a DataFrame for it is expensive in terms of computational efficiency, which affects the latency of our predictions; it's also not very neat in terms of code. You can watch last year's Summit presentation, in which Maximo and I presented an extension of Spark that we developed for supporting row-level predictions in a Spark-agnostic environment, which can be very convenient when your service has a low-latency requirement. That said, we believe this implementation is good enough for the purpose of this demo, at least.
Okay, so apart from that, we also included a very convenient decoration of the attractions recommender service. This decoration allows us to have different implementations of a recommender coexisting in one same service, in a way that lets us deterministically decide into which funnel a user should fall, and hence which implementation we should use for their recommendations. As you can imagine, this is very helpful for running experiments and things like measurable A/B tests, and can provide a very interesting way of testing any research or experimentation artifact in a live environment. In this implementation, every time that we serve a recommendation, we use a hash of the user ID suffixed with a salt to determine which recommender the user should be exposed to. It is very important that this same hash can be calculated in an offline analysis environment, so it is possible to attribute the different results obtained to each recommender in use. We strongly recommend that you append a salt to the user ID so that you're not always exposing the exact same audience to experiments, but are instead rotating the audience.
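The assignment logic just described might look roughly like this; it is our simplified sketch, and the real service wraps actual recommender implementations rather than plain values:

```scala
// Deterministic, salted funnel assignment: the same (userId, salt) pair always
// lands in the same funnel, so offline analysis can reproduce the split, while
// changing the salt rotates which users fall into each funnel.
object Experiment {
  def funnel(userId: String, salt: String, funnels: Int): Int =
    math.abs((userId + salt).hashCode % funnels)

  // Decorator-style routing between two recommender implementations.
  def pick[A](userId: String, salt: String, control: A, treatment: A): A =
    if (funnel(userId, salt, 2) == 0) control else treatment
}
```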
Apart from this, we also included a very dummy implementation of a recommender, which receives a static mapping from users to recommendations and always serves those. We use all these implementations in our serving driver, where we set up an experiment that will use the previously trained ALS model to serve recommendations to the first half of our audience, and will use the static recommender to just blindly recommend to the other half of our users that they drop anything they are doing and go watch the Spark Summit, because we think that everyone will enjoy it. Of course, we want to validate the performance of each of the recommender implementations in the real world, so we will be performing this A/B test instead of just picking one of the recommenders over the other. We will then print the recommendations in the console.
As you can imagine, of course, our serving layer can be extended and decorated to include ideas as crazy as you can think of.
Okay, so now that we have seen how we implemented each of the different decoupled components, let's see how we can glue all this together and run them in one same program. In this very simple driver, we are running the three drivers in sequence, and hence relying on the feeds framework to take care of the components' inputs and outputs. We could as well have a different kind of program that doesn't use the drivers at all, but just mixes and matches the underlying classes and implementations to run the system using Spark DataFrames instead of feeds. In terms of testing, we don't only structure our code and classes in a way that makes it simple to add valuable unit tests. A very convenient corollary of having the ability to run these decoupled components in one single program is that we can write very complete automated integration tests that run all the components and perform assertions on the expected outcomes. This integration test only asserts that the expected feeds are all populated, but of course it could be extended to perform as many assertions as we want. The fact that we can programmatically force an environment configuration to be used by the system is also very convenient for setting up these integration tests, so they are isolated from any previous ones. In this case, we are using the target directory in our project to store the feeds produced by a test, and this directory is supposed to be cleaned after the testing is done.
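A toy version of that glue-and-assert pattern, with an in-memory feed and trivially fake "components" of our own invention, just to show the shape of such an integration test:

```scala
// In-memory stand-in for a feed, so the whole pipeline can run in one test.
final class InMemoryFeed[T] {
  private var history = List.empty[T]
  def put(update: T): Unit = history = update :: history
  def latest: Option[T] = history.headOption
}

object Pipeline {
  // Fake ETL driver: publishes (user, attraction) visits into its output feed.
  def runEtl(visits: InMemoryFeed[Seq[(String, String)]]): Unit =
    visits.put(Seq("alice" -> "castle", "bob" -> "coaster"))

  // Fake trainer driver: consumes the visits feed, publishes a "model".
  def runTrainer(visits: InMemoryFeed[Seq[(String, String)]],
                 model: InMemoryFeed[Map[String, String]]): Unit =
    model.put(visits.latest.getOrElse(Seq.empty).toMap)

  // Glue driver: runs the drivers in sequence; feeds carry all inputs/outputs.
  def runAll(): Map[String, String] = {
    val visitsFeed = new InMemoryFeed[Seq[(String, String)]]
    val modelFeed  = new InMemoryFeed[Map[String, String]]
    runEtl(visitsFeed)
    runTrainer(visitsFeed, modelFeed)
    modelFeed.latest.getOrElse(Map.empty)
  }
}
```

An integration test then just runs `Pipeline.runAll()` and asserts that every expected feed ended up populated.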
Now, I will show you how we can run this test. You will see that the test is describing the DataFrames in the console, because we set up the environment to have the debug mode enabled. Of course, we can also leverage the debugging tool provided by our IDE to put a breakpoint at any point in our code and evaluate any expression that we find useful. As you probably know, this would be impossible to achieve in a notebook or when running Spark in a cluster, even though it is a very common practice among software developers.
After we are done with all the debugging, we will continue running the test. You will see that the recommender service serves the ALS recommendations for one of our users and the static mapping's recommendations for the other user. You can also see up there that the testing framework accepts this run as a successful one, which is something that we could easily integrate into our continuous integration pipelines to add a very reliable check for our changes. Alright, so we hope that this small example and demo is helpful for you. In any case, by the time of this talk, we will have made the code available in our GitHub account. Please feel free to reach out with any comments or questions that you may have, and we will make sure that we answer them on GitHub. – Thanks, Javier. Through this very simple demo, we hope to have shown that even with a very simple notebook, it does take some effort to take a prototype into a software product. Typically, a notebook reflects the end state of an idea, whereas the corresponding software product is a sort of ongoing process that adds many new features, so we think the initial effort does pay off in the long run.
Creating cohesive modules and reusable artifacts eases maintenance and improves quality by making it easier to test.
Furthermore, these artifacts can now be the starting point of a notebook, thereby making the research process more effective. An example of this can be importing the jar into a notebook to use the data loader component, or calling the serving layer. Additionally, good software products allow for hooks and configuration knobs that let researchers test new ideas in production, such as injected models in A/B tests, with much lower risk. This also increases the ROI of the whole process, as we only promote to production the research ideas that have proven to be successful in controlled live experiments. We want to thank the Spark+AI Summit organizers for inviting us and making this such a great event, despite the truly devastating effects of the pandemic.
Maximo holds a master's degree in computer science/AI from Northeastern University, where he attended as a Fulbright Scholar. As Chief Engineer of Montevideo Labs he leads data science engineering projects for complex systems in large US companies. He is an expert in big data technologies and co-author of the popular book 'Mastering Machine Learning on AWS.' Additionally, Maximo is a computer science professor at the University of Montevideo and is director of its data science for business program.
Javier holds a degree in Computer Science from ORT University (Montevideo) and since 2015 has been working with Montevideo Labs as a Senior Data Engineer for large big data projects. He has helped top tech companies to architect their Spark applications, leading many successful projects from design and implementation to deployment. He is also an advocate of clean code as a central paradigm for development.