Python notebooks are great for communicating data analysis and research, but how do you port the data visualizations between the many available platforms (Jupyter, Databricks, Zeppelin, Colab, …)? And how do you scale your visualizations up using Spark? This talk will address:
Data visualization is the only way most analytics consumers understand data science and big data. It’s challenging to visualize big data, and harder still to make those visualizations work across multiple open platforms. The difficulty doubles when you need to render the 100,000 visualizations required for ML Operations automation and data-driven animations. The subject of this presentation is popular Python visualization packages (Matplotlib, D3.js-based libraries, Bokeh, and high-density visualization packages) and the best ways to integrate them with massive data sets managed by Spark. We will demonstrate common strategies (image, SVG, HTML embed) and the gotchas common to integrating Spark, Jupyter, and non-Jupyter environments. Headless data visualization strategies are used to automate Machine Learning Operations and data-driven animations. A Python notebook will be the center of the demo. The strategies presented are accessible to anyone with passing experience with Python data visualization packages.
– Hello, everyone, this is Douglas Moore. Today, I will be talking about how to make your data visualizations portable and scalable.
Portability is important. Imagine that you have a data visualization on, let’s say, your laptop in Jupyter.
And you’re doing fine: you’re creating your analytics, sending them off to your customers, your boss, and your teammates. But then the world changes. Let’s say your company needs to move its data and analytics infrastructure to the cloud. You need to go on vacation, so you need a backup; you need to automate this somehow and just run it as an overnight job. You need to scale up your visualizations, not from one or two to 10 or 20, but to thousands, tens of thousands of visualizations. You need to store those visualizations along with your data and your analytics for access via an application, or for audit or approval purposes, to prove that your model is valid. These are all important reasons why you need portable data visualizations.

However, portability is hard. Every platform has nuances. There are basic assumptions baked into your Jupyter setup: the compute is local, the Python runs on your CPU, and your data is local to your Jupyter notebook. Some packages have taken advantage of these assumptions to scale up interactivity across a large data set, but when you move to the cloud or another shared platform, those assumptions are broken.

Security is very important. There’s a certain lack of security between your browser and local data, because that’s a trusted environment. But as you move to the cloud, a lot of these packages are going to try to access data outside the browser, and that’s going to get blocked by CORS headers and CSRF protections. There’s also a difference in bandwidth between the compute and the display, and you’re probably dealing with much more data. So all these assumptions have changed, in part because a lot of packages were built and designed around them. A lot of this changes as you move to the cloud or another platform, and that presents a challenge.
So we’re gonna talk about strategies, about how to deal with that.
There’s work going on to unify the back ends. pandas is near universal in Python data science and analytics, and pandas has a pluggable plotting back end; the default for many years has been Matplotlib. But with that unification work, you can now have your plots drawn with Bokeh or HoloViews, and HoloViews in turn can use the Datashader back end. So there’s a lot of change, if you will, in the data visualization landscape. So let’s talk about some strategies. I’m going to walk you through my demo notebook, which I’ll share later.
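Here is a minimal sketch of that back-end unification, assuming pandas 0.25+ (where the `plotting.backend` option was introduced); the `"plotly"` swap is shown only as a comment because it requires that package to be installed:

```python
import pandas as pd

# pandas delegates .plot() calls to a pluggable backend; matplotlib is the default.
df = pd.DataFrame({"x": range(5), "y": [v * v for v in range(5)]})

print(pd.options.plotting.backend)   # "matplotlib" by default

# Swapping the backend is one line; the .plot() API stays the same.
# (Requires the target package, e.g. plotly or hvplot, to be installed.)
# pd.options.plotting.backend = "plotly"
# fig = df.plot(x="x", y="y")        # now rendered by plotly, not matplotlib
```

The point of the unification work is exactly this: your plotting code doesn't change, only the option does.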
So the first one is what I call “Hope, Magic, and Hints.” Again, you have your standard setup. Okay, here we go.
So, with standard matplotlib, you call plt.show() and expect the plot to show up, but again, things are a little different on each platform. With Databricks, you call display() with no arguments, and it will go out, look for that figure object, and display it here in the notebook. Different notebooks have different conventions: I think in Zeppelin it’s z.show(), in Jupyter it’s display(), so on and so forth.
Now there’s the %matplotlib inline magic, which tells matplotlib to draw inside the notebook. This tightly binds it to the IPython object, so we have that here for increased compatibility. Now, the next major strategy I want to talk about is rendering as an image. Almost every package will support rendering as an image. Sometimes it’s simple; sometimes it can get quite involved.
We’ll see that in a little bit.
And the first approach is rendering to an in-memory buffer. I like these because you don’t have to name a temp file, and you don’t have to delete it when you’re done. So here we save the figure into the buffer, and we give it a format. (You should always close your plots.) Now, if we want to actually view it again, we can get those bytes from that buffer, base64-encode them, decode as UTF-8, and embed that into a little bit of HTML. Now we have an inline image. And then we can use some form of displayHTML, which every notebook has, and boom. You can see along the way that in the HTML we added a title. Alright, great.

So for some packages we can render to an in-memory buffer, but sometimes graphics packages, and I found some in that landscape, don’t support BytesIO or StringIO. So what do you do next? You save it as a file. Sometimes you need to store that plot for audit purposes, or you want to embed it in Markdown or HTML. So here we go. There’s a tracking API for machine learning experiments called MLflow, a very popular open source package these days. So here, imagine this code was doing your training and other experiments. Now you want to create some artifacts to prove the worth of your training. You save the figure to a real file path, because that’s what MLflow is going to need when you log the artifact. MLflow is going to log the file path as a key, and then it’s going to take the contents of the image and log that. And now you can get to MLflow and the tracking it’s been doing just by clicking on that link in the notebook, and you can see all the metadata associated with your experiments and artifacts.
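A minimal sketch of that in-memory buffer technique, assuming matplotlib with the headless Agg backend (the `displayHTML` call is Databricks-specific, so it is shown only as a comment):

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")          # headless backend: no display needed on a remote driver
import matplotlib.pyplot as plt

# Render a figure into an in-memory buffer: no temp file to name or clean up.
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])

buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)                 # always close your plots

# Base64-encode the bytes and wrap them in a little HTML, with a title.
b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
html = f'<h3>My plot</h3><img src="data:image/png;base64,{b64}"/>'

# Databricks: displayHTML(html); Jupyter: IPython.display.HTML(html)
```

Because everything lives in `html`, this same string works with whatever display-HTML call the current notebook platform offers.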
And here’s the plot we logged into the tracking API. So that’s pretty cool, to be able to track all your experiments and log all your artifacts; again, that goes to reproducibility and auditability. And sometimes you want that image and you want to embed it with other text and other media, if you will, to describe it. So this is an example of embedding it inside of Markdown. You can embed it in HTML, and so on and so forth.
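A hedged sketch of the save-to-file-then-log pattern: the file name is a hypothetical example, and the MLflow calls only run if mlflow happens to be installed, since log_artifact needs a real file path either way:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# MLflow's log_artifact() wants a real file path, so save to a file
# instead of an in-memory buffer.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])

path = os.path.join(tempfile.gettempdir(), "training_curve.png")  # hypothetical name
fig.savefig(path)
plt.close(fig)

try:
    import mlflow
    with mlflow.start_run():
        mlflow.log_artifact(path)   # the file name becomes the artifact key
except ImportError:
    pass  # mlflow not installed; the saved file can still be embedded in Markdown/HTML
```

The saved file is also exactly what you would reference from Markdown (`![curve](training_curve.png)`) or an HTML `<img>` tag.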
So okay, as long as you set the back end on pandas, you’re good to go. Okay, so now, let’s say you’re doing animations. Animation in matplotlib works closely with the event loop in Jupyter, which is very close to the Python compute. We don’t have that luxury when the notebook is in a web browser: the data center where the Python is running is in Virginia, and I’m in Boston, so that’s not going to quite work here. So what we do is build up our matplotlib animation, so this is our animation object, and now we just convert that to HTML and display it in the browser. And we have this animation. Now, this is nice: any of these techniques that generate HTML, you can embed in a web app.
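A minimal sketch of that convert-the-animation-to-HTML step, assuming matplotlib's headless Agg backend; the sine-wave animation itself is just a stand-in:

```python
import math

import matplotlib
matplotlib.use("Agg")            # no event loop or display on the remote driver
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
line, = ax.plot([], [])
ax.set_xlim(0, 2)
ax.set_ylim(-1.1, 1.1)

def frame(i):
    # Shift a sine wave a little on each frame.
    xs = [x / 50 for x in range(101)]
    line.set_data(xs, [math.sin(2 * math.pi * (x - i / 10)) for x in xs])
    return (line,)

anim = FuncAnimation(fig, frame, frames=5, blit=False)

# Self-contained HTML + JavaScript: no live connection back to Python needed.
html = anim.to_jshtml()          # or anim.to_html5_video() if ffmpeg is available
plt.close(fig)

# Databricks: displayHTML(html); Jupyter: IPython.display.HTML(html)
```

Because the result is one HTML string, the same animation can be dropped into a notebook on any platform or into a web app.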
We’re gonna have 10 partitions in this. We’re gonna create some random numbers, and we’re gonna display that, and we get that here; that’s from the previous run. Let’s see, we’re gonna start this up, only now you’re gonna use that cluster. Anyway, so that’s how you go from your data lake through Spark, from your driver to a pandas DataFrame.
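A sketch of that path, under the assumption of a local SparkSession from pyspark; when pyspark (or a JVM) isn't available, the sketch falls back to building an equivalent pandas DataFrame directly on the driver:

```python
import pandas as pd

try:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    # 10 partitions of random numbers, generated out on the cluster...
    sdf = spark.range(0, 1000, numPartitions=10).selectExpr("id", "rand() AS value")
    # ...then collapsed down to the driver as a pandas DataFrame.
    pdf = sdf.toPandas()
except Exception:
    # Fallback when pyspark/JVM isn't present: same shape, driver-side only.
    import random
    pdf = pd.DataFrame({"id": range(1000),
                        "value": [random.random() for _ in range(1000)]})

# From here, any of the earlier strategies apply: pdf.plot(), render to a
# buffer, base64-embed as HTML, and so on.
```

The key idea is that `toPandas()` is the bridge: Spark handles the distributed data, and the visualization packages only ever see a driver-local pandas DataFrame.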
And from pandas, to your visual. So that’s how you connect all the dots, pun intended. Alright, now, I don’t know about you, but I’m getting tired of looking at this file_html, CDN, displayHTML boilerplate. I just want it to look like it does locally, a lot simpler. To do that, you install custom hooks into the package, into the back end if you will, and customize it or just redo it. So we’re gonna do that with Bokeh. This is our show_doc hook; it takes the Bokeh object, and we do what we’ve always been doing, but we install these hooks, these callbacks, like this. Now, the only one I’ve gotten to work thus far is show_doc. I don’t know what the load hook does, and show_app is for interactive apps, which aren’t supported yet; we haven’t figured that out. So we have this. Again, you do the normal setup, you’d see something like this in Zeppelin, and then you just render, you do a ts.plot().
Again, this is pandas and that unification at work: you can switch out the back end just by specifying the one you want.
And there you go, there’s your time series. So that’s it for that series. Now we’re gonna go on to Altair. Altair is pretty tricky here. So again, we’re gonna render this as a PNG; actually, we’re going to skip PNG and SVG and go straight to PDF. But you have to install Chrome as a headless browser on your driver node, in the data center in Virginia. So you do this: there’s a new package, altair_saver, which helps things out. Of course, restart your Python context. Again, Altair protects you from rendering too much inline data.
The default limit is 5,000 rows, so if you want to exceed that, you have to set max_rows to some number; I know my data set has about 50,000 points. Then you verify the data format: there’s PDF, and SVG, PNG, and so on and so forth. So we skip over those two, and we’re gonna save as a PDF. Again, we call chart.save(target); it looks at the file extension of the target, sees that it’s PDF, and saves it to this location in the file store. Boom, we have that, and it’s available through displayHTML; we’ll get a reference to it.
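A hedged sketch of those two Altair steps (raising the row limit and saving by extension). The whole thing is guarded because it only runs if altair is installed, and the `.pdf` save is left as a comment since it additionally needs altair_saver plus headless Chrome on the driver; the `/dbfs/...` path is a hypothetical example:

```python
try:
    import altair as alt
    import pandas as pd

    # Altair refuses to inline more than 5,000 rows by default; raise the
    # limit explicitly when you know your data is bigger (e.g. ~50k points).
    alt.data_transformers.enable("default", max_rows=50000)

    df = pd.DataFrame({"x": range(10), "y": range(10)})
    chart = alt.Chart(df).mark_line().encode(x="x", y="y")

    # chart.save() dispatches on the file extension; .pdf/.png/.svg need the
    # altair_saver package plus headless Chrome on the driver node.
    # chart.save("/dbfs/FileStore/plots/chart.pdf")   # hypothetical path

    enabled = alt.data_transformers.active
except Exception:
    enabled = None   # altair not installed; nothing to demonstrate
```

Saving to `.json` or `.html` works without any browser on the driver, which is a useful fallback when Chrome can't be installed.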
And now we can download this. Of course, on the back end, we can automate that and email it off to someone.
And there’s the chart. The nice thing with PDFs is that they’re very, very reproducible.
But we don’t have enough time for that, so that’ll be saved, perhaps, for a blog post or something. Now, let me talk about Spark and generating plots within Spark executors.
So to do that, first we’re gonna take the strategy of creating a user-defined function.
Now, you might do this to process trillions of data points spread out across sensor logs that you have in your data lake.
So this function, again, this is for demo purposes,
is a decaying sine function with parameters a and b. We’re gonna create a plot of that, so now we have a function from data to plot. Then we’re gonna create a function that takes that plot, that figure, and converts it to an image, and returns it. Spark now understands the image datatype inside of a DataFrame; it looks like a struct like this. So now we have that whole pipeline from data to figure to image, again, a headless rendering strategy, wrapped up in this function called get_plot. From get_plot and the image schema, we’re gonna create a Spark user-defined function, and we’re gonna use it in Spark SQL as shown here. This is my unit test, the test code and such. So we do that: pass a couple of parameters, four and five, and 500 points, and there’s our decaying sine function. And that was a Spark job that ran. Now we can generate 10, 100,000, a million of these for different functions. Again, this is synthetic, but this could run over all your source data, all your sensor logs if you will, and create individual plots and store those images.

So now we have images inside of a DataFrame, along with data fields we’ve projected from the metadata, and that’s gonna be important. With these plots in the DataFrame, of course, with Spark’s unified API, you can save that data: you can save it to Parquet, save it to Delta, with Delta giving you transactions. So you save this, and now you can go back and query it. You can optimize the Delta table, so you don’t have millions or billions of small image files in cloud storage, which would be slow to store and retrieve. Here, we can retrieve images based on the metadata associated with them, or join them with contextual data. We retrieve just the images we want to perform our function or analysis, or other engineers could come and look up what our computation did. So this is how you scale that plotting out horizontally.
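A hedged sketch of the headless data-to-figure-to-image function. For simplicity it returns raw PNG bytes (BinaryType) rather than Spark's image struct schema, and the decay formula `exp(-x/a)·sin(b·x)` is my stand-in for the talk's demo function; the Spark UDF registration is shown as comments since it assumes a running SparkSession:

```python
import io
import math

import matplotlib
matplotlib.use("Agg")            # headless: executors have no display
import matplotlib.pyplot as plt

def decaying_sine_png(a, b, n=500):
    """Render a decaying sine, exp(-x/a) * sin(b*x), straight to PNG bytes."""
    xs = [10 * i / n for i in range(n)]
    ys = [math.exp(-x / a) * math.sin(b * x) for x in xs]
    fig, ax = plt.subplots()
    ax.plot(xs, ys)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)               # essential inside a UDF: figures leak otherwise
    return buf.getvalue()

# The "four and five, 500 points" unit-test call from the talk:
png = decaying_sine_png(4, 5, 500)

# On Spark, wrap it as a UDF -- hypothetical sketch, assuming a SparkSession:
# from pyspark.sql.types import BinaryType
# spark.udf.register("get_plot", decaying_sine_png, BinaryType())
# spark.sql("SELECT a, b, get_plot(a, b) AS img FROM params")
```

Because each call is self-contained and returns bytes, Spark can fan the rendering out across executors and store the results in a Parquet or Delta table next to their metadata.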
Again, it’s all based on the headless technique. So, in summary, I’ve presented a number of portable data visualization strategies: from creating an image buffer and manipulating it, to writing an image file, to strategies for working with HTML- and JavaScript-based packages like Plotly and Bokeh, all able to run across multiple notebook platforms.
And then also, because this is the Spark and AI Summit, we had to talk about the data lake, how to address all that data across your enterprise, and how to visualize it. We covered hooks, creating your own hooks to make porting so much easier and cleaner, all the way to the headless Chrome browser. Back in the day, there was a headless X Windows server, and one of these packages requires a headless X Windows strategy too. In the future, we’ll get into discussing how to establish an API proxy and enable Altair and other packages like that. And then we shared how to scale out with Spark: scale out horizontally to the visualizations you need across not 10 or 20 data sets or time series, but tens of thousands or hundreds of thousands of time series, like some of our customers have done. And then on the next slide, I’m showing the resources. The slides will be posted on my GitHub account; I’ll cross-post them here, and they’ll be available from the organizers, along with my demo notebook, so you can run through this. And I’m listing references here for your reading pleasure.
I'm passionate about helping customers find value in data analytics and helping the people I work with succeed. I'm a 25+ year data veteran, ranging from embedded systems to massive cloud-based data lakes. My early career interest centered on producing 3D animations of finite-element-modeled elastic waves. Career-wise, I came for the data visualizations and stayed for the computation and data. Past roles have included: Solutions Architect, Data Architect, CTO, Engineer. Current specialties: Big Data Strategy & Architecture, Data Lakes, Streaming, Delta Lake, Spark, and Databricks.