Portable Scalable Data Visualization Techniques for Apache Spark and Python Notebook-based Analytics

Python Notebooks are great for communicating data analysis & research, but how do you port these data visualizations between the many available platforms (Jupyter, Databricks, Zeppelin, Colab, …)? You'll also learn how to scale up your visualizations using Spark. This talk will address:

  • 6-8 strategies to render Matplotlib that generalize well
  • Reviewing the landscape of Python visualization packages and calling out gotchas
  • Headless rendering and how to scale your visualizations from one to 10,000
  • How to create a cool animation
  • Connecting your big data via Spark to these visualizations

Data visualization is the only way most analytics consumers understand data science and big data. It's challenging to visualize big data, and harder still to make this work across multiple open platforms. Double the difficulty when you need to render 100,000 visualizations for ML Operations automation and data-driven animations. Popular Python-based packages (Matplotlib, D3.js-based libraries, Bokeh, and high-density visualization packages) and the best ways to integrate them with massive data sets managed by Spark are the subject of this presentation. We will demonstrate common strategies (image, SVG, HTML embed) and gotchas common to integrating Spark, Jupyter, and non-Jupyter environments. Headless data visualization strategies are used to automate Machine Learning Operations and data-driven animations. A Python notebook will be the center of the demo. The strategies presented are accessible to anyone with passing experience with Python-based data visualization packages.


 


Video Transcript

– Hello, everyone, this is Douglas Moore. Today, I will be talking about how to make your data visualizations portable and scalable.

Portable & Scalable Data Visualization Techniques

Portability is important because, imagine that you have a data vis, let's say, on your laptop in Jupyter.

And you're doing fine: you're creating your analytics, sending it off to your customers and your boss and your teammates. But then the world changes. Let's say your company needs to move its data infrastructure and analytics infrastructure to the cloud. You need to go on vacation, you need a backup. You need to automate this somehow and just run it as an overnight job. You need to scale up your visualizations, not from one or two to 10 or 20, but to thousands, tens of thousands of visualizations. You need to store those visualizations along with your data and your analytics for access via an application. You need to store this for audit purposes or approval purposes, to prove that your model is valid. So these are all important reasons why you need portable data visualizations. However, portability is hard. Every platform has nuances. There are basic assumptions with your Jupyter notebook setup: the compute is local, Python is local to your CPU, and your data is local to your Jupyter notebook. Some of the packages have taken advantage of these assumptions to scale up interactivity across a large data set, but when you move to the cloud or another shared platform, those assumptions are broken. Security is very important. There's a certain lack of security between your browser and your local data, because that's a trusted environment. But as you move to the cloud, a lot of these packages are gonna try to access data outside the browser, and that's gonna get blocked with CORS headers and CSRF-type protections. There's also a difference in bandwidth between the compute and the display, and you're probably dealing with much more data. So all these assumptions have changed, and this matters in part because a lot of packages have been built and designed around certain assumptions. A lot of those change as you move to the cloud or another platform, and that presents a challenge. So we're gonna talk about strategies for how to deal with that.

But before we get into that, let's talk about the Python data visualization landscape. This graphic is produced by Jake VanderPlas. He's also the creator of the Altair package, a declarative package, which is based on Vega-Lite, a simplification of Vega, which in turn is based on D3.js, the very popular, very powerful, lower-level JavaScript library. And then there's a whole family of visualization packages based on JavaScript; Plotly and Bokeh are quite popular. Bokeh itself is kind of a back end, if you will, to HoloViews, a very high-level, makes-difficult-tasks-simple kind of package. That brings us to matplotlib, which is nearly universal, and to Datashader. Datashader is one of these libraries that lets you draw visuals of billions of points, because it has a couple of things going on: very fast parallel processing with Dask under the covers, and it also implements a number of visual summarization techniques, as I like to call them. But again, you end up with images rather than, say, interactivity. It's a trade-off: when you move to a browser at the end of a VPN, you have to give up some of your interactivity and your bandwidth to do that. And then there are a number of other packages; I started out my career with OpenGL, so that's a fun one. So there are a lot of libraries. James Bednar put together a three-part blog, "Why So Many Libraries?", and he's put forth this pyviz.org effort.

They're promoting unification in the back ends. Pandas is near-universal in terms of data science analytics in Python, and pandas has a plotting back end; the default there for many years has been matplotlib. But with that work going on, now you can have your plot drawn with a Bokeh back end or a HoloViews back end, and HoloViews in turn can use the Datashader back end. So again, there's a lot of change, if you will, in the data visualization landscape. So let's talk about some strategies here. I'm gonna walk you through my demo notebook, which I'll share later.
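As a minimal sketch of that pandas back-end unification: the non-default back-end names below are assumptions that the pandas-bokeh and hvplot packages are installed and registered, so check their docs for the exact names.

```python
import pandas as pd

# Pandas >= 0.25 lets you swap the plotting back end behind df.plot().
pd.options.plotting.backend = "matplotlib"      # the long-time default
# pd.options.plotting.backend = "pandas_bokeh"  # draw with Bokeh (pandas-bokeh package)
# pd.options.plotting.backend = "hvplot"        # draw with HoloViews (can use Datashader)

df = pd.DataFrame({"x": range(10), "y": [v * v for v in range(10)]})
df.plot(x="x", y="y")  # same call, rendered by whichever back end is active
```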

So the first one is what I call "Hope, Magic and Hints." Again, you have your standard setup. (Seem to have lost my pointer. Okay, here we go.)

So, with standard matplotlib, you call plt.show() and expect the plot to show up, but again, things are a little different here. With Databricks, you call display() with no arguments and it will go out, look for that figure object, and display it here in the notebook. Different notebooks have different commands; in Zeppelin, I think, it's z.show().

In Jupyter it's display(), and so on and so forth.
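Here is a minimal sketch of those per-platform calls. The Databricks and Zeppelin calls are left as comments because those functions only exist inside their notebook environments, and the behavior noted for them follows the talk.

```python
import matplotlib.pyplot as plt

# Build a simple figure (illustrative data, assumed for this sketch).
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])

# Jupyter / IPython: plt.show(), or %matplotlib inline and let the
# notebook render the figure automatically.
plt.show()

# Databricks (per the talk): display() with no arguments picks up the
# current figure; passing the figure explicitly also works.
# display(fig)

# Zeppelin (per the talk): the z context object exposes z.show().
# z.show(fig)
```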

Now there's %matplotlib inline, which tells matplotlib to draw inside the notebook.

This tightly binds it to the IPython object, so we have that here to increase compatibility. Now, the next major strategy I wanna talk about is rendering as an image. Almost every package will support rendering as an image. Sometimes it's simple, sometimes it can get quite involved.

Strategy: Render as an image to an in-memory buffer

We’ll see that a little bit.

And the first one is rendering to an in-memory buffer. I like these because you don't have to name a temp file and you don't have to delete it when you're done. So here we do it: we save the figure into the buffer and give it a format. And you should always close your plots. Now, if we want to view it again, we can get the bits from that buffer, base64-encode them using UTF-8, and embed that into a little bit of HTML. Now we have an inline image. And then we can use displayHTML; almost every notebook has some form of display-HTML call, and boom. And you can see along the way that, in that HTML, we added a title. Alright, great. So for some packages we can render to an in-memory buffer, but sometimes graphics packages, and I found some in that landscape, don't support BytesIO or StringIO. So what do you do next? You have to save it as a file. Sometimes you need to take that plot and store it for audit purposes, or you wanna embed it in Markdown or HTML. So here we go. There's this tracking API for machine learning experiments called MLflow, a very popular open-source package these days. So here, imagine this code was doing your training and other experiments. Now you wanna create some artifacts to prove the worth of your training. So you save the figure to a real file path, because that's what MLflow is gonna need when you log that artifact. MLflow is gonna log the file path as a key, and then it's gonna take the contents of the image and log that. And now you can get to MLflow and the tracking that it's been doing just by clicking on that link in the notebook, and you can see all the metadata associated with your experiments and artifacts.
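A minimal sketch of the in-memory-buffer technique described at the start of this strategy; the figure, title, and HTML wrapper are assumptions, and displayHTML is the Databricks call (other notebooks have an equivalent).

```python
import base64
import io

import matplotlib.pyplot as plt

# Render a figure into an in-memory buffer rather than a temp file.
fig, ax = plt.subplots()
ax.plot(range(10), [x * x for x in range(10)])

buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
plt.close(fig)  # always close your plots

# Base64-encode the bytes (UTF-8) and embed them as an inline <img> tag.
data = base64.b64encode(buf.getvalue()).decode("utf-8")
html = f"<h3>My plot</h3><img src='data:image/png;base64,{data}'/>"

# Databricks: displayHTML(html); Jupyter: IPython.display.HTML(html).
# displayHTML(html)
```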

And here's the plot that we logged to the tracking API. So that's pretty cool, to be able to track all your experiments and log all your artifacts; again, that goes to reproducibility and auditability. And sometimes you want that image and you wanna embed it with other text and other media, if you will, to describe it. So this is an example of embedding it inside of Markdown; you can embed it in HTML and so on and so forth.
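And here is a hedged sketch of the MLflow artifact logging just described; the run name, the figure, and the /tmp path are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import mlflow

# Imagine the training code ran above; now log a plot as an artifact.
fig, ax = plt.subplots()
ax.plot(range(100), [x ** 0.5 for x in range(100)])

with mlflow.start_run(run_name="viz-demo"):
    path = "/tmp/training_curve.png"
    fig.savefig(path)           # MLflow needs a real file path
    mlflow.log_artifact(path)   # stores the image alongside the run's metadata
plt.close(fig)
```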

So I'm just gonna delete that. Alright, the next strategy I wanna talk about is all these JavaScript-based packages on the left there, the ones that create HTML, JavaScript, and inline JSON. I wanna talk about those for a while. A couple of those are Bokeh and Plotly; we'll go through those two. With Bokeh, we're creating some sample data, some random time series here. With the right packages installed, pandas version 0.25 or later and Bokeh, we just set the plotting back end there. Again, there's always some kind of flag or whatnot. Now, this will force it to render.

And when we do this render, it will create a Bokeh object, but to display within this notebook system, we need HTML. So we use file_html and the CDN object so that all the resources, all the JavaScript, are there, and we call displayHTML. And now we have this here; we went from pandas to a data visualization with 95,000 points. The reason we can get to 95,000 points is that Bokeh, along with Plotly, embeds the data in a binary form. Other packages like Altair, when they embed the data, embed it as JSON, which is much less compact. So here's an example with Plotly. And you can see with this notebook vendor, Databricks, they now have native support for Plotly: just do ts.plot() and boom, you have your visual here.
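Here is a minimal sketch of the file_html / CDN / displayHTML path described above; the random time series and plot title are assumptions, and displayHTML is the Databricks call.

```python
import numpy as np
import pandas as pd
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Sample data: a random-walk time series (illustrative).
ts = pd.Series(np.random.randn(1000).cumsum(),
               index=pd.date_range("2020-01-01", periods=1000, freq="min"))

p = figure(x_axis_type="datetime", title="Random walk")
p.line(ts.index, ts.values)

# Turn the Bokeh object into a self-contained HTML string; the BokehJS
# resources are referenced from Bokeh's CDN, and the data is embedded inline.
html = file_html(p, CDN, "my plot")

# Databricks: displayHTML(html); Jupyter: IPython.display.HTML(html).
# displayHTML(html)
```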

So okay, as long as you set the back end on pandas, you're good to go. Okay, so now let's say you're doing animations. Animations in matplotlib work closely with the event loop in the Jupyter kernel, which is very close to the Python compute there. We don't have that luxury in a notebook that's in a web browser; the data center where the Python is running is in Virginia, and I'm in Boston. So that's not gonna quite work here. So what we do is we build up our matplotlib animation here. This is our animation object. And now we just convert that to HTML, and then we display that in the browser, and we have this animation. Now this is nice: any of these techniques that generate HTML, you can embed in a web app.
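A minimal sketch of converting a matplotlib animation to self-contained HTML; the sine-wave animation itself is an assumption. to_jshtml() avoids needing a live event loop, and to_html5_video() is an alternative if ffmpeg is available on the driver.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation

# Build a simple animated sine wave.
fig, ax = plt.subplots()
x = np.linspace(0, 2 * np.pi, 200)
(line,) = ax.plot(x, np.sin(x))

def update(frame):
    line.set_ydata(np.sin(x + frame / 10))
    return (line,)

anim = FuncAnimation(fig, update, frames=100, interval=50)

# Self-contained HTML + JavaScript player; no connection back to the kernel.
html = anim.to_jshtml()
plt.close(fig)

# displayHTML(html)   # Databricks; or IPython.display.HTML(html) in Jupyter
```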

If you want to dress up your animation, or sorry, your plot, with additional HTML, you can do that. Now, since this is a Spark and AI conference, we should talk about data, lots of data, and Spark. So imagine that you have your whole data lake and you wanna analyze it. Spark is a great tool to look at your entire data lake, at structured, unstructured, and binary data and so forth, and to use a lot of Spark workers and Spark SQL or PySpark to filter the data that you wanna look at, and then summarize the data, take the aggregates. And then to complete this pipeline, you have to bring that summary data, the smaller data, back. You can't bring back your entire data lake, but you can bring the summarized and filtered data into your driver, and then you run it through a data visualization package, and ultimately, if you're gonna render in a browser, it has to produce HTML, JavaScript, and so on, boom. So you have a huge amount of data reduction, because on the back end your data lake is way larger than what you have on your laptop, and much faster. So the main strategy here is to reduce your data lake to a manageable number of data points that your users can actually visualize and interact with, and to filter early and often. Again, you don't wanna bring back too much and overload your driver or overload your browser. So here we go again with Bokeh, doing this stuff with the same setup. The thing is, this pandas data frame, we're gonna fill it with data from Spark and its workers.

We're gonna have 10 partitions in this. We're gonna create some random numbers. And we're gonna display that, and we get that here; that's from the previous run. Let's see, we're gonna start this up; only now we're gonna use that cluster. Anyway, that's how you go from your data lake, through Spark, through your driver, to a pandas data frame.
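A hedged sketch of that reduce-then-visualize pipeline: generate a 10-partition synthetic dataset on the cluster, aggregate it down with Spark, then pull the small result back with toPandas() and plot it. The dataset, bucket size, and column names are assumptions.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Synthetic "big" data: 1M rows of random values across 10 partitions.
big_df = (spark.range(0, 1_000_000, numPartitions=10)
          .withColumn("value", F.rand(seed=42)))

# Filter and summarize on the cluster first; only bring back what a
# human can actually look at.
summary = (big_df
           .withColumn("bucket", (F.col("id") / 10_000).cast("int"))
           .groupBy("bucket")
           .agg(F.avg("value").alias("avg_value"))
           .orderBy("bucket"))

pdf = summary.toPandas()             # small, summarized data lands on the driver
pdf.plot(x="bucket", y="avg_value")  # hand off to your favorite viz package
```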

And from there, via toPandas, to your visual. So that's how you connect all the dots, pun intended. Alright, now, I don't know about you, but I'm getting tired of looking at this file_html, CDN, displayHTML boilerplate. I just want it to work like it does locally, a lot simpler. To do that, some packages let you install custom hooks into the back end, if you will, and customize it or just redo it. We're gonna do that with Bokeh. This is our show_doc hook; it takes the Bokeh object and does what we've always been doing, but now we install this hook, these callbacks, like this. The only one I've gotten to work thus far is show_doc. I don't know what the load hook does. The show_app hook is for interactive apps, which aren't supported yet; we haven't figured that out. So we have this. Again, you do the normal setup; you'd see something like this in Zeppelin too, and then you just render, you do a ts.plot().
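Here is a hedged sketch of what such a Bokeh hook might look like. The hook signatures follow my reading of bokeh.io.notebook and may differ across Bokeh versions, "databricks" is an arbitrary notebook-type name chosen for this sketch, and displayHTML is the Databricks notebook function.

```python
from bokeh.embed import file_html
from bokeh.io import output_notebook
from bokeh.io.notebook import install_notebook_hook
from bokeh.resources import CDN

def _load(resources, verbose, hide_banner, load_timeout):
    pass  # nothing to pre-load; resources are inlined per plot below

def _show_doc(obj, state, notebook_handle):
    # Reuse the file_html/CDN approach, but hide it behind Bokeh's show().
    displayHTML(file_html(obj, CDN, "plot"))  # noqa: F821  (notebook built-in)

def _show_app(app, state, notebook_url, **kwargs):
    raise NotImplementedError("interactive Bokeh apps are not supported here")

install_notebook_hook("databricks", _load, _show_doc, _show_app, overwrite=True)
output_notebook(notebook_type="databricks")

# From here on, bokeh.io.show(fig) (or a pandas .plot() with a Bokeh back end)
# renders through the custom show_doc hook.
```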

Again, with pandas, what's going on there is that unification: you can switch out the back end by specifying the plotting back end that you want.

There you go, and there's the time series. So that's it for that series. Now we're gonna go on to Altair. Altair is pretty tricky here. So again, we're gonna render this as a PNG; actually, we're gonna skip PNG and SVG and go straight to PDF. But you have to install Chrome as a headless browser on your driver node in the data center in Virginia. So you do this; there's a new package, altair_saver, which helps things out. Of course, restart your Python context. Again, Altair protects you from rendering too much inline data.

The default limit is 5,000 rows, so if you wanna exceed that, you have to set max_rows to some number; I know my dataset is around 50,000 points. And then I wanna verify the data format. So we have PDF, and SVG, PNG, and so on and so forth. We skip over those two and we're gonna save as a PDF. Again, we call chart.save(target); it looks at the target's file extension, sees that it's PDF, and saves it to this location in the FileStore. Boom, we have that, and it's available through displayHTML; we'll get a reference to it.
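A hedged sketch of that Altair export path. The 50,000-point random dataset and the /dbfs/FileStore target path are assumptions, and altair_saver needs a headless Chrome/chromedriver (or a node-based saver) available on the driver to actually render.

```python
import altair as alt
import numpy as np
import pandas as pd
from altair_saver import save

# Raise Altair's inline-data limit above the 5,000-row default.
alt.data_transformers.enable("default", max_rows=50_000)

df = pd.DataFrame({
    "x": np.arange(50_000),
    "y": np.random.randn(50_000).cumsum(),
})
chart = alt.Chart(df).mark_line().encode(x="x", y="y")

# The target's file extension picks the output format (PDF here).
save(chart, "/dbfs/FileStore/altair_chart.pdf")

# In Databricks, FileStore files can then be linked for download, e.g.:
# displayHTML("<a href='/files/altair_chart.pdf'>Download the chart PDF</a>")
```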

And now we can download this. Of course, on the back end, we can automate that and email it off to someone.

And there's the chart. The nice thing with PDFs is that they're very, very reproducible.

So alright, the next one, which we don't have time for in this session, is rendering via a driver proxy. With a driver proxy, you know how I was saying that Altair can do out-of-band data access; it can go directly to your laptop and access CSV or JSON data? To support that in a secure web environment, you need to create a proxy. The proxy runs on the same computer that has access to that data; you save the data there, you point Altair at it, and the JavaScript goes through that proxy. The proxy helps you securely get access to the data and deals with all the headers, the CORS headers and the CSRF-type headers and whatnot.

But we don't have enough time for that, so that'll be saved, perhaps, for a blog post or something. Now, let me talk about Spark and generating plots within Spark executors.

Generate plots within Spark Executors

So to do that, first we're gonna take the strategy of creating a user-defined function.

Now, you might do this to process trillions of data points spread out across sensor logs that you have in your data lake.

So this function, again, this is just for the demo, but it's a decaying sine function with parameters a and b. We're gonna create a plot of that, so now we have a function that goes from data to a matplotlib figure. Then we create a function that takes that figure and converts it to an image, and we return that here. Spark now understands the image data type inside of a data frame; it looks like a struct, like this. So now we have that full pipeline from data to figure to image; again, it's a headless rendering strategy. We have that wrapped up in this function called get_plot. With get_plot and the image schema, we create a user-defined function, a Spark user-defined function, and we use it in Spark SQL as shown here. This is my unit test, if you like, the test code. So we do that, pass a couple of parameters, four and five and 500 points, and we get our decaying sine function here. And that was a Spark job that ran. Now we can generate 10, 100,000, a million of these for different functions; again, this is synthetic, but it could run over all your source data, all your sensor logs, if you will, and create individual plots and store those images. So now we have images inside of a data frame, along with data fields we've projected from the metadata, and that's gonna be important. With these plots in the data frame, of course, with Spark's unified API, you can save that data: save it to Parquet, save it to Delta, and Delta gives you transactions. So you save this, and now you can go back and query it; you can optimize the Delta table. You don't have millions or billions of small image files in cloud storage, which would be slow to store and retrieve. Here, we can retrieve images based on the metadata associated with them, or join them with contextual data. We retrieve just the images that we want for our analysis, or other engineers could come and look up what our computational model produced. So this is how you scale that plotting out horizontally.
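Here is a hedged sketch of headless, per-row plotting in Spark executors. The talk returns Spark's image struct; for simplicity this version returns raw PNG bytes instead, and the decaying-sine formula, parameter values, and table name are illustrative. Matplotlib must be installed on the workers.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed on executors
import matplotlib.pyplot as plt
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

def get_plot(a, b, n=500):
    """Render a decaying sine with parameters a, b and return PNG bytes."""
    x = np.linspace(0, 10, n)
    y = np.exp(-x / a) * np.sin(b * x)
    fig, ax = plt.subplots()
    ax.plot(x, y)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()

get_plot_udf = udf(get_plot, BinaryType())

spark = SparkSession.builder.getOrCreate()
spark.udf.register("get_plot", get_plot, BinaryType())  # usable from Spark SQL too

params = spark.createDataFrame([(4.0, 5.0), (2.0, 3.0)], ["a", "b"])
plots = params.withColumn("png", get_plot_udf("a", "b"))

# Persist images alongside their metadata instead of millions of tiny files.
# plots.write.format("delta").mode("overwrite").saveAsTable("sensor_plots")
```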

Again, it's all based on the headless technique. So in summary, I provided a number of portable data visualization strategies, all the way from creating an image in a buffer and then manipulating that, to rendering to an image file, to strategies for working with HTML-and-JavaScript-type packages like Plotly and Bokeh, and being able to run those across multiple notebook platforms.

Summary: Portable Data Visualization Strategies

And then also, because this is the Spark and AI Summit, we had to talk about the data lake, how to address all that data of your enterprise and how to visualize it. The hooks: creating your own hooks makes porting so much easier and cleaner. All the way to the headless Chrome browser; back in the day, there was a headless X Windows server, and one of these packages requires a headless X Windows strategy too. In the future, we'll get into discussing how to establish an API proxy and enable Altair and other packages like that. And then we shared how to scale out with Spark: scale out horizontally to generate the visualizations needed across not 10 or 20 data sets or time series, but tens of thousands, hundreds of thousands of time series, like some of our customers have done. And then on the next slide, I'm showing the resources. I have my GitHub account; the slides will be posted there, and I'll cross-post them here too. They'll be available from the organizers, along with my demo notebook, so you can run through this. And I'm listing references here for your reading pleasure.


 
About Douglas Moore

Databricks

I'm passionate about helping customers find value in data analytics and helping the people I work with succeed. 25+ year data veteran, ranging from embedded systems to massive cloud-based data lakes. My early career interest centered around producing 3D animations of finite-element-modeled elastic waves. Career-wise, I came for the data visualizations and stayed for the computation and data. Past roles have included: Solutions Architect, Data Architect, CTO, Engineer. Current specialties: Big Data Strategy & Architecture, Data Lakes, Streaming, Delta Lake, Spark, and Databricks.