Big Data for Engineers: Processing and Analysis in
5 Easy Steps Webinar
Want to watch instead of read? Check out the video here.
Hello, everybody. Welcome to today’s webinar. This is a joint presentation between Databricks and MathWorks, and it’s based on the partnership that we’ve created. We’re really excited about sharing what’s been developed and how using these two products together can deliver some compelling and powerful solutions for engineers, and all about how we can make processing and analysis easy in five steps. We’ll go ahead and get into the next slide, and I’ll give you an overview of who our speakers are today. I’m Kevin Clugage. I’m part of the marketing team here at Databricks. I’m joined by Nauman Fakhar, who’s our director of ISV solutions at Databricks, as well as Arvind, who’s our chief solutions architect at MathWorks.
We’re going to introduce this solution that shows how Databricks as a platform can work with your MATLAB code, and how you can configure jobs that push down those big data queries into Delta Lake, which we’ll tell you about. Ultimately, we’re going to cover a couple of use cases: how to analyze massive datasets in the cloud, and how to deploy and run your algorithms on those cloud compute clusters with a very easy and seamless management interface. You’ll get to use your MATLAB toolboxes with Databricks, and Arvind in particular is going to be showing a lot of these features in several demos that we have planned for today’s session. We’ve got time reserved for Q&A at the end, so feel free to put in your questions along the way, and we’ll be tackling those at the end of the webinar.
For those who aren’t familiar with Databricks, I’m just going to give a quick overview. Databricks provides a unified data analytics platform that’s built for data science, data engineering and business analytics, and ultimately helps those data teams accelerate their innovation. We’ve got a global customer base now of over 5,000 customers, as well as hundreds of partners. Some of the things that are well known about Databricks are the open source projects that we are the original creators of. Apache Spark is the world’s largest open source big data project. Since then we’ve also created two new open source projects. Most recently was Delta Lake, an open source project about bringing reliability to data lakes at scale. And MLflow is another open source project that helps manage machine learning life cycles end to end.
All of these open source projects are also part of the Databricks platform that you can use as a cloud-based service. And you’ll see that today during the demo. Now, I’m actually going to turn it over to Nauman Fakhar who’s going to tell you a little bit about the unified data analytics platform itself and what it’s made of. Nauman, over to you.
Databricks Platform: Unify data, analytics, and AI
Thanks, Kevin. I’ll just give you an overview of the overall architecture of the Databricks platform. You want to read the figure above bottom up. When our customers are on the journey to build a cloud native data lake, the very first thing, the way the data lake really starts, is your raw data, which is what you see at the bottom here. Most customers would want to put their raw data right on the native cloud storage of the underlying cloud provider that they’ve standardized on. Namely, if it were AWS, it’d be S3. If it were Azure, it would be ADLS. If it were Google, it would be Google Cloud Storage. And that’s an important thing to remember, because Databricks sits as a data processing, data engineering and machine learning platform on top of your raw data lake, the raw data that sits on, let’s say, S3 or ADLS. We don’t force customers to ingest their data into any kind of proprietary database or data store.
The customers remain in full control of their raw data, and we act as a compute layer right on top of S3 or ADLS or the underlying cloud storage. That’s an important architectural construct to remember, because customers remain in full control of the data. Then, looking at the first layer of the platform, what we call the enterprise cloud service: what we’ve done there is abstract out how a customer consumes the raw compute, storage and network capacity of the underlying cloud. There’s a very easy-to-use UI, and the time to value in terms of getting a large-scale cluster is about three minutes. With Databricks, a customer very quickly gets a cluster to go about their data engineering or machine learning processing needs.
Then in the middle sit the actual engines that drive all the analytics. Spark, as Kevin was mentioning, is the core engine that drives a lot of the data engineering and machine learning workloads. It’s an open source project; we were the original creators of Spark, and we still contribute 75% of the code to the open source project. The version of Spark that’s on our cloud products is, from an implementation perspective, 10 to 100x faster than open source Spark, because we’ve put a lot of engineering and best practices that we’ve learned over time into our flavor of Spark, which is available right on our AWS and Azure cloud platforms. And of course there’s Delta, and I’ll go into much deeper detail in about a minute on Delta as an engine and its importance in the architecture.
What sits on top of these processing engines is what we call the data science workspace. That’s where these personas come together: your data engineers, your machine learning engineers, your data scientists, and your analysts who are more SQL savvy. They all come together in a single unified platform and workspace so they can collaborate and do their big data projects. And speaking of data scientists, we’ve also got support for the popular machine learning frameworks like TensorFlow and scikit-learn, and we provide managed versions of these frameworks which are tightly integrated with our version of Spark. All of this comes together as one single unified managed platform where customers focus on analytics as opposed to figuring out how to become DevOps gurus when it comes to big data.
Delta Lake: Make data ready for analytics
We don’t force customers to become DevOps gurus. We do the heavy lifting of just scaling out the environment, providing a managed environment so customers can focus on analytics.
Going a little bit deeper into the importance of Delta and how you’re going to see it interplay with the demo that Arvind is going to show you folks today: customers had expressed a desire to build cloud native data lakes, and what we saw them struggle with were two key technical issues. One was reliability, and the second was performance. It’s no secret that cloud storage is very economical and can hold a lot of data, but just dumping the data into cloud storage doesn’t mean you have a cloud native data lake. You need an engine and a data format that can actually process that data reliably and in a performant manner.
And what do we mean by reliability? We’ve added a notion of transactions to large scale data engineering. Delta Lake, as an engine and as the underlying data format, which is based on open source parquet, allows you to manipulate data at large scale in an ACID transactional manner. What that means, for example, is if you’re processing a billion rows and something goes wrong at the hundred-millionth row, Delta can automatically detect that and then roll back that transaction, so that a data scientist or an engineer or an analyst doesn’t end up reporting or doing analytics on half-ingested data or dirty, corrupted data. The platform just figures out that things have gone wrong and rolls back automatically. That’s a very difficult engineering problem that customers had to deal with manually prior to Delta.
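To make the rollback idea concrete, here is a toy sketch in Python of all-or-nothing writes. It is not how Delta Lake is implemented (Delta uses a transaction log over parquet files); `TinyTable` and its methods are invented purely to illustrate the behavior described above:

```python
class TinyTable:
    """Toy table with all-or-nothing (atomic) appends, illustrating
    the rollback behavior described for Delta Lake writes."""

    def __init__(self):
        self._rows = []              # committed, reader-visible data only

    def append_all(self, rows):
        staged = []                  # writes are staged, not visible yet
        try:
            for r in rows:
                if r is None:        # simulate a failure mid-ingest
                    raise ValueError("corrupt row encountered")
                staged.append(r)
        except ValueError:
            return False             # roll back: staged rows are discarded
        self._rows.extend(staged)    # commit: all rows become visible at once
        return True

    def count(self):
        return len(self._rows)

t = TinyTable()
t.append_all([1, 2, 3])              # succeeds: 3 rows committed
t.append_all([4, None, 6])           # fails midway: nothing committed
print(t.count())                     # -> 3, readers never see partial data
```

The point is that readers only ever observe fully committed batches, which is the guarantee that spares analysts from reporting on half-ingested data.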
Delta just takes care of that for you. As for how Delta performs: as you go to the cloud, you get to decouple compute from storage. This is a beautiful architecture because you don’t have to scale compute out in the same way as you keep adding more data. However, by doing that, you’ve introduced a network between compute and storage, and there’s a performance hit that you take as a result of that flexibility. Delta takes the sting out of that performance hit by doing a bunch of things automatically underneath the covers. It’s got a smart autonomous cache: it can figure out what is a hot dataset and bring it automatically into local NVMe SSD on the underlying cloud, and it will just automatically do that.
It will automatically compact small files into large files in the background. As you know, big data engines are much better at dealing with large files as opposed to many, many small files. And there are a lot of other optimizations we’ve done where Delta workloads will just come and run at a very large scale without the customer having to do the heavy lifting. That actually allows you to build an enterprise data lake on which you can then do your data science and machine learning workloads with tools like MATLAB, and have the peace of mind that you’re actually working with high quality data. Because it’s garbage in, garbage out: your algorithms are only as good as the quality of the data that you feed into them. Next slide, please.
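The small-file compaction idea can be sketched simply: pack many small files into groups near a target size, then rewrite each group as one large file. The sizes and the 128 MB target below are illustrative assumptions, not Delta's actual policy:

```python
def compact(file_sizes, target=128 * 1024 * 1024):
    """Greedily bin-pack small file sizes (in bytes) into groups whose
    total is at most `target`, mimicking background small-file compaction."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if total + size > target and current:
            groups.append(current)       # close the full group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# 1000 x 1 MB files collapse into 8 compaction groups of up to ~128 MB each
sizes = [1024 * 1024] * 1000
print(len(compact(sizes)))  # -> 8
```

A real compactor would then rewrite each group as a single file and atomically swap the result into the table, which is exactly where the transactional behavior above matters.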
Coming to the joint value of what the integration is bringing to joint customers: if you look at the Databricks platform natively, there are four key programming languages that are an interface into Databricks: Python, R, SQL and Scala. However, when you look at domain specific problems, there are engineers and SMEs who’ve got expertise in the respective domains they operate in. If you’re, for example, a mechanical engineer, a chemical engineer, an industrial engineer or a digital signal processing expert, you want to be able to take advantage of all the power of the new cloud data lake paradigm and apply your domain expertise to very large data, but without having to learn things like Python, Scala and the other programming languages that a horizontal big data platform like Databricks supports.
Distributed big data processing and machine learning with Databricks and MATLAB
And that’s where the power of MATLAB comes in, a platform that those domain experts are very, very familiar with, all the way from the time that they may have been in college or in academic environments to going into industry. If you’re in the IoT space and you’re trying to figure out what the lifetime of an engine is, or if you’re in the automotive space and you’re trying to simulate the drivetrain of a car, you’re going to be using tools like MATLAB and all the toolboxes that you’re familiar with as a domain expert. What the integration and the joint value does is allow you to stay in that environment, the MATLAB environment, and still take advantage of all the large datasets that you may have curated inside Delta Lake, and allow you to do distributed processing on extremely large datasets from that familiar environment, so you don’t have to relearn a new programming paradigm.
You can work within the MATLAB context and bring that domain expertise to the world of distributed big data processing and distributed machine learning. That’s really the joint value of this integration, which Arvind is going to go into in a lot more detail. That’s what we’re bringing today: expanding the personas to the domain experts so that they can take advantage of the power of Databricks.
Diving a little bit deeper into the joint value prop of how these two platforms come together (see figure above): on the Databricks side, what we are really good at is processing data at an extremely large scale, and we’ve built the core engines to make that happen. We have things like our flavor of Spark and Delta Lake. We’re able to process data about 10 to 100x faster than what you will see in open source Spark, at petabyte scale, because we’ve baked in all of our best practices on how we’ve learned to scale Spark workloads on the cloud by supporting thousands of customers.
Our engine is extremely efficient; we’ve rewritten many of the primitives in our version of Spark on the cloud that used to be open source but now are in our proprietary version. As a result, what customers experience is a much, much lower total cost of ownership of doing big data processing in Databricks. It auto-scales and it’s a fully managed service: they don’t have to tune a thousand different Spark parameters to make their workloads scale, the apps just work. We’ve got much better memory management, a much better way of doing joins, for example, and a lot of other engineering that we’ve put in to just make it work without the customer having to do a lot of tuning.
And because it’s a unified collaborative experience, we’re able to bring these personas together, data engineers and data scientists. Now, with the MATLAB integration, we’re able to bring in and integrate the persona of the domain expert, the domain engineer who wants to apply his or her domain expertise to the world of big data. I’ll hand it off to Arvind to speak about how MATLAB works together with Databricks. Arvind, over to you.
Thank you, Nauman. MATLAB and Simulink as a technical computing platform is a very familiar tool in most engineers’ toolboxes. It’s the easiest and most productive computing environment for engineers and scientists, and it’s where they do their best work. A wide array of domain specific toolboxes accelerates development, letting these domain experts work at a level of abstraction that allows them to write less and do more. That, coupled with a rich set of data science tools and applications to process data, build models, and deploy to the cloud, makes it a very attractive environment for engineers and scientists. As for the demo today, a demo is worth a thousand words, and what I’m going to do here is go through a workflow of what the tooling and the integration look like, starting with managing Databricks clusters and the platform, and exploring and investigating a big dataset that’s sitting on cloud storage.
Then going through the actions of data preparation and engineering. This is usually a very painful exercise in real-world problems, and I will try to highlight some of the problems that you would see and how MATLAB would help with that. Then building some algorithms and models, and deploying and pushing them down to run at scale on Databricks Spark. To get started, as a workflow, we will connect from MATLAB to Databricks, handshake with the tool, explore and execute cloud queries, and review and analyze the results in MATLAB interactively, so that you can actually see what’s happening as you develop your analytics. Then we’ll create a Spark Submit job in MATLAB and push it down to execute on a cluster at scale. There are two demo datasets that I’m going to talk about. Both of them come off fleets of vehicles.
Use case: applying analytics on vehicle fleet data
In this particular case, a fleet is a group of capital resources that generate measured data. This could be automotive, things like vehicle engines; it could be manufacturing, energy, agriculture, healthcare. A bunch of the work that we’re doing today has been with the support of a large Midwest auto supplier that is actually working on real problems and doing real engineering and analytics with this platform. Given the sensitive nature of their data and their problem, we can’t really talk about it in a public domain, but what we at MathWorks did is construct our own fleet that collected data, so that we can put it up in a public setting like this. What you’re seeing here is colorized trips across many drivers: colleagues of mine from MathWorks who plugged a little piece of electronics into their cars and were logging data to the cloud.
This could be the AWS or Azure cloud, but essentially they were pushing data up to cloud storage. In this dataset you see a lot of very interesting things. You can see patterns of how people get to the office. This, for example, is where our office is in Novi, Michigan. There’s data from the Bay Area, and data from as far away as Pune in India. This, being a real-world dataset, exhibits all of the problems: I can assure you that none of our drivers riding across the freeway were actually driving through buildings. It contains the sort of messiness of what a real dataset looks like, and gave us a small taste of what it would be like to be in the customer’s shoes.
Customers can apply these analytics on fleet data either as batch processing or as near real time if you’re using Kafka. Today we’ll be talking about batch processing, and how you actually take these little bits of data that build up into large, big data problems, scale it up, and execute analytics and build models against it using Databricks. As a summary of this first of our two datasets: we have about 1,300 trips of different durations, with data from closed track and open road. Some of our colleagues race at Waterford Hills; they race on a closed track, so there are some very interesting datasets of cars moving very much faster than legal open road speeds. While we were collecting, there were about 27 unique vehicles and 30-plus sensor channels, and over a year and a half, over a million and a half data points as the dataset built up.
Demo: MATLAB and Databricks
Watch the demo here, starting at 20:48.
With this, I’ll switch back to a demonstration to show you what this looks like. I do believe a demo is always worth a thousand words. I am going to split my screen: on the left is MATLAB and on the right is Databricks, where I’m signed into the platform. I’m going to keep them side by side. I could maximize this, but it’s good to show you what is happening on the Databricks platform as I go through the demo. To start off, let me start with a full workflow put out as a simple script. The second dataset that I wanted to introduce is an airline delay dataset, familiar to many big data scientists. I put this in as a demonstration because if you wanted to replicate what I’m showing here today, this is a publicly accessible dataset, and there’s a citation for it. Pointing to the data and the dataset, I can take a deployed MATLAB application, specified as a Spark application, that I wish to run on the cloud, specifying where I would like to store the results, a timestamped output folder.
Then feeding in a cluster definition with an appropriate version of the runtime and what kind of hardware I wish to run on, allowing me to set a scale and making it a MATLAB-capable cluster. It takes a line of code to make sure that the stack is capable of executing MATLAB analytics. Then configuring the job and, if necessary, notifications of when the job starts, succeeds or fails, and defining a task, what to do as a part of the job, which is to run my application. I can specify the input location and an output location to store my data. Configuring a job like this allows me to create it, execute it and refresh. If I run the script, and don’t blink now, it takes a few seconds for my MATLAB to communicate with Databricks. Databricks will pick up the new job. If I look at my jobs here and refresh, I’ve spun up my new job and started it; it’ll in turn spin up a cluster. As I refresh this, you’ll see a data engineering cluster up here.
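Behind a job configuration like the one just described sits a single JSON document posted to the Databricks Jobs REST API (`jobs/create` in API 2.0). Here is a hedged Python sketch of assembling such a payload; the runtime version, node type, DBFS paths, and job name are placeholders, and the real MATLAB integration adds further settings (such as enabling the MATLAB Runtime) not shown here:

```python
import json

def build_job_spec(app_path, input_loc, output_loc,
                   spark_version="7.3.x-scala2.12",
                   node_type="i3.xlarge", workers=4, notify=None):
    """Assemble a Jobs API 2.0 `jobs/create` payload that runs a
    deployed application via spark-submit on a fresh job cluster."""
    spec = {
        "name": "matlab-webinar-demo",
        "new_cluster": {                 # job cluster spun up per run
            "spark_version": spark_version,
            "node_type_id": node_type,
            "num_workers": workers,
        },
        "spark_submit_task": {           # what the job actually runs
            "parameters": [app_path, input_loc, output_loc],
        },
    }
    if notify:                           # start/success/failure emails
        spec["email_notifications"] = {
            "on_start": [notify], "on_success": [notify], "on_failure": [notify],
        }
    return json.dumps(spec)

spec = build_job_spec("dbfs:/webinar/app", "dbfs:/webinar/in", "dbfs:/webinar/out")
print(json.loads(spec)["new_cluster"]["num_workers"])  # -> 4
```

The MATLAB client described in the demo would POST a document like this (with a bearer token) and then poll the returned run for status.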
Spinning up the cluster makes it MATLAB capable, launches the MATLAB application on it, runs the application, stores the results, and gives me the ability to poll for status and make sure that my job has completed, send any notifications if necessary and close up. I’ll let this run, but as a quick high-level view, this is what a part of the workflow could look like for data engineering. That’s the what; to describe the how, let’s go a little bit more systematically through what this does. For starters, it’s possible for me to create clusters on the Databricks system in just five lines of code. I’m going to define a cluster. For those of you who are not familiar with MATLAB, MATLAB comes with the toolstrip.
There are several widgets that you can place in the IDE for looking at your current folder, taking a look at the variables that are in the current workspace, and seeing previous commands that you have invoked. As an IDE, it has a graphical engine bolted on so that you can visualize things, and this is an environment that many of our engineers are very familiar with. It comes with the ability for you to edit: you can build your entire software, be it scripts or functions, or you could use features like live scripts to give you a notebook experience. In this particular case, I can either execute a block itself by running the section, or I can paste it into the command window, a kind of shell for our product.
In this case, I’ve created a new cluster; I call it webinar demo. And if I enable the MATLAB runtime, what I can see is a fully configured cluster object here that has been set up with the right Spark environment variables to run MATLAB jobs. Many of the IDE features, such as setting the size of the cluster, can be made very interactive, but it takes only about four or five lines of code for me to create this cluster.
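For comparison, creating an interactive cluster like the webinar demo cluster boils down to a similar payload against the Clusters API (`clusters/create`). The environment variable used below to mark the cluster MATLAB-capable is hypothetical; the real integration configures the MATLAB Runtime through its own mechanism:

```python
import json

def build_cluster_spec(name, workers=2,
                       spark_version="7.3.x-scala2.12",
                       node_type="i3.xlarge", matlab_runtime=True):
    """Assemble a Clusters API 2.0 `clusters/create` payload; the
    env var marking the cluster MATLAB-capable is a stand-in for
    whatever the integration actually sets."""
    spec = {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type,
        "num_workers": workers,
    }
    if matlab_runtime:
        # hypothetical flag: the real integration provisions the
        # MATLAB Runtime on each worker (init scripts / env vars)
        spec["spark_env_vars"] = {"MATLAB_RUNTIME_ENABLED": "1"}
    return json.dumps(spec)

print(json.loads(build_cluster_spec("webinar-demo"))["cluster_name"])  # -> webinar-demo
```

Start, stop, terminate, and delete map to sibling endpoints (`clusters/start`, `clusters/delete`, and so on), which is why the MATLAB client can drive the whole lifecycle from a script.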
When I create this cluster, MATLAB communicates with Databricks, and you can see my cluster spin up. Events on my cluster are visible both from MATLAB and, if necessary, through the web interface; essentially what you’re seeing on the left and the right are in sync with each other. It is possible for me to take a look at a cluster like this, refresh information about it, see what state it’s in, and build my software layers on top of that. Not only can I create these clusters, I can also start, stop, terminate and delete them. In this particular case, I’m going to delete my current cluster.
By doing this, it sends my webinar demo cluster to a terminated state, or if necessary permanently deletes it, taking it off the list of my clusters so it disappears on the left-hand side. With that said, I can define what kind of node types I’m running my algorithms on and which version of the runtime I’m using, and all of that becomes available as properties. The way it works is that it connects to the RESTful endpoints that Databricks exposes, and there’s a client in MATLAB that talks to those, which allows you to control the Databricks platform from a very simple-to-build MATLAB script. For instance, I can create a new token and use it for authentication, and put it into scripts if I need to. I’m obviously exposing this one in a public webinar, but this can dovetail with your security best practices.
For example, if I refresh this, you’ll see the new token that I just created, the sample token here, which I can then revoke. The ability to compose software that dovetails with your security best practices is very much possible. The next bit of the platform is access to the underlying storage. For example, if I take a look at DBFS, this abstracts away what’s on your underlying cloud storage, be it S3 or Azure Blob Storage. The ability to take a look at its contents from within MATLAB allows me to actually look at files, and in this particular case, I see that I created a folder called webinar in which I’ve packed a whole bunch of parquet files that contain my dataset. It’s possible for me to download and upload data; the uploading interface is useful when we actually build our MATLAB application and push it up to make it available to the cluster. As for job operations, you can define these jobs, configure them and push them out to the cluster.
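The DBFS upload interface mentioned above maps to the DBFS REST API; for small files, `dbfs/put` carries the file contents base64-encoded inside the JSON body (larger uploads go through the streaming create/add-block/close calls). A sketch, with the path made up:

```python
import base64
import json

def dbfs_put_payload(path, data: bytes):
    """Build the body for a DBFS API `dbfs/put` call: file contents
    travel base64-encoded inside the JSON request (small files only)."""
    return json.dumps({
        "path": path,
        "contents": base64.b64encode(data).decode("ascii"),
        "overwrite": True,
    })

body = dbfs_put_payload("/webinar/app.zip", b"hello parquet")
print(json.loads(body)["path"])  # -> /webinar/app.zip
```

This is how a deployed MATLAB application archive would end up on DBFS, where the job cluster can pick it up.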
We’ll get into a little bit more detail here as the job is currently running. But essentially, at this point it should be possible for me to connect to the datasets that I have here, so let’s switch to the next part, which is how we explore those datasets. Databricks provides a module called Databricks Connect that gives you the Spark API running locally. Any DAG that’s computed in this environment has a logical representation of it transferred over to the Spark master on the other end, and it runs on the Spark cluster on Databricks. To demonstrate this, what I’ve done is actually created a cluster.
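Conceptually, Databricks Connect defers execution: local API calls only build a logical plan, and an action ships that plan to the remote Spark master. This toy Python class mimics that deferred model (it is not the Databricks Connect API, just an illustration of lazy DAG building):

```python
class LazyDataset:
    """Toy deferred-execution dataset: transformations only record a
    plan (the DAG); nothing runs until an action like count()."""

    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan or []          # logical plan, built lazily

    def filter(self, pred):
        return LazyDataset(self._rows, self._plan + [("filter", pred)])

    def select(self, fn):
        return LazyDataset(self._rows, self._plan + [("map", fn)])

    def count(self):                     # action: "ship" the plan, execute it
        rows = self._rows
        for op, f in self._plan:
            rows = [f(r) for r in rows] if op == "map" else [r for r in rows if f(r)]
        return len(rows)

ds = LazyDataset(range(100)).filter(lambda x: x % 2 == 0).select(lambda x: x * 10)
print(len(ds._plan), ds.count())  # -> 2 50
```

In the real system the plan executes on the remote cluster, which is why each interactive MATLAB query shows up as a Spark job in the Databricks UI.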
The cluster gives me the ability to spin up an environment for the demonstration. Usually this spins up as spot instances on the cloud; you can see the events that are currently happening, and with spot instances you get a very economical and, with automatic recovery, resilient cluster on the backend that can execute the MATLAB analytics. In this particular case, the nodes were lost because the underlying cloud provider reclaimed them, so it’ll go through the process of resizing and bringing the cluster back up. But to give you an example of how this works, we can create a shared Spark context. Once we have the shared Spark context, it gives me the ability to point to my input data location and get a Spark context in MATLAB.
This can be done interactively. I’m waiting for my cluster to start back up because the nodes were lost just a few minutes ago, but essentially once it comes back up, I should be able to show you how to create a Spark dataset, giving you a way of interactively reading data, counting records if necessary, and using the Spark API to take a look at my datasets and slice them. I can inspect my data in the MATLAB language and use queries to slice and dice it, making it available in the MATLAB context. At this point, it’s possible for me to take a particular trip and visualize it in MATLAB. This is a drive cycle of one of my cars going through a start-stop commute. What is very useful is also being able to use the large number of toolboxes to do more specialized analysis on this.
For example, I can very quickly see what the trip looked like as I went through; this is one of the cars going from our office to Farmington High School. Along with that, I can look at events such as when the car stopped, and begin to study things like congestion, being able to see congestion events on a geographic scale. It becomes similarly easy to read and write from Delta tables. Reading and writing from a Delta table gives us a much faster way to read in and out of the underlying cloud storage, which is one of the reasons I’ve added a little bit of instrumentation here. If necessary, parts of that can be pulled into MATLAB and stored locally if you’re doing any kind of forensic work.
While we’re waiting for the cluster to come back up here, I will talk about one of the big problems with doing this kind of data analysis: there are studies that show that a majority of a data scientist’s or data engineer’s job is spent in data preparation, preparing the dataset for model development. Connecting to the data gives us the ability to not only explore the data, but handle the many kinds of dysfunctional or messy data. For example, we don’t know if this is multiple trips or data dropout within a single trip, and there are missing data channels and partially missing data. Now, MATLAB offers a whole bunch of functionality as part of its math, science, and optimization toolbox features for dealing with missing data: for filling in, for extrapolating, for resampling, for retiming, very high-level functions for you to deal with data quality problems.
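As a rough idea of what a function like MATLAB's `fillmissing(x, 'linear')` does for a numeric channel, here is a plain-Python sketch of linear interpolation over interior gaps (edge gaps are left untouched); it is an illustration, not MathWorks' implementation:

```python
def fill_missing_linear(xs):
    """Linearly interpolate interior None gaps in a numeric channel,
    roughly what fillmissing(x, 'linear') does; leading or trailing
    gaps are left as-is."""
    xs = list(xs)
    i = 0
    while i < len(xs):
        if xs[i] is None:
            j = i
            while j < len(xs) and xs[j] is None:
                j += 1                               # find end of the gap
            if i > 0 and j < len(xs):                # interior gap: interpolate
                lo, hi, n = xs[i - 1], xs[j], j - i + 1
                for k in range(i, j):
                    xs[k] = lo + (hi - lo) * (k - i + 1) / n
            i = j
        else:
            i += 1
    return xs

print(fill_missing_linear([10, None, None, 40]))  # -> [10, 20.0, 30.0, 40]
```

Outlier repair (MATLAB's `filloutliers`) follows the same pattern: detect suspect samples, then fill them with interpolated or neighboring values.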
Like in this case, we can fill in the missing data. If you have outliers that look like this, there are built-in functions to fill in the outliers, to give you clean, processed data, and all of this can be packaged into data engineering jobs and run across your entire dataset. In many cases, there is missing GPS data. To give you a very high-level view of this particular dataset: if you look at the missing data in white and the good data in black, you’ll find a whole bunch of channels missing across the dataset. When you see something like this, it gives you an overview of the dataset. Some of this is intentional: something like fuel flow would mean nothing to an electric car. But some of it is due to the way we are actually collecting the data in real-world conditions. Given that our Databricks cluster is back up, I can very quickly show you what that looks like.
For example, I’m going to import my MATLAB Spark libraries, point to the input location and create a Spark context. Once I have the Spark context, I can pull in data, timing a read of all my underlying data, inferring my schema and looking at the header, and loading my input location, which has all the data in parquet files. This takes about 16 seconds to read and count my nearly half a million records. Every one of these operations can be seen as jobs: if I take a look at the job list on my Databricks cluster and refresh it, you’ll notice the jobs begin to land. For example, if I slice my data and bring it into MATLAB, I can take a look at a slice of my data and, if necessary, filter down to a single trip of interest. I’m running this interactively, and I could put this into a script if required, but as I do things like this, on the Databricks end you will see that it’s actually composing Spark jobs.
Job one has become two, three, four. It’s actually building, slicing and dicing, and leveraging the Spark cluster to pull in your data. At this point it’s possible for me to bring up things like a visualization of my dataset. It opened up in a different window, but here’s my drive cycle; I can dock this into my window if I need to, a kilometers-per-hour drive cycle as I actually go through. You get the idea of how data exploration can be a very interactive experience. The last advantage of using something like MATLAB for doing this, and why our engineers like it, is if I were to look at this as a problem and say, “Given X seconds of data, I would like to quantify something that we sense or feel. We’d like to build a model to quantify driving behavior, creating a set of bins: is my car going through a panic-braking or hard-braking event, or is my driver driving like my grandma? Is it an aggressive driver or a very sedate driver?”
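A driving-behavior binning step like the one described might look as follows; the acceleration thresholds and labels are invented for illustration and are not the model used in the webinar:

```python
def label_window(speeds_kph, dt=1.0):
    """Label a window of speed samples by its strongest braking or
    acceleration event; thresholds (in m/s^2) are made-up examples."""
    ms = [v / 3.6 for v in speeds_kph]                   # km/h -> m/s
    accels = [(b - a) / dt for a, b in zip(ms, ms[1:])]  # finite differences
    if min(accels, default=0) < -4.0:
        return "panic-braking"
    if min(accels, default=0) < -2.5:
        return "hard-braking"
    if max(accels, default=0) > 2.5:
        return "aggressive"
    return "sedate"

print(label_window([60, 60, 58, 30]))  # 58 -> 30 km/h in 1 s: panic-braking
```

In practice such a labeling function, once validated interactively, is exactly the kind of code that gets packaged and pushed down to run over every trip in the fleet.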
Our goal is to look at something like this and say, for a given drive cycle, for a given slice of data, can I classify what kind of activity it is? Mostly it’s sedate driving, but there have been events in our datasets of people narrowly missing deer near Boston and whatnot, and that becomes very evident in the dataset. This can be deployed as part of the processing. To show you one of the features of MATLAB that actually enabled us here: as opposed to reading off the parquet files, I can do the same read via Delta Lake, with the ability in MATLAB both to define a Delta location, which is the timestamped table, and write this out. This takes a few seconds.
But what you’ll notice is that the entire dataset, when written out to Delta format, actually performs reads and writes a whole lot faster. What took me something like 16, 17 seconds to read as raw parquet files moves a whole lot faster when I leverage the features of Delta Lake tables. This acceleration of queries becomes much more evident when you’re dealing with much, much larger datasets; big as we think this is, it’s not really as big as the kind of datasets that many of the people attending here have at hand. The initial write may take about 50 seconds to complete, but once it is in Delta Lake format, reading it takes about 2.5 seconds as opposed to 16.
It's almost 10X faster, actually more than that here. Pulling in a dataset that has been pre-labeled allows me to bring it into MATLAB, and once it's there, I can leverage a huge set of applications in MATLAB for domain-specific operations; I bring this in as a variable in the editor. In this particular case, if I were building a model like this, I would start with something like the Classification Learner. The Classification Learner is an app available in MATLAB that allows you to take the given data and start a new session. It gives me the ability to point to the variable I just pulled in from Delta Lake, take a look at which predictors I'd like to use, and specify what part of my dataset I'd like to hold out for validation as opposed to training.
Starting a new session gives me the ability not only to visualize my data but, if necessary, to study it: take a look at, say, how much fuel was being used by speed. Then I can start my machine learning using a whole showcase of machine learning models. Say I pick all the quick-to-train models; if I train them, it goes through the process of very quickly telling me the kind of accuracy I get with different models. Apps like this even help you pick which model structure works for the data at hand. For example, if I pick a different model structure like linear SVM and train it, I get a very quantitative feel for how it performs against my given dataset, allowing me to pick the right kind of model. This can be exported as a model itself, or if you want, you can generate a function, and this becomes the basis of what you take to Databricks and scale out. It gives you a starting point for building your code.
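The pattern the Classification Learner automates, holding out validation data, training several candidate models, and comparing their accuracy, can be sketched in plain Python. This is a toy stdlib-only illustration of the workflow, not MathWorks tooling; the synthetic feature, class names, and the two toy models are all assumptions:

```python
# Toy sketch of the "try several models, compare accuracy, pick one"
# workflow that the Classification Learner app automates.
import random

random.seed(0)
# Synthetic 1-D feature (e.g. mean deceleration) with two classes.
data = [(random.gauss(2, 1), "sedate") for _ in range(200)] + \
       [(random.gauss(8, 1), "aggressive") for _ in range(200)]
random.shuffle(data)
train, valid = data[:300], data[300:]          # hold out 25% for validation

def nearest_centroid(train):
    """Fit per-class means; predict the class with the closest mean."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def majority_class(train):
    """Baseline: always predict the most common training class."""
    counts = {}
    for _, y in train:
        counts[y] = counts.get(y, 0) + 1
    winner = max(counts, key=counts.get)
    return lambda x: winner

def accuracy(model, valid):
    return sum(model(x) == y for x, y in valid) / len(valid)

models = {"nearest centroid": nearest_centroid(train),
          "majority baseline": majority_class(train)}
scores = {name: accuracy(m, valid) for name, m in models.items()}
best = max(scores, key=scores.get)
for name, s in scores.items():
    print(f"{name}: {s:.2f}")
print("best model:", best)   # the separable classes favor the centroid model
```

The point is the shape of the loop, not the models: quantitatively comparing candidates on held-out data is what lets you "pick the right kind of model" before scaling out.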
The last step of this process, once we have our model in place, is to push down the analytics. In this case, I'm going to go back to the airline dataset, which is a hundred times as large. MATLAB offers two sets of semantics for handling these big datasets: one that's familiar to Spark developers, the MATLAB API for Spark, and one that's familiar to MATLAB developers, tall arrays. To illustrate the latter: for the given input, I can abstract my datastore as a tall array and use MATLAB algorithms on it.
This abstracts the underlying intricacies of Spark away from the user; MATLAB does the heavy lifting, building the DAG and running it on the platform. The other semantic is for people who are familiar with Spark: you can certainly get access to the RDD and the Spark context and use all the actions and transformations available in the Spark API, though that involves you knowing Spark. We believe in freedom of choice: you should be able to pick the right approach and technique for the problem at hand and for the skill set of the person using it. We enable both, and both will run on Databricks Spark.
Once you have your algorithm in place, it's possible to compile it into a standalone executable, which gives us a jar file. Using the API allows me to move it up to Databricks, and once it's uploaded, I can run the script that I started this demonstration with: define my cluster, execute it, and see it run. When it's finally done, I can take a look at my results. You should see a timestamped result at 9:23, which was when I started this. It should be possible to get the latest result. Okay, let me just download this manually. You'll notice the Part file here. Oh, it already succeeded.
The Part file here gives me the ability to pull in my results. What I see is the answer to the question I asked of my dataset: who is operating flights in my datasets? This is a total of 120 million rows that took about four minutes to execute at the start of this demo, showing that Delta Air Lines, UA, US, et cetera are the major airlines in the US. Not a very interesting result, and probably intuitive, but it gives you an idea of what the tooling looks like and how to scale up to Databricks Spark.
Scale your data analysis process with MATLAB and Delta Lake
We talked about slicing and dicing the data using Databricks Connect, eventually preparing the model for development, performing the analysis, and deploying and pushing down analytics to run on clusters, analyzing a dataset with about 120 million rows. There is a reference here to the dataset if you want to reproduce this example. The ability to ask both data engineering questions and data science questions via pushed-down analytics gives us a key takeaway: MATLAB brings subject matter experts and data science users to cloud scale. There's a trusted array of MATLAB algorithms and toolboxes available in the cloud, with no need to reimplement or translate them; you can run them at scale on Databricks. And all the data becomes easily and quickly accessible to MATLAB through Delta Lake. You can access multiple file types, including images and other formats, both for labeling and for data analysis and data science in the cloud.
At this point, we would like to open it up for questions. To get started, please write to us at email@example.com, and you can try Databricks for free.
What version of MATLAB is required for the integration with Databricks?
What I showed you here is the current shipping version, R2019b. You can get it running on earlier versions if you have to, but we would recommend R2019b and beyond.
Is this feature part of a MATLAB toolbox or is it included in the base product?
That’s a very good question. I think we’re currently figuring this out, but as it stands right now, we call it a private preview. If you need access to this, we’ll be happy to give it to you. Please contact your account representative and reference this. We will make sure that you are supported.
Is the large data brought to MATLAB in memory?
You can actually run this entirely on the cluster, but if something goes wrong, it would involve reading through tons and tons of log files to understand what went wrong. What I showed you here today is that it's possible to get small forensic slices to make the development of the analytics a whole lot smoother: a nice, effective way of setting breakpoints, stepping through your MATLAB code, and seeing it work, and then scaling it up on the cluster. If you did want to run your experiments against the entire dataset itself, that is very much possible; there's really no reason you couldn't. It's just a lot easier if you have a digestible chunk of data to build the actual analytics around, which you can then scale up. Both approaches are perfectly possible.
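The "small forensic slice" workflow can be sketched in a few lines. This is a hypothetical stdlib-only Python illustration of the idea, not the MATLAB tooling; the file layout and column name are assumptions:

```python
# Sketch of the forensic-slice workflow: develop the analytic against a
# digestible chunk of a big file, then run the same function unchanged
# against the full dataset on the cluster.
import csv
import io
from itertools import islice

def mean_speed(rows):
    """The analytic under development: works on any iterable of row dicts."""
    speeds = [float(r["speed"]) for r in rows]
    return sum(speeds) / len(speeds)

# Stand-in for a large file on cloud storage (hypothetical "speed" column).
big_file = io.StringIO("speed\n" + "\n".join(str(s) for s in range(1000)))

reader = csv.DictReader(big_file)
slice_result = mean_speed(islice(reader, 10))   # develop on the first 10 rows
print("forensic slice mean:", slice_result)     # → forensic slice mean: 4.5
```

Because `mean_speed` only assumes an iterable of rows, the same code that is debugged interactively on the slice can later be handed the full dataset.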
What is the advantage of using MATLAB rather than using Python libraries?
In 2019, there are a whole bunch of choices for data science, but we believe our environment gives a very clean way for domain experts, not necessarily software experts, to get into machine learning and data science. Some of the apps, like the Classification Learner app that I showed you here, allow you to answer questions such as which model works best for the problem at hand; that would involve a whole lot of work experimenting with different models if you were doing it in Python. So we certainly see the apps and the whole ecosystem of domain-specific toolboxes as a differentiator from what you get access to in Python. And in some cases, engineering problems are slightly different beasts: the machines involved, the most expensive I've seen, produce two gigabytes of data a second.
These datasets can be large, and very aggressive analysis may need to be done on them. When it comes to performance, MATLAB can be very much faster; that's a somewhat contentious statement, but in our initial benchmarks of inference on something like AlexNet compared with TensorFlow, it was about 7X faster, and is now only about 2X, because TensorFlow is getting faster. It gives you accelerated performance as well as very clean workflows for doing your data science, machine learning, and AI. There are a whole bunch of other distinguishing features too, but essentially it is a platform that is well trusted by engineers for working with engineering datasets.
We talked a lot about Delta Lake, the question is, does data have to be in Delta Lake to be read into MATLAB?
No, it doesn't have to be in Delta Lake, because Databricks as a platform can act on pretty much any dataset that's on cloud storage. Delta Lake is the preferred data lake format because once you get into very large-scale processing, there are a couple of things you want to be mindful of. One is a columnar, compressed format that can scale out at larger volumes; Delta underneath the covers is basically a bunch of versioned parquet files, so at larger volumes it will scale out far better than, for example, CSV or some other non-analytical format.
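The columnar-format advantage mentioned here can be seen even in a toy stdlib-only example: scanning one field across a row-oriented layout touches every record, while a columnar layout scans one contiguous array. This is just an illustration of the principle that parquet exploits at far larger scale; the timings printed are machine-dependent:

```python
# Toy illustration of why a columnar layout speeds up single-column
# scans -- the effect Delta's compressed parquet files exploit at scale.
import time

n = 200_000
rows = [(i, i * 0.5, "label") for i in range(n)]       # row-oriented
cols = {"id": list(range(n)),                          # column-oriented
        "speed": [i * 0.5 for i in range(n)],
        "tag": ["label"] * n}

t0 = time.perf_counter()
row_sum = sum(r[1] for r in rows)      # must touch every row tuple
t_row = time.perf_counter() - t0

t0 = time.perf_counter()
col_sum = sum(cols["speed"])           # scans one contiguous list
t_col = time.perf_counter() - t0

print(f"row scan {t_row:.4f}s, column scan {t_col:.4f}s")
assert row_sum == col_sum              # same answer, different access pattern
```

Real columnar formats add compression, encoding, and column pruning on top of this, which is where the order-of-magnitude speedups in the demo come from.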
The second is the actual engine. There's the format, and then Delta has an engine piece to it that can act on those parquet files at extremely large scale; it does a lot of metadata [inaudible 00:54:24] underneath the covers to scale out the analytics at runtime. So no, it doesn't have to be in Delta, but what we find is that as customers build out large data lakes, they automatically gravitate towards Delta, because it helps them scale out their analytics much, much faster.
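The "versioned parquet files plus a metadata log" idea can be modeled in miniature. This is a toy stdlib-only sketch of the concept, not how Delta Lake is actually implemented; the class and method names are made up for illustration:

```python
# Toy model of the Delta idea: immutable data files plus an ordered
# transaction log, so readers can reconstruct any version of the table.
class ToyDeltaTable:
    def __init__(self):
        self.files = {}    # file name -> rows (stand-in for parquet files)
        self.log = []      # ordered commits: (action, file name)

    def commit_add(self, name, rows):
        self.files[name] = rows
        self.log.append(("add", name))

    def commit_remove(self, name):
        self.log.append(("remove", name))

    def read(self, version=None):
        """Replay the log up to `version` to find the live files."""
        entries = self.log if version is None else self.log[:version + 1]
        live = set()
        for action, name in entries:
            if action == "add":
                live.add(name)
            else:
                live.discard(name)
        return [row for name in sorted(live) for row in self.files[name]]

t = ToyDeltaTable()
t.commit_add("part-000", [1, 2, 3])     # version 0
t.commit_add("part-001", [4, 5])        # version 1
t.commit_remove("part-000")             # version 2 (e.g. a delete)
print(t.read())             # latest view → [4, 5]
print(t.read(version=1))    # time travel → [1, 2, 3, 4, 5]
```

Because data files are never edited in place, readers only replay the small metadata log, which is what lets the real engine plan queries over huge tables without rescanning them.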
What is the cost involved to include MATLAB in Databricks if we already have a Databricks account?
The execution of the MATLAB analytics that I showed you here runs on the MATLAB Runtime on the Databricks cluster, and that should essentially be available off our website; you could do it today if you wanted to. That puts it on par with the running cost of Python. Now, for building the models and developing the algorithms, there is licensed software, which is MATLAB and its toolboxes. And to touch on one of the other questions: there is a second flagship product of MathWorks called Simulink. Being able to use simulation capabilities in data science is one of the key distinguishing features. It allows us to tackle, for example, classification problems where we don't have failure data: you can simulate the failure data and bring that synthetic data in alongside your real data to strengthen the quality of your models. Being able to work across both MATLAB and Simulink for tackling these kinds of engineering problems is what we're here to talk about.
Can you speak a little bit more about compiling that MATLAB code for deployment on Databricks?
Correct. MATLAB Compiler allows you to take your MATLAB code and compile it to an executable that runs on the freely available MATLAB Runtime. As far as running costs are concerned, that's on par with Python; I would say it certainly won't cost you any more than a Python implementation would. That said, the process of compiling down to an executable is done with MATLAB Compiler and MATLAB Compiler SDK, and those are licensed products from MathWorks.
Can you use this to process data in the Azure cloud using Databricks? In other words, how can you process data that's in the Azure cloud through Databricks?
Yeah, and I think the question might really be: can you use MATLAB with Azure Databricks? If I understand it correctly, yes, absolutely. The integration is designed to be cloud agnostic. Databricks Connect, which is the key piece of how MATLAB talks to Databricks, is an API and integration that works both with our AWS product and with our Azure product. And if the question also implied, hey, I may have data in ADLS, Blob storage, or other data stores in Azure, and would I be able to use the connectors that Databricks provides to process that data? Yes, absolutely, because you have pretty much the entire API at your fingertips as a result of the integration.
How does this environment support large-scale time series data?
Yeah. The MATLAB toolboxes have a whole bunch of features focused on time series data. There are constructs in the language for dealing with timetables if necessary, and a whole bunch of functions for resampling and for working with time series data generally. Because of the richness of semantics you get from the MATLAB language for dealing with time series data, we believe it's a very powerful platform for time series analysis. Both of the examples I showed today were time series based, and there is certainly capability in the MATLAB platform that you can leverage for time series work, all of which will push down to Databricks. Does that answer the question?
Right. And on top of that, the scale-out aspect: if you have that data in Delta, it will serve you well, because the computations get converted into, essentially, data frame operations on Spark. So you get all the scale-out capabilities even with time series data when you combine the two together.
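The resampling operation mentioned above (analogous to what MATLAB's timetable functions provide) can be sketched in a few lines of plain Python. This is a hypothetical stdlib-only illustration; the function name, bucket scheme, and sample data are assumptions:

```python
# Sketch of time series resampling: bucket irregular samples into fixed
# intervals and average within each bucket (the kind of operation that,
# per the discussion above, can be pushed down as data frame operations).
from collections import defaultdict

def resample_mean(samples, bucket_s):
    """samples: (timestamp_seconds, value) pairs -> per-bucket means."""
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[int(t // bucket_s) * bucket_s].append(v)
    return {t0: sum(vs) / len(vs) for t0, vs in sorted(buckets.items())}

# Irregularly sampled sensor values, resampled to 1-second buckets.
samples = [(0.2, 10.0), (0.9, 30.0), (1.1, 50.0), (2.5, 70.0)]
print(resample_mean(samples, 1))   # → {0: 20.0, 1: 50.0, 2: 70.0}
```

A group-by-bucket aggregation like this maps directly onto Spark data frame operations, which is why it scales out well once the data sits in Delta.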
That’s right. There are a number of other questions, we’re unfortunately already past the top of the hour. We’re going to have to pause at this point, but keep in mind there’s a contact email firstname.lastname@example.org. Feel free to reach out and engage the team here if you have questions that weren’t addressed. Thank you very much for participating in today’s webinar. Thank you very much, Nauman and Arvind for presenting and giving a very thorough demo. We look forward to seeing the solution develop over time and look forward to having you join us in future webinars.