Getting Started with Databricks SQL Analytics

May 27, 2021 11:00 AM (PT)


It has long been said that business intelligence needs a relational warehouse, but that view is changing. With the Lakehouse architecture being shouted from the rooftops, Databricks have released SQL Analytics, an alternative workspace for SQL-savvy users to interact with an analytics-tuned cluster. But how does it work? Where do you start? What does a typical Data Analyst’s user journey look like with the tool?

This session will introduce the new workspace and walk through the various key features – how you set up a SQL Endpoint, the query workspace, creating rich dashboards and connecting up BI tools such as Microsoft Power BI.

If you’re truly trying to create a Lakehouse experience that satisfies your SQL-loving Data Analysts, this is a tool you’ll need to be familiar with and include in your design patterns, and this session will set you on the right path.

In this session watch:
Simon Whiteley, Director of Engineering, Advancing Analytics

 

Transcript

Simon Whiteley: Hello, my name is Simon Whiteley, and this is Getting Started with Databricks SQL Analytics. So I run a consultancy called Advancing Analytics, and we help a lot of clients who are taking their traditional warehousing and trying to build it in the cloud, trying to use data lakes, trying to use Databricks to do a lot of the transforming and preparing of the data. But what happens when you try and show it to the business? How do people run those kinds of BI-style queries in that architecture? SQL Analytics can really help with that. So I'm going to run through what the tool is, how to get started with it, and where to go with it once you actually open it up. So stay with me. Okay. So we'll talk a little bit about this lakehouse challenge. The lakehouse is this emerging architecture of saying, "Well, we don't need different platforms. We don't need something just for BI people and something for the lake and something for data scientists. How about we just do it all in one thing?"
I'm going to talk about what that is and what that actually means for what our tools need to do. We'll have a look at SQL Analytics itself, where it came from, and step through some of the key areas. Essentially, when you open it up, what are you going to see? How do you work with these different things? What do these concepts mean, especially if you're coming from a traditional Spark background? Because a lot of it will be familiar, but different.
So let's talk about this lakehouse challenge, this emerging architecture, this idea of saying, "I just want one platform to rule them all, one thing to do everything, but do it well." And what does that mean? Well, this is where we've been for a fair while now: the thing called the modern data warehouse, and it's an architecture that loads and loads of people have built already in the cloud. Now, I work a lot in Microsoft, so this is my Azure architecture, but you could do a similar thing in AWS, the same in GCP, and you could do the same on-prem. Essentially, it's the idea of building out a lake to get all the real benefits of the lack of schema. You can put data in there and not really worry if the data changes, if you've got changes in structure. You can put all sorts of file types in there, put lots of video, film. You can put loads of just interesting data in there, which you can't really do if you're working in a relational environment.
And so many people have bought into the idea of Databricks and Spark, and they're doing their data processing in that lake. So we're getting our ETL written, we're really dynamic, we're scaling out, we're preparing the data and cleaning the data, transforming the data. So all that good ETL, that old-fashioned data engineering, is really, really key and really, really embraced in the lake. A lot of the data science is also happening there, because it's just a great environment to do it in: it supports Python and Scala and R, and you can do it in a scalable way across that same data. So our data engineers and our data scientists are very happy in this architecture. But then, because of this traditional thought that BI has to be in a relational database, people are picking up the data that they've prepared and curated in the lake and then putting it into a relational database.
And then we have our BI developers developing facts and dimensions and Kimball-style analytics inside this separate environment. So essentially, I've got two copies of the same data. And then we're taking that data and we're putting it out to a BI tool, be it Tableau, Looker, Qlik, Power BI, whatever it happens to be, for our report consumers. And they don't really care where the data comes from, honestly. They just want that dashboard to work. So the change that we're trying to work towards is to say, "Well, actually, we're getting better at doing the whole BI thing in the lake." So we've got Delta coming in, Delta being that specialized file format that sits on top of Parquet, which is already heavily column-store compressed and really, really good for analytical-style queries. It gives us things like an audit trail. It gives us schema governance. It gives us the ability to roll back if we make a mistake.
It means we can do things like merge statements, which are huge when you're managing things like slowly changing dimensions. So suddenly we have the ability to engineer all our BI-style data models in the lake. And so we kind of leapt on it and said, "Great, we can just start doing BI entirely from our lake using Delta, serving it through Databricks." And then you start to realize that it's not quite that easy. So our BI developers, who've shifted up to the top, they're now accessing Databricks directly. They're coming from a very mature place. They've been using SQL Server tools, which have been around for decades, which are really, really mature at letting them query SQL and develop SQL. They've got IntelliSense; it finishes their typing for them. It's just a really nicely designed environment for writing SQL. And now they're writing SQL in a notebook, which doesn't quite work.
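(For readers following along, here is a minimal sketch of the kind of Delta merge Simon is describing for a slowly changing dimension. The table and column names are purely illustrative, and a full Type 2 dimension would close out the old row rather than overwrite it in place.)

    -- Illustrative Type 1 merge: overwrite changed attributes, insert new customers.
    MERGE INTO dim_customer AS tgt
    USING stg_customer AS src
      ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN
      UPDATE SET address = src.address,
                 updated_at = current_timestamp()
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, updated_at)
      VALUES (src.customer_id, src.address, current_timestamp());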
Writing SQL in a notebook doesn't quite fit. You can do it; it's just not as nice as where they're coming from. And that's a challenge when you're moving to a new technology. You need to bring people with you. You need to be at least as good as where they're coming from. And the same goes for report consumers: if they're connecting into a traditional Databricks cluster and they've got to piece together the connection string from different parts of an advanced configuration pane, that's not really good enough. People aren't going to do it. So SQL Analytics is there to address this problem. It's trying to say, "Well, actually, how do we work with these people? How do we take our BI developers, who love writing SQL, who are passionate SQL developers, who care about performance and about how our data is modeled, and give them an environment that has type-ahead, has code snippets, has all these interesting things to make them feel at home and to make them productive?"
And they're not going to write in Python or Scala. They're going to write in SQL, not Scala; they don't even want to see Scala. And our report consumers, they just want it to work. It just needs to be seamless. It needs to not seem threatening. It needs to be really easy for them to just pick it up, run with it and work with it. So that's what we're trying to do, and that's what SQL Analytics is trying to address. Essentially, it's saying we need to invite and embrace these different types of users into the environment that's working so well for other people already. Okay. So let's have a talk about SQL Analytics. Where does it actually fit? What does it actually do?
So in essence, it's another new area, a whole new workspace inside Databricks. The traditional Databricks environment you've been opening up is now known as the Data Science and Engineering workspace. And inside there, you've got absolute control of everything. When you design a cluster, you pick the size of VM, how many virtual machines you need, how many workers, how it should scale, how long it should stay on, what kind of initialization scripts you should have. There's a lot of technical config that you can really dig down into the depths of. You can write in Scala or Python or SQL. You've got loads of flexibility. And most of the development is either in notebooks or as spark-submit jobs. Essentially, it's all about software development on top of data. On the new side, in the SQL Analytics workspace, it's really heavily simplified. So designing a cluster is just picking a T-shirt size.
There's no more understanding the VMs and the memory constraints and tailoring it to your workloads. It's just: how big do you want it to be? It's deliberately made really accessible and easy for people who aren't coming from a heavily technical background. It's SQL only, so you can't write Python and you can't write Scala in there. It is literally aimed, laser-focused, just at this SQL community. And so it has things like query builders to help you write SQL, to put snippets in there, to finish the SQL as you're typing it, and then to put it into a lightweight dashboard. So it's designed for that kind of person. And so when you start using it, you'll see these two different areas: am I in Data Science and Engineering, or am I in SQL Analytics?
And you'll have that choice going in there: which option do we want to go with? So the kind of workflow we're going to see is heavily on the engineering side: we're getting the data in, we're preparing the data, and we're surfacing the data as Hive tables. And that is core to how SQL Analytics is going to work. SQL Analytics expects there to already be data registered as Hive tables. I'm expecting to go in there and see a list of databases and tables I've got access to. So the engineering workspace is key for that data preparation: acquiring the data, bringing it into the system, getting it into the right shape so I can then present it over the fence to the SQL Analytics side. And at that point, I'm going to be writing queries on it, shaping it, analyzing it, maybe pushing SQL at it directly from a BI tool like Power BI or Tableau, or building a dashboard. So we've got that separation of the kind of work people are doing. And that's not to say you can't still write SQL on the data engineering side of things; it's just not as geared towards it. It's not as optimized for writing SQL in that environment.
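(A minimal sketch of that hand-over point, with assumed database, table, and path names: on the engineering side, curated Delta files get registered in the Hive metastore so SQL Analytics users can simply browse and query them.)

    -- Run from the data engineering workspace once the curated files exist.
    CREATE DATABASE IF NOT EXISTS sales;

    CREATE TABLE IF NOT EXISTS sales.orders
    USING DELTA
    LOCATION '/mnt/lake/curated/orders';

    -- SQL Analytics users can now see sales.orders in the schema browser and query it.
    SELECT * FROM sales.orders LIMIT 10;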
Okay. So we're going to step through some of the key concepts. And again, it looks fairly familiar; it's actually built to be very recognizable if you know the Databricks workspace. We've got these areas going down the side, which are our different key concepts, and I'll just take a moment to step through them to give you an idea of what to expect. I'm going to skip a little bit down to the bottom to start off with, and we've got this idea of endpoints. Whenever we're in Databricks doing any Spark development, we're expecting a cluster. We're expecting a driver and worker nodes, and we're expecting to design all the actual machines that our work's going to take place on. In SQL Analytics, we've just got an endpoint, deliberately simplified, and you can see that we've got a cluster size of medium.
So it's completely obfuscated. And it's really, really nice to say, "I just want a cluster this big. I want a small. I want a large. I want an extra large," and to remove all the complexity of actually building and designing a cluster. So we need an endpoint. It's still Spark, it's still running as a Spark cluster. We have to have some kind of cluster to run things on, and that in SQL Analytics is called a SQL endpoint. So we'll spin up an endpoint and we'll be able to do some work on top of it. And then we have this idea called a query. Now, a query is a reusable data object. Essentially, I'm writing some SQL and that's going to return some results, and I can use those results in a few different things. Essentially, it's a building block of data. So that can be really simple, just a select star from a particular table.
I can have sub-queries and CTEs; I can have a fairly complicated SQL statement, as long as it returns me some data. And then it can actually cache that data, and I can use it in several different visualizations. So I can have just one dataset, one query that goes to the data, does some calculation, returns a set, and then use that in several different visualizations, so it's more efficient to refresh it. And this is where all of the advancements in terms of SQL writing are. So in that SQL box, we're going to write some SQL. It allows me to type ahead. It completes words, it allows me to browse what Hive objects I've got access to. So it's a nice, easy area to write SQL, save it, and then reuse that cached dataset in various different places.
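(As a flavor of what one of those reusable query objects might contain, here's a sketch of a query with a CTE that several visualizations could share. The table and column names are assumptions, not from the session.)

    -- One saved query, cached once, reused by a bar chart and a trend line.
    WITH monthly_sales AS (
      SELECT product_id,
             date_trunc('month', order_date) AS sales_month,
             SUM(amount)                     AS total_amount
      FROM sales.orders
      GROUP BY product_id, date_trunc('month', order_date)
    )
    SELECT p.category,
           m.sales_month,
           SUM(m.total_amount) AS category_total
    FROM monthly_sales m
    JOIN sales.products p
      ON p.product_id = m.product_id
    GROUP BY p.category, m.sales_month;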
Once we've got a query and we've built some visualizations off it, we can pin them into a dashboard. So we have these collections of all the different visuals that we've built and made. We can have a refresh schedule that'll go off and run each of the individual queries behind it, and we can share it with people. Essentially, this is our way of presenting our various different visualizations. Now, this isn't to say this is going to replace your enterprise BI tool. If you're already using Power BI, Tableau, some of the premium BI offerings, the dashboards in SQL Analytics are a little bit simpler. But if you don't currently have those, or you just need something really lightweight and quick, it's a really nice option to present some good-looking visuals that you can share with other people, who can then go and do their own analysis and dig further into things.
Going on, we've got alerts. Now, this is super useful. So if I've got a dataset, I can trigger various alerts based on it. I can say, with that value, the next time you refresh, if the value is under a certain amount, send me an email. Alert these people if there's an anomaly. So if we only expect a thousand requests a day and we get 10,000, something's gone horribly wrong; let the support team know. We can set various different data alerts up depending on the refresh schedule of our data. So we can trigger various actions based on the things we're looking at inside the data models that we're building. Finally, one of my favorite areas is history. A big piece that's really been missing from traditional Spark is the ability to understand how people are using the data.
What kind of queries are people writing? If people are writing hundreds of different queries on different tables across all of my data, which is the table that's not performing well? Which is the thing that's slowing down the whole of my cluster? We've always had that detail in the Spark UI: you can dig into the depths of the RDDs and see the actual execution plans and understand what's happening at the Spark level. Well, that's really hard for someone who works at the data model level, who works at the SQL level. You need to really understand Spark to understand the performance of the different stages and tasks, and what's actually happening in the Spark engine. So here, we've got a really simple list: this is the actual SQL that they wrote, and this is how it broke down in terms of performance across the different steps and stages.
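(If the history view does surface a table that's dragging, one option Simon mentions in a moment is a Bloom filter index on the Delta table. The sketch below uses assumed table and column names, and this kind of index mainly helps selective, point-lookup style predicates.)

    -- Speculative tuning step: speed up highly selective lookups on sales.orders.
    CREATE BLOOMFILTER INDEX
    ON TABLE sales.orders
    FOR COLUMNS (customer_id OPTIONS (fpp = 0.1, numItems = 10000000));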
So that gives us a really great way of keeping track of who's using our system, how many queries they write a day, and what tables they're hitting. Because then we can give that to our engineers and say, "We probably need to optimize these. Here are the ones that we need to work on. Maybe we need some Bloom filter indexes to go and speed that table up." And it's that telemetry, that information about how users are using our system, which is so, so valuable. So those are the main bits and pieces that we'll see inside SQL Analytics. What we'll do now is a quick tour. We'll show you the dashboard, step through some of those areas, an endpoint and some queries, and give you an idea of what it's like to work inside that environment. So here we have the familiar look, which is the Azure Databricks workspace, or what we now know as the data engineering and data science workspace.
And what you'll see is, right down at the bottom, I've got this little switch that allows me to toggle between the two different workspaces. So if I've done some data engineering, I've created my tables, I've registered them with Hive, and I now want to query them with SQL, I can then switch over to my SQL Analytics workspace. You'll see it takes a moment to switch over, we get a nice little logo, and then I'm in the new SQL Analytics environment. Now, this is currently in preview, so you might not have that little button just yet. But when you've been enabled for SQL Analytics, you'll be able to do that and switch around. So I'm now in the SQL Analytics workspace. First things first, let's make sure I've got an endpoint set up, because again, it is still Spark; I need to make sure I have a cluster that's actually running.
I'm going to go into Endpoints, and I get my list of the endpoints that are there. You can see I've got different endpoints set up for my different end users. My marketing guys have got one, my customer sales have got another, and the one we're going to look at is Business Intelligence, which is already running. If I wanted to create a new endpoint, I can just create a new SQL endpoint: a nice, simple little box. I give it a name, and then, is it a cluster? Not really. A cluster size. And again, there's this idea of T-shirt sizing. So you're looking at the DBUs, and you can kind of map it back to how big a Spark cluster you'd need in the traditional Databricks workspace to get this. But the whole idea is that you don't need to think in that much detail. They've already picked the right kind of cluster, the right type of high-concurrency cluster doing Delta caching, so you don't have to think about it.
So it's been designed to make it really easy for you to pick the right type of cluster for your workload. Now, we still pay for the cluster while it's turned on. So if you want it to turn off automatically, if you don't want to have to worry about remembering to press start and stop, we've got that auto stop. I could say, "Well, actually, just stop if no one's written a query, if no one's submitted any SQL in 20 minutes, turn off the cluster." The trade-off is that the next person who tries to use it will need to wait three or four minutes while it turns back on again. Now, there are some experimental preview options currently in here, and we're not going to go into too much detail on those. The first is this idea of multi-cluster load balancing.
Essentially, if I've got several different use cases, I can actually build several clusters. So rather than one person accidentally writing a huge query that slows down everyone else trying to use that same cluster, what we can actually do is load balance across several independent clusters, known as endpoints in this case. That is coming soon, so we don't have access to it just yet. We've also got this idea of Photon. Photon is a polymorphic vectorized engine, to throw lots of terms at you, but that's essentially saying they've built something really cool that can calculate more values at once than they were previously able to, meaning your queries go faster. Essentially, it's a whole flavor of Spark made to go faster for these kinds of SQL analytics queries. So that's something you can turn on. It is experimental at the moment.
So they haven't covered it for absolutely every different type of SQL you might throw at it. Use it at your own risk, but it's really cool that that's coming, and it's going to really, really speed up the kind of BI questions people ask against it. In this case, we're going to use my existing endpoint that's set up and running. If I need anything else in here, I've got my connection details, and we'll have a look at that; it allows us to connect to this endpoint using Tableau, Power BI, et cetera. I can also see how busy the cluster is and how many queries people have written against it. And if you're used to the normal Spark UI, you're used to looking at Ganglia, which is horrendously complicated and has huge amounts of detail, this is just a lot easier: really simple to see how busy my cluster, my endpoint, currently is. Okay, so moving on, we've then got queries.
So this gives me a list of all the queries I've got saved, all those data building blocks I've created in the past. I can write a new query and just start typing. You can see I've already got a database over here. This is a Hive database I created in my normal Databricks workspace: I registered the database, I created some tables, and I can see all the things that I've got access to as an Active Directory user. I can just start typing. So I can start typing "select star from sales" and it knows what's there; as I keep typing it can suggest the product tables, and it completes as I'm going. It's been built for people who are writing SQL. It's great. When I'm happy with how that's working, I can hit Execute. That's going to go off, run on the cluster, return the results, and just give me a table of data down at the bottom.
I can go ahead and start using that. So let's take one that I've built earlier: discard that, and go into my product performance query. I can see there's a bit of SQL. Now, one of the hard things is that people write SQL in all sorts of different ways, and I'm not really happy with the formatting of that one. So one of those small niceties is that we've got a SQL formatting button: I can just hit Format and it applies standard formatting to my SQL. Just the little SQL niceties that we're used to if we're in a SQL editor, which we've now got access to in SQL Analytics. So now that I've got this dataset here, you can see I've already created a visualization against that same dataset, and again, that'll just refresh whenever I refresh my data. Right down at the bottom, I've got a little refresh schedule. So I can say, "Well, how often do I want it to automatically refresh this?"
So I'm not dealing with jobs, I'm not dealing with a whole ETL schedule, I'm not having to use an external tool. I'm just saying, "Actually, could you just refresh my data every hour, or every day?" And it will automatically turn on the SQL endpoint, run the query, update the visuals, and turn off. So I can go and see the latest visuals without the endpoint even turned on. If people just need to see what today's data looks like, I can refresh it overnight and then have the endpoint turned off while they look at the most recent data, and it's really nice, really cheap. So going on to visualizations, it gives me a whole list of different types of advanced visualizations I can dig into. In this case, I can just do a basic chart, which gives me lots of options for what kind of chart I want.
So I can do a bar chart, really simple: put the category on there, give it some sales, and then I've got a little visual I can start playing with. The chart ones are very similar to the ones that are baked into the notebook experience. Then I've got a whole load of other advanced visuals I can start playing with. So I can go to the nth degree of changing around fonts and formatting and colors and all sorts of things in there. I can just take that visual and go, "That's a nice visual, I'm going to stick with this." And then what do I do with it? Well, I probably want to include it in a dashboard. So I've got some options. Do I want to take the data and export it to Excel and go play with it somewhere else? Or do I want to take this visual and add it to a dashboard? In this case, I have a product dashboard, and I think I've got a sales dashboard. So I can take it, tag this visual, and push it to an existing dashboard.
So then if we have a look at the dashboards I've already got, I can go in and see which dashboards I've got access to. I can share my dashboards with different people. I can set up a refresh schedule for a dashboard, and it will automatically run all the different queries that the visuals I've included are based on. And there's a certain amount of interaction. It's not as super rich and interactive, with slicing and cross-filtering between different visuals, as you'll get with a premium BI tool. But it's quite nice to go and explore bits of data, dig into things, go and explore the map visuals. They're all nice and tactile and interactive, but it's still not a premium BI tool. It's just a nice, cheap way of visualizing data if you're already using Databricks and you're in this ecosystem. So moving on, we've got our history.
So you've seen I've been writing some SQL and running some queries, and I've got this list of everything that I've just been running. I can go and say, "Well, actually, that took two seconds. That took 17 seconds. That's more than I expected." I can dig into that and get some information. What was the SQL that the person wrote to actually trigger that workload? How long did it take? I can open up the Spark UI and get the full details of the SQL execution plan, what actually happened under the hood, if I need that level of detail. I can also go and see what the breakdown was. Was it because it took too long figuring out and compiling that and turning it into a Spark job? Was it just the pure grind of working with the data? And I can get a little bit more detail about how efficient the SQL that's been written is.
And again, that's not just for me. I can go in there as an admin to see all my different users: what kind of things are they writing, and how slowly is it going? Where do I need to target my optimizations? Then finally, we've got these alerts. So that's where I can actually put a trigger in and say, based on a query I've just written, how often I want it to check (it will check every time it refreshes) and then what kind of trigger it should have. This one's been triggered, so that would have sent me an email because my total sales were above a certain amount. And again, it is just that easy to go in, pick the measure on that particular query, how you want to compare it, the value that you compare it to, and what kind of email it should send. So if you're using a help desk kind of tool and it should raise a service request, you can have it bake in a standard template.
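(Going back to the earlier example of watching request volumes, the query an alert sits on top of can be as simple as a single aggregate; the table name here is assumed. The alert itself is then configured in the UI to fire when the value crosses the threshold you set.)

    -- The alert watches this one number on each scheduled refresh.
    SELECT COUNT(*) AS requests_today
    FROM sales.web_requests
    WHERE request_date = current_date();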
Okay. So as a tool, there are loads and loads of different things in there, lots of different niceties built for people doing reporting, doing exploratory analytics, and just trying to write SQL against the data in a way that is actually comfortable and makes sense for them. So we saw lots of features around dashboarding and visualizations and query analysis inside SQL Analytics. But it's also an optimized engine for serving analytics through other tools. So one of the big questions is: how do I use things like Power BI and Tableau and Qlik against the SQL endpoints from SQL Analytics? If I was using Power BI, for example, there's already an Azure Databricks connector. So I can just open up Get Data in Power BI, choose Azure Databricks, and it's going to ask me the right questions to get that connection working.
Now, that will work against either the straightforward data engineering and data science workspace, against a normal cluster, or against the new SQL Analytics endpoints. So I'm going to set this up and it's going to ask me a few questions. It'll come back saying, "What's the server hostname? What's the HTTP path?" Previously, on a normal cluster, you had to dig into the cluster settings and the advanced config to come in and pull all that info out. And again, that's just not particularly accessible to a BI end user, someone who just wants to plug in a BI tool. But we saw against the endpoint, when I clicked on the connection details, we were given some of those bits of information, and we've got the server hostname and HTTP path coming straight out of there. So it's already built in such a way that you can just get the information you need and nothing else. You don't need access to all the other bits and pieces that you might use if you're doing a straight Spark connection and using all the other big data Spark tools, because it's built expecting you to connect a fairly traditional BI tool.
Okay, so just to wrap up. We've seen all these different tools, all the different bits and pieces that are being brought in to try and help people on that lakehouse journey. So we've brought in our BI developers, who have now joined our data engineers and our data scientists in accessing Databricks directly, but not quite, because they're using a different workspace. They've now got an area designed just for them, just for their kind of workload and their kind of development style, to help them do what they're trying to do, and it's optimized in that way. And then we've got all the report consumers who can now much more easily access a SQL endpoint and point their BI tools against it. And as that engine, as that SQL Analytics cluster improves, as we see Photon come online, it's going to get faster, it's going to get more optimized, and it's going to get better at serving data out to those kinds of BI requests.
And that's what SQL Analytics is doing. It's not really changing the fact that people can use Delta and do all the BI engineering, do all the data engineering, and prepare the data in the right way. It's just allowing these various different people, who just want to access the data, use the data, and gain insights from the data, to use the same set of tools, to come on board and use the same platform to serve that data.
So thank you very much. Again, I’ll be around to answer any of your questions and as always, please do leave feedback. It’s incredibly valuable for the conference organizers and for us as speakers. Guys, hope you have a great session.

Simon Whiteley

Simon is a Microsoft Data Platform MVP, awarded in recognition of his contributions to the Microsoft Data Platform Community. Simon is a seasoned Cloud Solution Architect and technical lead with we...