Databricks SQL Analytics Deep Dive for the Data Analyst

May 26, 2021 03:50 PM (PT)

In this session, we will cover best practices for analysts, data scientists, and SQL developers exploring Databricks SQL Analytics as a solution for their companies. This guided technical tour of the product walks through:

• Creating and working with queries, dashboards, query refresh, and alerts
• Constructing queries for semi-structured data, such as JSON, structs, and arrays
• Navigating the improved Spark SQL documentation to find and leverage powerful built-in functions to solve common problems
• Creating connections to third-party BI and database tools (Power BI, Tableau, DbVisualizer, etc.)

In this session watch:
Doug Bateman, Principal Technical Instructor, Databricks
Kate Sullivan, Technical Curriculum Developer, Databricks

 

Transcript

Doug Bateman: Hello, and welcome to the Databricks SQL deep dive for the data analyst. I’d like to go ahead and introduce your presenters here. So first I’ll pass it over to Kate to introduce yourself.

Kate Sullivan: Hi. I’m Kate Sullivan, a technical curriculum developer at Databricks.

Doug Bateman: Thanks Kate. And I'm Doug Bateman and I'm a principal data engineering instructor here at Databricks. We are really excited to be here today to talk to you about Databricks SQL. We're going to start out with a high-level overview of what Databricks SQL is. And then Kate's going to give you a guided tour of writing queries, creating visualizations, and generating dashboards using the product. After that Kate's going to show you just how powerful Databricks SQL is by applying it not only to your highly curated tables in your lakehouse, but also using it to query the JSON and hierarchical raw data that you might find in a typical data lake. So you can generate business insights at any stage of your data pipeline. And finally I'm going to show you how you can use the same query engine that's powering the Databricks SQL user experience to directly connect to third-party business intelligence tools such as Tableau or Power BI, or other database tools that you may be familiar with. So you can use the same underlying high-performance engine for a lakehouse with whatever tool you may prefer to use.
So let's get started. What is Databricks SQL? Put simply, Databricks SQL is a new home for you, the data analyst, inside of Databricks. The goal is to enable data analysts to quickly perform ad-hoc and exploratory queries. So you'll see up here on the left, we'll express our queries in SQL, and then we can turn around and generate high-quality visualizations, and then stitch them together into dashboards, which will be the essence of our demo today. You can even create alerts that get fired off to trigger emails, instant messengers, pagers, or anything else whenever your data changes in a critical way, allowing you to respond to business needs even faster.
Under the hood, what this really means is that you've got your data lake with all of your data, your curated data and transactional data, stored using Delta Lake. But a data lake can also have unstructured and semi-structured data in other file formats such as JSON. And all of this is then made available as tables using our central data catalog. On top of that we then have a high-performance, high-throughput query engine, Photon, our native vectorized query engine, which can stitch together and run queries at very high throughput and relatively low latency, giving you really great user experience, performance, and value for your lakehouse.
All this is then made available to you either through the user experience in the UI, that's our analyst experience in the Databricks SQL UI that we're about to demo to you, or you can have tools connect directly to the SQL endpoints, so that the tools that you're familiar with run SQL that is executed against your lakehouse using Databricks. So I'm going to go ahead and kick it over to Kate now to demonstrate the user experience of querying in Databricks SQL. Take it away Kate.

Kate Sullivan: Okay, so now that you've learned a bit about Databricks SQL and how you can use it in your workflow, we're going to see a demonstration of some of the features and what you can do with it. And then we're also going to dive a little bit deeper into Spark SQL and see what kinds of special functions we have access to, to manipulate data.
So I'm going to go ahead and share my screen. And what we'll be looking at first is the Databricks workspace UI. So if you're already familiar with Databricks, you'll be familiar with this workspace UI. If you have access to Databricks SQL, you can access it using this app switcher over here. That will open up the Databricks SQL UI. Okay. When you first get access to Databricks SQL you might be given a link, or you might have to navigate from the Databricks workspace. If you don't already have access to the Databricks workspace, you don't need to worry about how to get there; you can just get here however you were invited. When we first open up we can see that we've got, on the left side here, some navigational tools that we'll be using as we work through this demonstration. Again we've got the app switcher so we can go back to the Databricks workspace if you also have access to that. We've got this little question mark here, which is handy; it'll open up the documentation for Databricks SQL Analytics. And then you've got your account settings. Fun.
So let's get started by looking at our endpoints and making sure that we've got an endpoint up and running. Endpoints are what power your queries and how you will access your data. This demo endpoint is running, so that's good. And I can go ahead up here and just click create and go straight to writing my very first query. So I've already created a few tables for us to work with today, and let me just give you some context about that. These are IoT datasets; IoT stands for internet of things, and generally refers to systems of interrelated computing devices and machines. In this case we've got a collection of data center data. A data center is just a big room full of computers, and this data will show us important information about our data collection devices: battery levels (we want to make sure our batteries are okay), CO2 levels (we want to make sure that we're staying within a safe limit for CO2), but also the location of the devices sending this data, the device names, et cetera.
So we're going to explore this dataset a bit and build a quick dashboard to get started. So here I've got my running endpoint selected and I've got my database selected; I wrote all of our tables to the DB academy database. And then over here in the schema browser, I can see the tables that are available in that database. When I click on this table, I can see detailed information about the column names and the type of information in each column. So this is one way to investigate our table. Another way would just be to run a select all. So I'm just going to do a select all from, and then I'm going to take advantage of the autocomplete feature that is available to us in the query editor. And then I can click execute or I can hit Command+Enter to run.
So that's executing, and it brings up my table, showing me all the columns that I expected. I can see here that these devices are spread all around the world, so maybe it would be good to get a picture of where these devices are located. So maybe the first real query I want to write is going to help us put these values on a map. So instead of select all, I'm just going to come in here and add a count. And I'll count my device IDs, and I want to store that under an alias; that'll give me the number of devices. And then one of my country code columns here is CCA3, so I'm going to use that, because that shows me the three-letter ISO standard code. And then this has become a bit messy, so I'm going to use the auto-formatting; what I did was press Command+Shift+F. The other thing that you could do is click this format query button, and that's going to auto-format your queries. So I'll just include a GROUP BY here, so that we're counting by country. And I will run this by hitting Command+Enter.
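For reference, the finished query looks roughly like this; the table name (dbacademy.iot_data) and column names (device_id, cca3) are inferred from the demo, so treat them as illustrative:

    SELECT count(device_id) AS num_devices,
           cca3
    FROM dbacademy.iot_data
    GROUP BY cca3;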
All right. So that shows me that this query is executing, and it returns this table that shows me the number of devices per country. This is not the very best way to view this information, so I'm going to go ahead and add a visualization so I can make this a little bit more clear at a glance. The type of visualization we want to see here is going to be a map. And then I'm going to name that devices worldwide. Here we get to make selections about the particular values we're trying to represent. This drop-down shows me all of my different options, so if I have information just in the USA or just in Japan we can focus in on those. But we have, again, data worldwide, so we're going to keep this countries map. And then for my key column I'm going to choose CCA3, that's my abbreviation; and remember that that is the ISO three-letter code for each country. For my value column, I want to choose number of devices.
Okay, that gives me a map. I can see the USA has the most devices, but I can't really tell too much about this, so I'm going to go into this color space and I'm going to separate this information into quantiles. And then I'm getting a little bit more variation here; I think it would be even clearer if I chose a different color, so I'm going to choose light green as the variation. Now I can see sort of the different levels that are in this map. So I will quick save. And then this shows up right underneath my query. We'll talk more later about how we can bring this up and share it, not at the query level but at the dashboard level. But for right now I'm actually just going to make sure this query is saved. I'm going to name it devices worldwide. And I'm going to add a tag so I can find it easily later, even if I have a lot of other queries. I'm going to call this data AI summit. I'll click okay. Excellent.
All right. So let's go ahead and just create a new query. One way to do it: I could go out to this queries menu, where I can see the query I just created; it is available to me. If somebody else had shared a query in this workspace, it would also appear here. If I were an admin, I could see all of the queries that exist for this whole workspace. But for me, for right now, I can just see this thing that I made, and I could choose here to create a new query using this button, or I could use this shortcut create button.
So I will create a new query. And I think what I would like to see now is a sense of how much CO2 all these devices are putting out. I noticed that CO2 level is one of our columns. I'm going to use a case statement so that I can group those individual values that I was looking at into categories. So I've actually already written this, so I'm just going to paste it into the query editor and I will explain what it is. So here we are selecting: first, again, I'm just counting the number of rows here, so I'm counting the CO2 level as the number of devices. And I want to include the device name in my result set. And then something that makes this a little more complicated than the query that we just wrote, for example, is this case statement.
So case statements are pretty common across SQL dialects, but if you're not familiar with them, I'll just go through and explain this one. It starts with the keyword CASE. And then I'm laying out all the situations that I want to account for. So here I say when the CO2 level is less than a thousand, then I want to tag that with a categorical string description, which here is less than 1K. Then I'm saying when the CO2 level is greater than or equal to a thousand and less than or equal to 1200, then I'm going to include this other categorical label. And then anything other than that is going to be greater than 1200. I end that case statement with this END keyword. And I included AS level so I can rename it: this is going to create a new column of all these categorical variables, where if the CO2 level is less than a thousand, then I'll get that label of less than 1K. And I'm going to call that new column level.
Again, that is selecting from our DB academy IOT data gold table, and I'm grouping by my level and device name. So I'll go ahead and execute that whole query. And that has done what I expected. It calculates a number of devices, it includes my device name, and it says that it is somewhere between a thousand and 1200, or less than a thousand, or bigger than 1200, and those are the only values in there. So now I can use a bar chart. Instead of showing all this very specific information about the CO2 levels, I can use the bar chart to show how they fit into these broader groups. I'll go ahead and just add that now. And say I want that to be a chart, that's true. And I'm going to just name this CO2 levels. And I'll keep that as a bar chart. And we'll say the levels are going to be the X column and our number of devices will be our Y column.
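Reconstructed from the description above, the CO2 query looks roughly like this (again, the table and column names are inferred from the demo, not confirmed):

    SELECT count(co2_level) AS num_devices,
           device_name,
           CASE
             WHEN co2_level < 1000 THEN '<1000'
             WHEN co2_level >= 1000 AND co2_level <= 1200 THEN '1000-1200'
             ELSE '>1200'
           END AS level
    FROM dbacademy.iot_data
    GROUP BY level, device_name;

Note that Spark SQL lets you reference the level alias in the GROUP BY clause, so the CASE expression doesn't have to be repeated there.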
And now here, even though we're not yet grouping by devices in the visualization, we can choose this device name group by option to get a little more detail in our visualization. So this shows me that same data, just grouped by device name. Okay. So I will quick save here. And then it occurred to me that this is showing me that worldwide data, but there are definitely situations where I might want to just zero in on individual countries, especially given that there's so much difference in how devices are spread throughout every country. So what I'm going to do is just include a WHERE filter. And I know that we had CN, that was country name. So I'm going to say just for CN equals United States. Okay. So now when I run that I can see that the results adjusted, and this is presumably showing me data that exists just within the United States. And that's really interesting, except that if I wanted to look at other countries I'd have to go in and change this query every single time. And that's not super useful.
So actually what I want to do is I want to allow the user to drill down to whatever country they want to focus in on by using a query parameter. So I’m just going to delete what we’ve got here. And a parameter is a way to allow the user to send in their own values. So I’m going to keep this filter working the same as it is and I’m just going to instead click the add parameter button and I’ll call that parameter country name, and I’ll leave that as a text box, and I’ll say add parameter. Okay. So we can see here that now we’re expecting a parameter. So if I type in United States, for example, and then click apply changes. Hmm. Interesting.
This is actually one of the problems with using this text box. I can see here that I actually haven't used the right case, and I haven't accounted for that. There are ways to account for that in your actual query, but I haven't done it here, so that's just user error on my part. I typed in United States without caps and I didn't get anything. But if I type it in the way that it actually appears in the table, I get matched results. We can eliminate this type of problem by using a query based drop-down list. That's another option for this kind of parameter. So I'm going to actually save this query. And here where I chose text box, what I'm actually going to want is this query based drop-down list. Before I do that, I need to create a query that backs the drop-down.
So I'm going to click this create button and click create query. And I'm just going to call this countries, because this is going to hold all of my countries. And I'm going to note right here in the tag that this is a query I'm using just for the parameters. And then I just need to create a list of all the countries that exist. So I can do a select… actually, I'm going to select distinct country name from DB academy IOT data. Cool. Let's format that up. And that's going to give me all of the countries. Oh, it'll be easier if I order that, so I can give the user an alphabetical list of countries and they can find what they're looking for.
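The drop-down query is just a distinct, ordered list of country names; a minimal sketch, assuming the country-name column is cn as mentioned earlier:

    SELECT DISTINCT cn
    FROM dbacademy.iot_data
    ORDER BY cn;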
All right. So with that I'm going to actually go back to my other query, which I hadn't actually named. I'll take care of that right now; I'll call this CO2 levels by country. And then instead of where we have this text box, I'm going to change that to a query based drop-down list. And I can find my countries parameter query right here. I'll click okay. And then there's one more thing that I need to do to make sure that this query runs: this parameter is going to get filled into this query, and I need to make sure that it, as a string, is actually wrapped in quotes. So I'm going to put quotes around this parameter value. And then when I select a new country I'll get the option to apply changes. And then here we go. Turkmenistan does not have very many devices to account for. But if we look at Ukraine… okay. So we've got our visualization all still working. We've just subbed in a parameter so that now anyone who looks at this can decide which country they want to focus in on. Great.
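In Databricks SQL, parameters are written in double curly braces, and string parameters need to be quoted by hand, so the filter ends up looking something like this (parameter name as created in the demo):

    WHERE cn = '{{ country_name }}'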
Another thing that I could do here: let's say I want to reuse the form of this query. I want to make a really similar query, because this seems like it'll be useful for a lot of the values that I have, but I don't want to rewrite the whole thing. What I'm going to do here is just save this query and then fork it. Okay. So I'll save this query and I will fork it. A fork will just create a copy of the same query, and then we're going to change up a bit of this logic. So I'll say temperature by country. And instead of CO2 levels (this part isn't so important, it's still the same count of rows), I'm going to sub in the word temps, because that's my column name, everywhere that I have CO2 level. Okay. That's all great.
And then my temperature: I know from inspecting earlier that my temperature values are listed in Celsius, so we're also going to need to change the ranges quite a bit. So I'll just choose 20 degrees: less than 20, between 20 and 30, and greater than 30. I'll change that in the string categories as well as in my code. Great. All right. Now, I'm keeping everything else in the query the same, but this should, theoretically, work the way that I expect. Okay. So I'm taking in this new country value, the parameter is still working, and we're getting the information back as a table. And if we check on our visualization, I can see that I've still got a working visualization and it's updated with the right labels. The only thing that is wrong is actually the CO2 levels title. So I'm going to click on edit visualization, and I'll just call this temp levels instead. Okay.
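So the forked query keeps the same shape, with only the CASE expression swapped out; roughly like this, with the ranges in degrees Celsius as described:

    CASE
      WHEN temps < 20 THEN '<20'
      WHEN temps >= 20 AND temps <= 30 THEN '20-30'
      ELSE '>30'
    END AS level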
So at this point I've got a couple of visualizations, so I'd like to add them to a dashboard so we can keep track of this information. And I don't have any dashboards yet, so I will create a new dashboard, and I'm going to call this IOT device data. Great. So this brings me into the edit mode of my dashboard. You can add a text box here if you want, and that's good for describing your data. This supports basic markdown, so you might have something like IOT data worldwide, and say this is a collection of information about IOT devices or about data [inaudible]. Okay. I'll add that to my dashboard. And don't worry about where anything goes immediately, because once you have a bunch of visualizations you can drag and drop and rearrange your dashboards as you wish.
So now that I’ve added that text box I’m just going to go ahead and add my visualizations also. So I want to put my devices worldwide up there. And under visualizations I have the option to show off that table that gets generated and shows the query result set. Or I can show the map that I created. I’m going to choose map. I’m going to add another visualization. I need it to be CO2 levels by country. And notice here that I have a parameters detail, because this is a query that includes a parameter. First I’m going to make sure that I’m showing my bar chart and not my table. And then I have the option of choosing whether or not this query parameter is going to show up as a dashboard level parameter or a widget parameter. I don’t want to set a value for this because that would take away the functionality that I was hoping to get.
So I'm going to choose new dashboard parameter, because I know that I have two queries, two visualizations, that I will want to control at once. So I'll click add to dashboard here. And then finally I'll just add my third visualization. Whoops, let me make sure it's my bar chart, and I will click add to dashboard. And now we have this country name being selected. I'm going to move this, actually, so that you can see them right next to each other. And now I can see that when I change my query parameter, both of these are going to change at once, because they're both referencing the same country. This map doesn't include the parameter at all, so that doesn't need to change.
All right. So I’m going to click done editing here. And then I’ve got a dashboard. You can share this with people who are also in your workspace using this button. Right now this is actually only visible to me, but I could add it to all users or I could add it to any of these email addresses that are also within this workspace.
Another thing you can do is schedule a time for this to refresh. So imagine that this data is changing, and we would want to keep track of important changes. We can set a schedule for these visualizations to refresh, and what that's going to do is run those underlying queries again and show the new information. You can attach that to a SQL endpoint so that you're running on some particular endpoint. That's not available to me right now because I haven't actually selected an interval for refreshing.
But what if we're in a situation where our data has changed and we're not paying attention to it? We also want to be made aware of changes in our data even when we're not keeping a super close eye on it every single time it refreshes. So what we can do for that is actually set an alert. An alert can be attached to a query. And this particular query is actually not set up right for it; it's not tracking these individual temperature levels. But if we were tracking individual temperature levels, we could select the query and say, all right, I want to make sure that I get notified if any of my values go above 30 degrees. And when you do that, the default behavior is that it's going to email you, and then it's going to show up within this SQL Analytics UI. But you can also set this up to integrate with Slack or PagerDuty or whatever your company typically uses for notifications.
All right. Cool. Well, that does it for this first part of the demo. I think we're going to go back and talk a little bit more about how to handle some data that's a little bit messier than what we've been working with. And then we'll jump back into another demo in a little bit. Okay. So thank you very much.

Doug Bateman: Awesome, Kate. So, so far we've worked with relatively flat, easy-to-query tables. This is consistent with our Delta architecture, the multi-hop design pattern in our lakehouse, where the gold tables represent really processed, easy-to-query, analytics-friendly business aggregates. Which is where you want to be doing most of your querying. Compare that to the bronze data, which would be the data that's been ingested from your various different systems in its raw form, specifying where I got it, when I got it, and what the data was, without any changes made.
But between the two, we have this silver-level data. And the silver-level data really represents a cleaned-up version of the bronze data, where you've filtered out invalid data, you've normalized it a little bit, you've standardized the data formats, and you may have augmented it with data from other tables. And the idea of the silver data is that this really represents the broader enterprise data model. So whereas the business-level aggregates might be your star schemas, really refined, the silver level is your broader enterprise data model. And very often in any really large system, we find that while most of the time we can get what we want from these gold tables, sometimes we need to go back toward the source of truth to find missing fields, or information that we've never previously been interested in but that we are interested in now and want to gain some insight from.
So one example here would be from our internet of things schema; here's some data that came into our silver tables from a hierarchical JSON data source. So you'll notice here that we have the date that we captured this internet of things data, and here's the device that we're capturing from. But then there's source, and I've got this map, just this unstructured map of different sources where my data might be coming from, with key-value pairs: where did the data come from, and what was the data. A description, the IP address, and then the temperatures we captured (we might have multiple temperature readings in an array), and similarly multiple carbon dioxide levels. And this is what our sensor data is picking up.
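In Spark SQL type notation, the shape of this silver table is roughly the following; the field names here are illustrative, based only on the description above:

    dc_id   STRING,
    date    STRING,
    source  MAP<STRING, STRUCT<description: STRING,
                               ip: STRING,
                               temps: ARRAY<INT>,
                               co2_level: ARRAY<INT>>>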
It's not at all unusual to find that sometimes you want to go back to that original raw sensor data. But querying these more semi-structured, less structured data types requires a more sophisticated vocabulary in your query language. And this is one of the places where a lakehouse really shines. Because lakehouses know how to work with semi-structured and unstructured data types ingested from other systems, as well as these really nice, friendly, flat, star-schema-type tables that you'd see in a traditional data warehouse.
So Kate's going to show you how Databricks SQL and its query engine really allow you to query this type of data. So take it away, Kate.

Kate Sullivan: Okay, so now you know we're going to be working with the silver-level tables, and that that data is generally a little bit messier than the data in our flat, gold-level tables. So we're going to be working with some nested data and some array-type data, and I'm going to show you a couple of cool functions that you can use in Spark SQL with Databricks to handle that kind of data.
So I'm just going to start by sharing my screen again. And we're ready to jump back into writing queries. So you can see here I've got all of the queries that I wrote in the previous session. I'm going to go ahead and just click new query. Okay. So, like usual, the first thing that we probably want to do here is just visually inspect the data and see what we've got to work with. So I'm going to select all from DB academy IOT data silver; I just used the arrow to drop down to that other table. Then I'll hit Ctrl+Enter to run.
We should expect to see some complex data here, because we already saw the way that these columns are mapped out. And as we expect, we have these three top-level columns. These two are the date and source, and within source we've got sort of a messier structure; I'm going to just focus in on that. And remember that source actually holds a map type, which holds key-value pairs. And if you use the explode function… oops… that's going to separate those key-value pairs into two columns. So here we have the key, which looks like it's holding just the device type, that one top-level descriptor. And then the value, which holds a whole bunch of definitions.
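A minimal sketch of that first explode step, assuming the silver table is named dbacademy.iot_data_silver: explode on a map column returns one row per entry, split into key and value columns.

    SELECT explode(source)
    FROM dbacademy.iot_data_silver;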
So we're probably going to want to flatten that out more to get to that value, but before I do that I'm just going to make sure that I'm getting all the information from the original table. So I want to include the DC ID. I'm going to take this opportunity to actually turn the date into a date type, and I'll alias that as date. And then I'm going to format all of that, and when I run this I'll get a very similar table, but now I have all of the original information that I had before. The problem is, I would like to be able to flatten out this value column a little further, but I can't really access it right now. What I could do is treat this as a regular subquery and query that result. Or, what I'm actually going to do is create a common table expression that uses it. So I'm going to name this subquery and then query it in a common table expression.
So I'll say with exploded source. This common table expression allows me to query this result set directly, right in this query. So I'm going to write my new query now that actually references the results of the query above. I'll just write select. I'll choose key, and I want to make that a little bit more descriptive, so instead of key I'll call it device type. Then I want to include the date. And I want to include this value column; and actually I can use dot notation to reference the fields inside my value column. So I'll choose value dot description, and value dot IP, and value dot temps, and value dot CO2 level. Okay, format all that. Okay. So those are all the things I want to select, and remember I'm selecting that from our exploded source table. And notice that even the local definition for the exploded source query comes up here in the autocomplete.
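Putting it all together, the common table expression looks roughly like this; dc_id and the field names inside value are inferred from the demo:

    WITH exploded_source AS (
      SELECT dc_id,
             to_date(date) AS date,
             explode(source)
      FROM dbacademy.iot_data_silver
    )
    SELECT key AS device_type,
           date,
           value.description,
           value.ip,
           value.temps,
           value.co2_level
    FROM exploded_source;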
So I'll hit Ctrl+Enter to run. Okay, so great. Now I've got all the original data that I was working with before. It's all pretty much flattened out, except I've got those two lingering array columns, temps and CO2, to deal with, but I no longer have any whole nested objects. What I'm going to do is use this common table expression to create a new table in my sandbox environment. Generally when you're working in this lakehouse pattern you're probably going to be working on data in production; that is, you're going to be pulling in new data that is actually part of your production work. What we want to do when we're messing with data like this is write to somewhere else, somewhere isolated, my own private sandbox environment where I can dig into the data and get a little bit messy, and not have to worry about changing anything or copying anything by accident, or any downstream dependencies or anything like that.
So I'm actually going to use this to create a table, and I can do that by just saying CREATE TABLE IF NOT EXISTS. And I'll put it in my sandbox database, which I have created already, and I'll just call this flat IOT. All right, great. If we want to use the schema browser to check this out, I can refresh the databases, follow that down, and I can see that it's been written: I have my flat IOT table and I've got all of my columns and their respective types.
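As a sketch, the table creation is just the same CTE query wrapped in a CREATE TABLE statement; my_sandbox here stands in for whatever your own sandbox database is called:

    CREATE TABLE IF NOT EXISTS my_sandbox.flat_iot AS
    WITH exploded_source AS (
      SELECT dc_id, to_date(date) AS date, explode(source)
      FROM dbacademy.iot_data_silver
    )
    SELECT key AS device_type, date, value.description, value.ip,
           value.temps, value.co2_level
    FROM exploded_source;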
Now that I'm done with this table creation, I'm actually just going to delete that statement, and just to make sure we have this exactly as we expect, I'm going to select all on that flat IOT database. Sorry, flat IOT table. And just like I expected, it shows me exactly what we had as output for the last query. Okay, so here we have our mostly flat table. Everything here looks really good; it looks really easy to work with, except we've got these two columns that contain arrays. Now, arrays are historically difficult to work with in SQL. We have all of these SQL functions, and almost all of them expect you to send in a single value. What we're going to use now are higher-order functions, which are simple extensions to SQL that can make working with arrays a lot less error-prone than some other traditional methods.
So let's get started really by just jumping in and writing some higher-order functions, and while I do that I'm going to explain how they work. The first one we're going to work with is transform. So within this flat IOT data I'm going to just select a few columns that I want to show up: device type, and I want to show the temps column. And then I'm going to want to do a transformation on this temps column, because I have all my values in Celsius here but I might like to show them in Fahrenheit instead. So in order to do that I'm going to start out by just naming the function I want to use, just like you would for anything, like average or min or whatever. And within that function I'm going to send a couple of different arguments.
So the first argument is going to be the column that holds my arrays. This first argument to transform always needs to be a column that holds array values; transform is meant to work with arrays. The next thing I want to send in is an iterator, and that iterator is just something that I'm going to use to take the place of each of these list values as I move through the list. So what I'm going to tell this function to do, I'm going to define right here. And I'm going to use T as my stand-in value for every single one of these elements. So if we do this really simply, if instead of making this conversion I just wanted to multiply T times two, I would say okay, for every value in this list, I want you to take each value and multiply it by two. And then if we were to run that, we get a one-to-one match on the number of elements in each array, but 16 has been transformed to 32, 13 has been transformed to 26, 19 has been transformed to 38, and so on.
So that's all fun, but multiplying this temperature by two isn't exactly meaningful. So what I'm going to do instead is actually use the conversion formula for Fahrenheit. I'm going to take each one of these temperatures and multiply it by nine, and then I'm going to divide that result by five, and I'm going to wrap this in parentheses even though order of operations probably has me covered, and I'll add 32 to that. And I'm going to alias that as temps F. Okay. So now I can see that each of my Celsius temperatures has been converted to a Fahrenheit temperature. Transform is really useful because it's straightforward: it takes an array, you can choose anything to be the iterator variable, and then you just define whatever you want it to do.
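The full transform query, as described, with column names assumed as before:

    SELECT dc_id,
           device_type,
           temps,
           transform(temps, t -> (t * 9 / 5) + 32) AS temps_f
    FROM my_sandbox.flat_iot;

The lambda t -> (t * 9 / 5) + 32 is applied to each element of the temps array, so the result is a new array of the same length.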
The next function that we're going to work with is a little bit more complicated, but it's awfully useful. Again, I'm not going to save this query, I'm just going to overwrite it. This next function is called reduce. So I'm going to again select device ID and device type, and I'll include the temps again. And then the thing that I want to use here is called reduce. What that does is, instead of giving me an array back out, this will give me a single value. So this is good for: I want to take this array and summarize it somehow. If I wanted to get an average temperature, for example, I could use reduce to merge all of these elements into a single summary value. So the first thing we're going to send into that reduce function is this array; same as with transform, the first argument that we want to send in is the name of the column that holds our array.
The next thing we’re going to send in, I’m going to write it as zero and then I’m going to come back and explain it later. Our next argument is actually going to be the two variables that we’re going to send in to our functions. So one, I’m calling T, and the other I’m calling ACC. ACC here, I’m using to stand for accumulator. The way that this function works is we’re going to be looping through the list and sort of keeping track of the running total of the sum of all of the elements as we work through the list. So in order to get the average temperature here, we want to sum all the elements in the list and then we want to divide by the number of elements. Then this second variable, ACC, is going to help me keep track of that running total, while T sort of visits each item in the list and adds it to that accumulated value.
So I said I was going to explain this zero value later. Zero initializes that accumulator variable; it gives it a starting point. If for some reason you wanted to start counting at five, you could do that, but in this case that would throw off our calculations, so we're not going to do that. Okay. So I'm sending in these two variables, I'm defining them right here, and then I'm going to write the first function, which is just going to add this iterator to the accumulator variable. So for example, here the first value of T is going to be 16 and the first value of ACC is going to be zero. So T, 16, plus zero: the result here is going to be 16, and that's my new value for ACC. So the next time this function runs, T is going to be 13, and that's going to get added to that running total, that 16, to make 29. Now the value of ACC is 29, and then the next value is going to get added to it. So the ACC is keeping track of that running total as I go all the way through the list.
Once that gets all finished, I'm going to take that accumulated total and send it to a second, finishing function. And that finishing function is going to do the work of dividing my accumulated total by the number of values. So I'm going to say my accumulated total divided by… I can get the length of a list by using the function size, and I'll include the list name in that. And then I'm just going to alias that as average temps. Okay. Don't forget that I will need to include my from; that's going to be my sandbox DB. And then I'll want to sort that by average temps ascending. Okay, so I'm just going to go ahead and format this here. [inaudible]. And then when I run this, now I can see I've got my DC ID, device type, my original temps list, and then I have my summary temp for every single one of those list values.
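Assembled, the reduce query looks roughly like this (reduce is also available in Spark SQL under the name aggregate, which behaves identically; names assumed as before):

    SELECT dc_id,
           device_type,
           temps,
           reduce(temps,
                  0,                         -- initial accumulator value
                  (acc, t) -> acc + t,       -- merge: running total of elements
                  acc -> acc / size(temps)   -- finish: divide total by element count
           ) AS avg_temps
    FROM my_sandbox.flat_iot
    ORDER BY avg_temps ASC;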
And those are all the higher-order functions that we're going to go over today. If you want to learn more about higher-order functions and play with them a little bit, we've included some links, like the SQL Analytics function documentation, and also some demo notebooks that you can use to start playing with higher-order functions.
Okay. So next you're going to see a demonstration of how you can connect to Tableau.

Doug Bateman: Thanks, Kate. So now let's get started talking about how we can connect third-party tools to the same underlying query engine that powers Databricks SQL. First and foremost, one of the really great things is that because Databricks SQL is available as an ODBC or JDBC endpoint, you're going to be able to connect a whole host of different tools that you may already be familiar with directly to your lakehouse, to get insights out of your lakehouse without having to first copy the data into some other system. This is really nice in terms of keeping all your data in one place and having a single source of truth.
So the first thing you'll need is either the JDBC or ODBC driver, which can be downloaded from the Databricks website. And many of the tools that we're talking about already have the JDBC or ODBC driver pre-bundled to make it even easier for you. And again, remember that under the hood all of these tools are really just using JDBC or ODBC to send the queries to a SQL endpoint, which is using the same powerful, vectorized query engine that's been powering all of the queries that Kate has demoed so far. And all of this is gaining insight directly from your data lake without having to first copy it into any other system. One single source of truth.
So with that in mind, let's get started with Tableau. You'll see here on the left, I've got the SQL Analytics UI that Kate has been showing you. And on the right I've gone ahead and launched Tableau. Now I'll come over here and click on the endpoints tab on the left in my SQL Analytics workspace, and then I'll click on my SQL endpoint. This is the server that I'll be connecting to, to service my lakehouse queries. I come over here to connection details, and this tells me the JDBC endpoint that I'd like to connect to.
Now on the right, I'm going to come over to Tableau, and I say I'd like to connect to a server. I'm going to come down here to more, and under more, I just type Databricks and click on Databricks on the right. It then asks me what host name I'd like to connect to. Let's clear out my old Databricks user names and host names. And I'll copy the host name and I'll copy the HTTP path; I just click on this little icon for a quick copy. And then for the user name, for higher security, what I'm going to do is generate a personal access token. So for user name I just put the word token. Then I come over here and say create personal access token, and I'm able to generate a one-time token. So I'll just say this'll be a token that I'll use for Tableau. And how long would I like this token to be active? I'm going to set it for one day, because you are all going to be able to see my tokens and I don't want to expose anything. So I click generate, and I click copy, and I'll just paste this over here into the password field so that I'm logging in using a token.
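Under the hood, the host name, HTTP path, and token get assembled into a JDBC URL. Its rough shape with the Simba Spark driver is something like the following; the placeholders in angle brackets are yours to fill in, the exact parameters depend on your driver version, and the connection details tab shows the canonical values:

    jdbc:spark://<server-hostname>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>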
Now what's really important is to make sure you get a good copy of that token and your tool is set up, because you won't be able to get the token again once you click okay. So now Tableau is connecting. There we go. And it sees that it's able to connect to my database. I'll go ahead and click okay on the left. And now I'm just going to go ahead and make Tableau full screen. At this point I get to decide what schema I want to connect to, and I can start building my query using the Tableau user interface. So this time around I'll come down here and I'll say I want to use the DB academy schema. I click search, and let's take a look at the gold tables that we've been querying all along. So I click on this, and now I get to search for my tables. There we go. And let's take a look at the events gold table.
Now we come over to the sheet, and here I'm finally able to start querying my table. I might be interested in looking at events by city. And in particular, I'm interested in knowing how many events are in each of my cities. I then come over here and I get a really great city plot, using Tableau, showing where all of my different internet of things events are happening. And I get this great heat map showing the most important cities when it comes to my internet of things. It even picks up events out here in Alaska and Hawaii in my dataset. Really cool. So what this demonstrates is the ability to use my existing tools, such as Tableau or Power BI, to query my lakehouse directly, enabling me to have a single source of truth for all the different tools in my organization. This is the power of the scalable lakehouse: to work with structured and unstructured data and serve as a single repository for knowledge in your organization.
And with that in mind, I'd like to thank you for attending our presentation today. Please click on the feedback link inside of the MeetingPlay conference application and let us know whether the session was helpful to you. Also, please keep in mind we're going to be around answering questions, so put your questions into the chat area and we'll start answering them. Thanks again for attending, and enjoy the conference.


Kate Sullivan


Kate Sullivan is a Technical Curriculum Developer at Databricks with over 15 years of experience designing technical courses. Prior to joining Databricks, Kate was the Director of Curriculum and Training...