Learn more about Redash from our summit keynote during this product walkthrough. Redash makes it easy to explore, query, visualize, and share data. Get started easily with simple SQL queries, quickly turn them into dashboards, and develop them into powerful workflows with Alerts and APIs. Data teams can leverage Redash with the complete, reliable, and up-to-date data from Delta Lake as part of a lakehouse architecture. We’ll show you Redash in action, as well as how it’s been used by our data science team to develop market insights.
– Hi everyone, and welcome to our product walkthrough for Redash on Databricks. My name is Jesse Whitehouse. I’ve been part of the Redash team since 2018. I do success engineering, I make videos, maintain our documentation and write a lot of SQL. Joining me is Francois from the Data Science team at Databricks. Francois, would you introduce yourself? – [Francois] Hi everyone, my name is Francois Callewaert. I joined Databricks in March, and I have been working as a data scientist on the data team since then. As a data scientist, I really want to know the data I’m working on deeply, and Redash has been the perfect tool for that, as you will see. – Awesome, thanks Francois. So, here is our agenda for this session. First, I’ll give a brief summary of what Redash is and who it’s for. Slide, all right, there we go.
Here is our agenda for this session. I’ll give a brief summary of what Redash is and who it’s for. Then we’ll quickly walk through the product and its main features. Next, Francois and I will demonstrate a couple of use cases and show you Redash on Databricks in action. And finally, we’ll encourage you to sign up for the private preview of Redash on Databricks coming this summer, and take a few questions. At the end, hopefully you’ll understand why Redash is awesome and a great fit for analysts on the Databricks platform.
So first, what is Redash? Redash is an open source web application. It’s used for querying databases and visualizing the results. Redash on Databricks is now the easiest way to query, visualize and share insights from your Databricks data lake. And it’s built entirely for SQL-native users. At heart, that means we’re a SQL box. So you write a query, you hit execute, you see results; it’s really simple. But we also provide fast visualizations right off the query screen, and dashboards for collecting charts together. Redash also makes sharing your work easy, with direct links that work across your organization. And we support scheduled queries that run in the background, similar to Databricks Jobs. You can also set up alerts based on those scheduled queries if you want to monitor metrics passing a certain threshold. And of course, it’s all driven by SQL queries. So, who is it for? Let’s talk for a moment about Databricks as a whole. It’s a unified platform for data teams, which includes data engineers, data scientists and data analysts. For data engineers, Databricks offers the best Apache Spark experience.
For data scientists, Databricks implements powerful notebooks that can run SQL, Python, R, Java and Scala. These are familiar to anyone doing scientific research or ML work, like Francois, for example. And finally, for data analysts, Redash is your familiar interface to the data. Even for data scientists, we provide the easiest way to share plots and dashboards without sending a notebook link. And of course, once the SQL is written, you can tuck it away in the background so that visitors can focus on the plots. So let’s go through quickly: what can Redash do? Well, first, Redash is tightly integrated with Databricks. You can log in with your Databricks credentials, and tables in your data lake appear inside Redash automatically. With Redash on Databricks, there’s no need to configure database connections or firewall rules, which can be a pain point when using our open source distribution. The Redash SQL editor makes life easy. You write queries in Spark SQL, and there’s a database schema browser, which powers our auto-complete for table names and fields. The table of query results appears just beneath the query editor. But our tables actually have a few tricks up their sleeve, too: you can include links, images, JSON blobs and formatted HTML. It’s also easy to add visualizations to queries right from the query screen, so you can iterate quickly to get just the right data presentation. The most common visualization types are represented: bar charts, line charts, area and pie charts. We also have heat maps, box plots, pivot tables; just about anything you can imagine, we can do it. And once you have something of interest, you can bundle it into a dashboard. We offer flexible execution that goes along with this. There are query parameters that are friendly for business users. You can schedule your queries so that dashboards remain fresh. And you can even configure alerts, like email, PagerDuty or Slack notifications.
So you can back up your queries, trigger query refreshes, or even use Redash as a gateway for query results into tools like Google Sheets. And the best part is that links within Redash can be shared around your organization as a single source of truth. There’s no need to email screenshots or PDFs. For these reasons, Redash is a natural fit with Databricks: we are a powerful front end with a powerful engine behind us. But if you’re in a pinch, Redash also works with nearly 50 SQL and NoSQL databases, and most REST APIs, including its own API. All of which comes together to make Redash a really natural fit for working with the Databricks platform. So from here, let’s do a quick walkthrough of the main features of Redash. Here we are inside the Redash application. This is the main screen that you see when you first log in. Normally you would see my favorite dashboards or favorite queries, but since this is a fresh installation, there’s not much to look at. When you start using Redash, the first thing you’re going to do is usually to write a query. We can do that from the Create menu and hit Create a new query. This is the query screen, which has three main components. The first part is the SQL editor, where you write Spark SQL queries. So if I want to write a really simple query here and get some results, I can do that like this. And if I hit Shift+Enter, Command+Enter or Ctrl+Enter, it will execute the query and show me results. This is the second part of the screen. By default, you’ll see a table, but you can also add visualizations from this tab. You may have noticed when I was writing the query before that as I started typing the name of a table, it was auto-completed for me. That comes from the schema browser, which appears over on the left. Any database or table that exists inside our Delta Lake is going to appear here automatically.
Which is really convenient if you’re loading data from a notebook and you want to quickly investigate it inside of Redash. So we created some sample data that we put in a database called saas demo. You can see the names of the tables here. I was querying the accounts table, which is why it was able to auto-complete for me. So let’s make this query a little bit more meaningful. How about we get the card country for these accounts, which we’ll just call country, and a count(*), which we’ll call number of accounts. Just like that. Then we’ll need to add a GROUP BY, since we’re doing some aggregation. And we’ll ORDER BY column two, descending. Obviously, you don’t have to write the query all on one line; you can hit this format query button, and it’ll make it a lot easier to look at. Now, when we execute the query, we can see all of the accounts that exist in this table, ranked by number of accounts per country. So it looks like the greatest number of accounts come from the United States, then Japan, Germany, Israel and Korea. I’ll go ahead and name this query. We’ll call it number of accounts by country.
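Assembled, the query built up in this step would look roughly like the sketch below. The database and table names (saas_demo, accounts, card_country) are transcribed from the demo narration, so treat the exact identifiers as illustrative:

```sql
-- Count accounts per card country, most common countries first.
-- Identifiers follow the demo narration and may differ in your workspace.
SELECT
  card_country AS country,
  COUNT(*) AS number_of_accounts
FROM saas_demo.accounts
GROUP BY card_country
ORDER BY 2 DESC  -- "order by two, descending": sort on the count column
```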
Just like that. And let’s go ahead and tag this. Let’s just say it’s part of the SAIS demo, that way. I’m done with the query for now, so I can actually just click this button that says show data only. And now let’s add a visualization. We’ll make a bar chart that just shows the country and the number of accounts. By default, Redash is going to sort this axis alphabetically, but in this case, I want to preserve the sort order from the query, which makes it a little easier to read. We’ll call that axis country of origin, and the y axis we can call number of accounts.
Since number of accounts is related to revenue, let’s make that green. As a final touch, since our axis labels are pretty good, we can even hide the legend. That way we get more space in the viewport for showing the chart.
As you can see, when we’re on the view query screen, you don’t see the SQL, which prioritizes showing as much of the visualization as possible. There are a couple of other cool tricks tied to this screen. We can, for example, set a refresh schedule for this query. It defaults to never, but in this case, let’s add a refresh schedule so it runs every five minutes. That way, every time we visit this query, we know that the result is no more than five minutes old. From down here, we can also look at the API key for this query. As I mentioned, everything you can do from the front end, you can also do via the API. So if we copy this results-in-CSV link, we can go into something like Google Sheets and import the data immediately. Just like this. Next, I’d like to add an alert: I’d like to receive an email if our number of accounts goes over a certain threshold. To do this, I’m going to make a couple of edits to this query. It’s useful right now, but I’d like to add one more column. So I’m going to take the original query and wrap it in a common table expression, which we’re going to call base.
We can select star from base, and then we’ll also add a window function that gives us the sum of number of accounts, over (order by (select 1)). And we’ll call this total number of accounts.
Just like that.
There we go. Now I can save this query and show only the data. You can see that the total number of accounts that exist in the table is shown all the way on the right: there are 696 of them, and of those, 269 were created out of the United States. In this case, I’m only going to use this column for an alert, so I can edit the visualization for the table and hide this column.
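The edited query described here, with the CTE and the window function, would look something like this sketch (same illustrative table names as before; a plain `SUM(...) OVER ()` would also produce the grand total):

```sql
-- Wrap the original aggregation in a CTE called base, then add a
-- grand-total column with a window function, as described in the demo.
WITH base AS (
  SELECT
    card_country AS country,
    COUNT(*) AS number_of_accounts
  FROM saas_demo.accounts
  GROUP BY card_country
)
SELECT
  *,
  -- Ordering the window by a constant makes every row a peer, so each
  -- row sees the full sum (696 in the demo data).
  SUM(number_of_accounts) OVER (ORDER BY (SELECT 1)) AS total_number_of_accounts
FROM base
ORDER BY number_of_accounts DESC
```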
Next, to create an alert, I use the Create menu. And I can search for the query I just made. The most recent queries I’ve worked on will appear in this list, but search also works.
I’m going to create an alert based on the total number of accounts. In this case, I’ll set the threshold to 690. This is frozen data, so I know that as soon as this query runs again, I’m going to receive an alert notification, but that’s okay; in a real-life production environment, we’d be using dynamic data. And we’ll set a custom template; we’ll just say congratulations. Ahead of time, I’ve written a little template for myself, so we’ll look at that in a second. Now that the query has been generated, and the alert to go along with it, I can configure a destination. By default, my email address is available, but if the organization configures it, I could also click this add button and send PagerDuty notifications, Slack notifications or generic webhooks. So, the next thing I want to do is take that chart I made and add it to a dashboard. From the Create menu, I can hit new dashboard. And we’ll just call this SAIS dashboard.
Just like this.
So now I’ll add a widget to this dashboard. Just as with creating an alert, my most recent queries appear here, and we’ll be able to pick which visualization we want. Let’s start with the chart that we made earlier. Let’s just make that a little bigger. And you don’t have to use just one visualization from a query; you can have the chart and the table on the same dashboard, which we’ll do here. We’ll make that a little narrower, drop that over here and make these two height-matched. And that’s how easy it is to take a couple of visualizations and toss them onto a dashboard. Now, in Eric’s demonstration a little earlier during the keynote, he showed off a monthly recurring revenue query. I’d like to include some of that data on this dashboard as well, so I can search for it up here.
Here’s his MRR breakdown chart. What I’d like to do is limit this by date, but I don’t want to play around with his query. So I’m going to use this vertical ellipsis menu and fork it. This creates a new copy just for me. I’ll call this MRR breakdown, with date filter, just like that. One thing that we haven’t looked at so far in this demo is how you can insert query parameters into your queries, so I’ll show you that now. A query parameter in Redash is any place where you have double curly braces surrounding a piece of text. I’m using this little button below the query editor to manufacture these for me. So you see I have a time range.start and a time range.end. I could also just create something called param, and as soon as I close the double curly braces, the parameter appears down below. Since this is a date range, I can use both markers to make it work. So we’ll add: and v_totals.month between, and in Spark SQL, I have to wrap dates in single quotation marks, just like that. And now I can select the time range. So let’s say last year; let’s suppose I run this query this way.
And what I can do now is, anytime I need to rerun this query, all I have to do is change the parameter marker. Changing the parameter marker at the top lets me see, well, here’s last year, but I could rerun it and look at this year too. So I’d like to include this table on my dashboard. I can either go to the dashboard screen and add it from there, or I can add it from this screen. I’m going to do that here by clicking this vertical ellipsis and saying add to dashboard. I just look for the SAIS dashboard and hit OK. From the notification that appears below, I can jump straight to the dashboard. Here I’m going to edit the dashboard and just make this widget a little wider so we can see everything. Now for the last part of this demo, as we wrap up: right now, the MRR table has a parameter, but I also want these other queries to use the same parameter. So first, I need to go edit this query manually. Just like I did before, I’m going to use this double curly brace button. We’ll create a parameter called time range, just like this. I’m using the accounts table, and it looks like it has a created at field, so I’m going to go ahead and insert that into the query. And we’ll say where created at is between, and just like we did before, we have to use the single quotation marks. Just like that. We’ll set this to last year just to test it.
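The parameter pattern used in both queries looks like this sketch. The `{{ }}` markers are what Redash generates for a date-range parameter, and it substitutes them as text before execution, which is why Spark SQL needs the surrounding single quotes; the table and column names here are illustrative, following the demo narration:

```sql
-- Filter the accounts table by the date-range parameter.
-- Redash replaces the {{ }} markers with the selected dates as text,
-- hence the single quotes around each marker.
SELECT *
FROM saas_demo.accounts
WHERE created_at BETWEEN '{{ time_range.start }}' AND '{{ time_range.end }}'
```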
Perfect. Now I’ve got a parameter that applies to this one as well. So if I go back to my dashboard, you can see that each of my widgets includes a parameter. For the final part of this, what I’d like to do is make sure that all of these parameters work together, and I can do this by using parameter mapping. We’ll create a new dashboard parameter that we’re just going to call time range. Just like that. And just like that, the parameter has jumped from the widget straight to the top of the dashboard. Now I can select each of the others and use the existing time range parameter widget, and this will link them all together.
And now, if I want to look at this year versus last year, I just hit this year and apply changes, and all of these will update. Now, you may remember that I created an alert, and I had set my query to run every five minutes. I just wanted to show you: if I jump over to my Gmail, it looks like I’ve got a congratulations alert. Hey, there we go, we hit 690 accounts. So we’ve made it.
So that’s been a walkthrough of the basic features of Redash. Now, Francois will take you through a case study from our hackathon earlier this month. – So thank you very much, Jesse. This next demo is the result of a hackathon project that we did recently at Databricks. We are using it today as an illustration of how Redash can help you build powerful insights and visualizations from your data.
And then share them across your company to be more data-driven. So let’s start with a simple question: can we compare the popularity of various big data engines? Here, namely Hadoop, Hive and, obviously, Spark. Let’s try to use public data. A great source here is a tool widely used by data engineers and data scientists: Stack Overflow.
Stack Exchange, the platform behind Stack Overflow, is very nice because they publish a lot of their data on their website. This data is a few hundred gigabytes of bronze-level tables, stored in a couple of archived 7z files. And that’s where the Databricks platform becomes very handy. So I spun up a big cluster with a lot of memory to download those archives, extract them from 7z to XML, and then use the spark-xml library to convert those files into Delta tables. Finally, we have a step to extract the data that we care about, so here we want data about Spark, Hadoop and Hive, clean it, and save the results as a gold Delta table. Then comes the power of the Databricks plus Redash integration: I can immediately go into Redash and query the Delta table. It typically takes just a few seconds. And then I can plot the visualizations and aggregate them into powerful dashboards. So now let me walk you through the process. First, let’s have a look at the data. As you can see in that first query, we have in the table two text fields, the title and the tags, which we use to filter the posts related to Spark, Hadoop and Hive. The creation date column will be useful to study the trends of post creation. We can also use it to distinguish between early questions, which are typically more general about the technology, and the more recent questions, which are typically much more specific. Then we have the view, answer and favorite counts, which help us evaluate the popularity of the posts. So that’s a lot of very nice data to process. Next, let’s start exploring it.
This query allows us to compare the counts of questions, answers, favorites and views per technology, to know which one is the most popular. Please note that the view count, the purple column, has been scaled down by a factor of 1,000 to put it in the same chart as the other three metrics. So when you see 120,000 Spark views, it’s actually 120 million views. And if we compare those three technologies, it’s nice to see that Spark is at the top in every metric. Next, what if we want to understand how those technologies are trending? Then we could look at the trends in the number of questions over time. That’s what you can see in the next query, where we are aggregating by date category and then computing the number of questions per technology based on the tags. Notice at the bottom how noisy the data is here; that’s because it’s aggregated per week. Fortunately, we have this dateAgg parameter that allows us to change the granularity to the weekly, monthly, quarterly or yearly level. So if we select the quarterly level, after a few seconds we will see that the data is much less noisy. And it’s nice to see the trend of Spark growing so much after 2014 to reach a very high level now. Okay, now let’s imagine you want to quickly access all the posts that talk about a specific topic dear to your heart. The problem is, we have hundreds of thousands of posts, so how do you retrieve the ones that you care about? That’s what the next query enables, and the intelligence here is mainly in the WHERE clause, which contains a couple of lines. The first two lines let us filter for a specific word. So let’s imagine we want to know about Delta: we type Delta in the word parameter. Then the next line, technology, lets us select posts related to the Spark, Hadoop or Hive technology, so we will only see the posts here that talk about Spark. And finally, the last parameter lets us select the minimum year.
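The two parameterized queries described above might be sketched roughly as follows. The posts table and its column names are assumptions based on the narration, and `{{ dateAgg }}`, `{{ word }}`, `{{ technology }}` and `{{ min_year }}` are Redash query parameters (dateAgg as a dropdown list):

```sql
-- Trend query: question counts per technology, bucketed by the
-- granularity chosen in the dateAgg dropdown (week/month/quarter/year).
SELECT
  DATE_TRUNC('{{ dateAgg }}', creation_date) AS period,
  SUM(CASE WHEN tags LIKE '%spark%'  THEN 1 ELSE 0 END) AS spark,
  SUM(CASE WHEN tags LIKE '%hadoop%' THEN 1 ELSE 0 END) AS hadoop,
  SUM(CASE WHEN tags LIKE '%hive%'   THEN 1 ELSE 0 END) AS hive
FROM posts
GROUP BY 1
ORDER BY 1;

-- Topic search: posts matching a free-text word, restricted to one
-- technology tag and a minimum creation year.
SELECT title, creation_date, answer_count, view_count
FROM posts
WHERE (title LIKE '%{{ word }}%' OR tags LIKE '%{{ word }}%')
  AND tags LIKE '%{{ technology }}%'
  AND YEAR(creation_date) >= {{ min_year }}
ORDER BY view_count DESC;
```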
So let’s imagine we want to know all the recent posts about Delta. Now we can easily see all the top posts about Spark and Delta Lake published since 2019. And if you look at the answer column, you can see that all of the posts here have at least one answer. So that’s really nice: if you have questions, you can go on Stack Overflow to get them answered. Very good. So finally, the real power of Redash is to allow us to integrate all that data in a dashboard, which you can refresh automatically and then share across your company. For example, if you look at those two graphs, you can easily study the technology trends. So imagine your company is trying to choose its next technology stack for big data. You can use those two graphs to argue that Spark is the obvious choice, which is a good thing for a Spark conference. Now, imagine your company is developing an open source project, and you want your software engineers to know the most common user pain points on your platform. This table is a great place to explore the specifics. Let’s pick a word, say, Redash.
And now we can see all of the posts that talk about Redash. So that’s really nice. We see that we have around 100 posts, so Redash is getting popular. Let’s now click on a link: how to set standard SQL for BigQuery in Redash. We directly reach the Stack Overflow page, so you can see all the details. And if we scroll through, you can see that the first answer is coming from our very own Redash founder, Eric. Small world. That’s it for this demo. Thank you very much for your attention. We wanted to show you what you can build in a two-day hackathon project using the Databricks and Redash platforms. You can easily reproduce this yourself for your own use cases, as it comes from public data. Redash has been a really great tool for me to explore data quickly and in depth at the same time. It’s really super helpful in my work. And for people who have the chance to work on the Databricks platform, I’m really excited to see Redash getting integrated into it, and I hope that you will make the most out of it. Thank you. – Thank you so much, Francois. And that’s it for our walkthrough of Redash. Thank you so much for joining this session. As a final next step, I encourage you to sign up for the private preview waitlist at the link shown here.
It should be available this summer, and we’ll open it up to take a few questions.
Jesse has been a Customer Success Engineer at Redash, based out of Iowa, since 2018. He previously worked as a data analyst in industrial automation.
Francois Callewaert received his PhD from Northwestern University, Evanston, in the field of Electrical Engineering, pioneering the use of computational optimization methods for the design of electromagnetic devices. Francois first joined Microsoft as a data scientist for the Office 365 Customer Success Engineering team, where he developed analytics, models and recommender systems to improve customer support quality. Francois recently joined Databricks as a Data Scientist in the Data Team, where he has been developing analytics tools and models to better understand the customer and user experience. Francois leverages the Databricks platform heavily to process large amounts of data and offer powerful visualizations to the Databricks community. His main interests are machine learning, data mining, modelling, visualization and data services to solve customer-centered business challenges, as well as hiking and biking in beautiful California.