Using Tableau to Analyze Your Data Lake

More organizations are reaping the benefits of the combination of data lakes and data warehouses, in a new architecture known as a Lakehouse. Tableau users can use the new capabilities of Databricks to directly access their data lake and provide high performance access to the most recent data sets from streaming and batch sources. This architecture enables critical use cases like fraud detection and customer analytics that depend on the most recent information. In addition, since more users across the company, including new personas such as business and data analysts, can access all the data, there will be more business intelligence and business insights discovered by bringing them together with data scientists and engineers. See a demonstration of the integration and hear about customers who are using this architecture to drive business results.

Speaker: Blair Hutchinson


– All right, hi everyone. My name’s Blair Hutchinson and you are tuning in to Using Tableau to Analyze Your Data Lake. I’m a Product Manager at Tableau Software and have been with the company almost five years now. I work specifically on our technology partners team and work with our strategic partners, like Databricks, to create solutions that help our mutual customers find success with both platforms. So, I’m really excited to talk to you about that today. I’m also joined by Can from Databricks. Can, I’ll let you introduce yourself.

– Hey everyone. This is Can Efeoglu. I’m in product management at Databricks, looking at our efforts around enabling BI and SQL access over the data lake and making that a great experience.

– Great. So let’s jump into it and start with the agenda for today. First, we’re gonna talk about the Tableau platform. Then I’m gonna pass it over to Can, who’s gonna give an overview of Databricks SQL Analytics. Then I’m gonna walk through a quick demo of what it looks like to connect to your Delta tables in Databricks. And then we’re gonna talk a little bit about some customer stories to pull this all together. I’m really excited about that section, to share what success has looked like for some of our joint customers.
Before I jump into the Tableau platform, I just wanna share a statistic, which is that 98.6% of executives indicate that their firm aspires to a data-driven culture, while only 32.4% report success. So that means there’s nearly unanimous support for the idea that creating a data-driven culture is important, but many companies are struggling to realize that vision.
Well, here at Tableau, our mission is simple. We help people see and understand data. Those seven words have guided us since our inception about 17 years ago and have led us through what was really a disruption of the entire BI and analytics industry when I joined the company just a few years ago. And the people part of the mission has always been central to what’s important to us: making sure that analysts have what they need to explore all of their data without limits.
So let’s talk a little bit about the Tableau platform on this next slide. This really encompasses everything that we deliver. We may have started out as a desktop analytical tool, back when I started at the company in 2016, but we’ve grown into so much more than that. Today the platform brings together Tableau Server, Tableau Online, our web authoring experience in the browser, Tableau Prep Builder, Conductor, our embedded SKU, Tableau Mobile, and the rest of the great stuff that we’re delivering and continuing to work on.
We really deliver the breadth and depth of capabilities that our customers need for their end-to-end analytics experience. And we understand that data is really a mission-critical asset for your organizations, which is why we built the platform the way we did: to ensure that you have everything you need to empower everyone with data without putting your data at risk.
So as you can see, I call this kind of the cake of the Tableau platform. Collaboration over data can happen anywhere, and that starts with really high-quality analytics. With great power comes great responsibility, and it’s pretty incredible what the Tableau community is able to do with our products. We’ve introduced incredibly powerful content discovery tools just in the last year. Governance continues to get better, as do our data preparation tools. And last on this list is data access, which I wanna come back to. And obviously there’s the flexibility to keep your data wherever it lives and to host Tableau Server in whichever environment best suits your needs, whether that’s on premises where your data is or in the cloud. We also have a SaaS offering, Tableau Online.
But going back to that data access piece, that’s really where the magic happens. You can’t see and understand your data unless you’re able to connect to it. And Databricks has been an incredible partner in this space. In the last year, we were really, really excited to announce our Databricks connector. So the Databricks connector, first and foremost, is about performance. That’s really what Databricks is founded on: being able to access data faster than ever before. If we compare with what the experience was before we had the Databricks connector, we’re now able to realize a 12x improvement in initial connection speed. Our SQL generation has improved by 30%.
We have a simplified connection interface as well, which makes it really easy to connect directly to the clusters that are stood up for your analytical needs. And it’s also available in all of our products, whether that’s connecting directly in the web, in our desktop products, or if you’re looking to do data preparation as well. The Databricks connector can suit your needs, whatever they end up being in Tableau. And I’m also really excited that we’re continuing to work with the Databricks team to refine this connector, make it even better, suit more use cases, and continue to broaden adoption. So at this point, I’m gonna pass it over to Can, who’s gonna talk a little bit about the Databricks environment, and then I’ll switch over to a demo. But Can, I’ll pass it over to you first.

– Hey, thanks Blair. So, let me give you guys a quick overview of SQL Analytics: a few of the things we’re thinking about around the vision, as well as the different features that support that vision. Before we even get started on individual features, one thing I want to clarify at a high level is what a lakehouse is, since you’ve been hearing the term throughout this summit. A lakehouse is basically a data platform that enables a variety of use cases, not just a few of them, including data science, machine learning, and ETL, as well as BI directly on top of Delta Lake. So you end up with fewer silos and fewer different systems that you’re going back and forth between, and you get the utmost productivity from working in a seamless, coherent environment.
Today the piece we’re gonna focus on is what sits on top of the Databricks platform, right? The things we’re doing to enable the lakehouse paradigm from a SQL perspective: BI dashboarding, reporting, SQL access in general. We call this initiative, at a high level, SQL Analytics.
Our vision here is basically threefold, right? The first thing is, we wanna open up the data lake to companies so that a variety of users, not just engineering folks, can effectively leverage the data within it. So we’re working with partners like Tableau, as well as doing things in our own backend, to make this a seamless experience. At the same time, we’re building a new experience as part of our UI. Our data scientists, data engineers, and other more technical folks have our workspace, which is where you do your notebook-based analysis and where you can create Spark jobs. In the same way, we wanna give our SQL users something that they can feel at home with.
That includes a SQL workbench type of experience, with capabilities around ad hoc querying, saving queries, dashboarding, and things like that. And finally, we wanna make all of these experiences easy to use with minimal setup, so that you can come in and enable your folks on top of the data lake fairly easily, all with the price performance that you would expect from a data lake.
So let’s now talk about the different pieces that we’re working on to get to this vision, right? We look at the world through three different lenses, in a sense. We have the SQL user experience, we have the things we’re doing to make the life of admins a bit easier, and we’re also doing a lot of work to make performance seamless and easy out of the box.
So the first topic that I wanna cover is what we’re doing with the SQL user experience. To be honest, the first and foremost part is making sure that we work with partners like Tableau so that Tableau and Databricks together are great to use. The second thing we’re looking at is enabling a native, first-class experience within our UI for SQL users. Just like data scientists or data engineers have the workspace, a person who wants to do ad hoc SQL, save their SQL queries, and do some light dashboards and visualization also has those capabilities built right into our own interface. And we look at these as different product lenses, all built on top of the same data lake.
So it’s the same datasets and a standardized governance layer, just offering different lenses depending on the application you have in mind. You still get the benefit of a unified platform, but you have the simplicity and context of working on a specific use case, or something that requires a certain skill, without having to go to different tools. And SQL Analytics provides just that for SQL users. We have an integrated SQL editing experience, including a catalog where I can browse my metadata on top of the data lake, all in a secure fashion. I can run and version my SQL queries. And I can also get quick insights by doing some light dashboarding as well as schema exploration, all through this integrated SQL editing experience. So again, you can create insights without having to move too far away from your data lake. This is also a great experience for folks who wanna create things that can be consumed from a BI tool directly. So imagine a user consuming some of the metadata assets, or building other assets for consumption from a BI report, in a fashion that makes it [inaudible].
Another thing, by the way, that I didn’t cover is that this is all tightly integrated with the security side as well. Whatever data security mechanism you have for your existing tables will apply here too, and it will apply all the way into the BI tools as well. So you basically have a model where you set the security on the data lake, and that’s enforced across the board, whether you’re working from the integrated SQL Analytics UI or from any of the clients that SQL Analytics supports.
Switching gears a bit to the SQL admin experience: we’re looking to make sure that going from zero to 10, enabling a team of users or your whole organization on top of the data lake, is as simple as possible.
So we’re introducing a new compute option called SQL endpoints to get us closer to this, right? SQL endpoints provide scalable, SQL-optimized compute that doesn’t require you to input any Spark configuration. You don’t have to build against the stack. You don’t have to do a bunch of configuration. And Databricks will ensure that, for the t-shirt size you pick, the underlying instance and Spark configurations reflect the best price performance available. This is something we can guarantee for a given t-shirt size, without you having to experiment or keep switching.
Another thing that builds on SQL endpoints is what we’re calling concurrency scaling, right? This is achieved by our multi-cluster load balancing technology. So as opposed to just adding more worker capacity, we can now also scale things linearly. As you go from 10 to 20 users, to 100 or 200 users and even more, the system can scale with that without you having to worry about it. The system will recognize the load of queries, and SQL endpoints will automatically manage as many clusters behind the scenes as needed to meet the target you need. You’re still integrating with a single endpoint: you’re sending all of your Tableau queries and your SQL queries from the UI directly into this one interface. It handles the rest behind the scenes and makes sure that you’re getting the best of both worlds.
Beyond that, there’s everything that you’re running through the SQL endpoints. Let’s say you’re running a Tableau report and you’ve run some SQL queries within the integrated SQL client. As you run these things, we log all of them centrally and make that available to administrators and auditors in an easy-to-use UI, where all of this is filterable and searchable.
So you can come in and easily understand the usage across different SQL endpoints as well as different users. And let’s say you’re having a problem with an individual report and you wanna understand what’s going on with the query. You can click into any of these queries, and we’ll provide a high-level overview: the run time, which client it came from, the source of the query, and all of these high-level details, as well as a breakdown of the execution details, so that you can pinpoint something very quickly without necessarily having to go through mountains of information. We also give you more details one layer down, but the goal is to get you to the insights as quickly as possible.
And the last topic I wanted to cover around SQL Analytics is what we’re doing on performance. SQL Analytics incorporates many of the technologies and pieces that you’ve probably heard announced in earlier sessions. But I’ll give you a quick overview of how we think about performance and the different pieces that we’re actively working on right now. We look at it from the standpoint of the life of a query, right? The query starts all the way on the left-hand side here, and then it hits different pieces. There are different areas that we’re optimizing before it hits the Delta Lake. It starts with the BI and SQL client connector layer. Then you have the ODBC/JDBC layer, which we call the network layer. After that, you’re in our control plane, right? This is where the SQL endpoints come in and give you a scalable routing service, so that you never have to worry as much about concurrency or scale.
And then finally, you’re in Spark and Photon land, planning and executing your queries before finally hitting the Delta Lake and then returning the insights or results to the end user. There are a few things we’re looking at across the stack. One of them is, of course, usability, right? That’s actually more important on the client connectivity side than anywhere else in the platform. Beyond usability, there’s price performance, so that regardless of what type of query you’re running, you have the confidence of, “Hey, I’m actually getting market-leading price performance.” And third, making sure that you’re getting low latencies regardless of the concurrency level you’re targeting. You’ll see us come out with different features and capabilities to target these different areas. Thank you all. Blair, it’s all yours again.
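
The concurrency scaling idea Can describes (a single SQL endpoint that adds clusters as load grows and balances queries across them) can be sketched in a few lines of Python. This is only a toy model and not how Databricks implements it: the per-cluster capacity, the cluster cap, and the function names `clusters_needed`, `route`, and `submit_all` are all assumptions made up for illustration.

```python
import math

# Hypothetical tuning values; the talk does not give real limits.
QUERIES_PER_CLUSTER = 10  # assumed concurrency one cluster serves comfortably
MAX_CLUSTERS = 8          # assumed upper bound on clusters behind one endpoint


def clusters_needed(concurrent_queries, per_cluster=QUERIES_PER_CLUSTER,
                    max_clusters=MAX_CLUSTERS):
    """Scale out in whole clusters, keeping at least one and at most the cap."""
    needed = math.ceil(concurrent_queries / per_cluster)
    return max(1, min(needed, max_clusters))


def route(cluster_loads):
    """Pick the index of the least-loaded cluster for the next query."""
    return min(range(len(cluster_loads)), key=cluster_loads.__getitem__)


def submit_all(num_queries):
    """Send every query to one logical endpoint and return per-cluster load."""
    loads = [0] * clusters_needed(num_queries)
    for _ in range(num_queries):
        loads[route(loads)] += 1
    return loads


if __name__ == "__main__":
    for users in (10, 20, 200):
        print(users, "concurrent queries ->", submit_all(users))
```

The caller only ever talks to `submit_all`, which stands in for the single endpoint; how many clusters sit behind it is decided from the load, which is the property the talk emphasizes.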

– Can, thank you so much, that was a great overview. Now, before I jump into the demo, let’s talk a little bit about the movement of data from source, into Databricks, and then into Tableau. The example that we’re gonna use today is using data to predict fraud. Our goal is to provide a dataset for people to explore fraud that’s happened, and explore the why, but then also to predict which customers are potentially susceptible to fraud. On the far left-hand side, we’ve got payment networks. That’s gonna be card transaction data, and you can think of that as streaming data entering the Databricks environment. Then we’ve got our databases and CRM information. That’s all been landed in a Delta table in Delta Lake. Then we can apply MLflow to that in order to predict which customers are susceptible to fraud. Once we’ve created those tables and done that cleaning, we’ll use SQL Analytics to connect to Tableau. And that’s when we’ll start to do some visual analysis, with the goal of making something that’s visual, interactive, and actionable for someone who’s working as a call center analyst or something like that and would be trying to detect this.
So with that, let’s jump into Tableau for a quick demo. We’re gonna start out in Tableau Desktop. This is our primary authoring tool, and I’ve opened up our Databricks connection dialog here. All of this information can be sourced from the Databricks platform. What I’m doing is using the SQL endpoint made available by SQL Analytics to connect to my Delta tables in Databricks. So I’m gonna paste my access token here, and actually I had already gone and done this. Before I jump into the analysis that I’ve already started, I want to highlight the difference between live and extract.
A number of our customers, and we’ll actually cover a few of them in the customer case studies in a second, use a hybrid approach of live and extract. Live, as you would expect, sends queries directly to the Delta tables. Extract sends one large query, pulls that whole table into memory, into what’s called a .hyper file, and uses the Tableau data engine to query that information. You can then schedule those extracts to refresh with Tableau Server.
So we’ve made our connection, and I’ve taken the liberty of starting us along our journey of analyzing this data. We’ve created a dashboard that gives us a 360 view of some fraud patterns that might be happening. We’ve got our list of customers. We’ve also got the states that are seeing the highest amount of fraud. To me, it looks like Texas, California, and some of the states on the East Coast are the most problematic. And I wanna see if that’s a localized geographical issue or if it’s simply widespread throughout the state. I can take action on that directly in Tableau by exploring this in an ad hoc fashion. So I’m gonna bring out amount and throw that onto size. And for this, I’m just gonna focus in on Texas. Now I can drill down into the different layers of my location data, and I can see this actually looks pretty widespread, with definitely some hotspots around Houston and maybe the Dallas area. As you can see, I’m answering these questions at the speed of thought, just by dragging and dropping and drilling into this information.
So my next step here would be to publish this dashboard and make it more widely available for our call center folks, who are on the front lines, to see whether customers are more susceptible to fraud or not. To do that, I’d jump over into Tableau Server, and here I can set subscriptions.
I can also set alerts, so if certain individuals cross a threshold, I’m immediately alerted. I can have a conversation around the data too: I can add comments on things that I’m seeing and call out particular people to have a discussion around it. I could also go through and edit this directly in the web, as I mentioned. That way, I can potentially even connect to more data, more Databricks tables, to bring into this analysis that I’m doing. So that was a quick demo. I just wanted to show you what that connection experience could look like, and then how you take that into exploratory analysis in Tableau.
Okay, so now let’s switch over and talk about some customer case studies, starting with Flipp. Flipp is a retail tech company that helps shoppers provide for their families, making life more affordable by reinventing the way that people shop. To do this, they work with large retailers and manufacturers to connect them to tens of millions of shoppers through their digital shopping marketplace. When we talked to Flipp, we heard that their previous architecture was just not up to the standards of the business, and they were unfortunately seeing little engagement with the data. Data flows were coming from two different places. They had content data coming from their retailers, and they were also gathering event data directly from their applications. That event data was all unstructured and required quite a lot of cleaning and normalization, which meant that analysts were reliant on the data engineers just to ask some basic questions. They also ran into scalability issues because the storage and compute for their relational database were shared. This all ultimately led to poor performance and, like I said, little engagement. Their new strategy, though, meets those requirements of the business.
It ultimately ended up simplifying the process, and you can see that process at the bottom of the screen here. Their goal was to create a culture of self-service analytics, and they identified five key metrics to track to make sure that they had accomplished their goal: schema governance for event data, decoupling storage and compute, overall better performance, data discovery, and paying per use. And they were able to accomplish that with a combination of Tableau and Databricks. One thing that I’ll also highlight is that they enable both live connections and extracts in their Tableau environment. It’s part of their strategy. Their high-level use case dashboards, the ones that are really high touch and more of a reporting fashion, use extracts. The more granular exploration that they enable the business to do is live, because it touches lower-level data that’s coming in more frequently and that they need immediate access to. So really great use case, and really awesome to see the success that Flipp has had.
Our next one is Wehkamp. Similar to Flipp, they’ve really empowered everyone within their company to use data. Wehkamp is an online retailer based in the Netherlands. Just to give you a sense of scale, they have about 400,000 products. They have 500,000 daily visitors to their site. They do 650 million euros in sales, and they ship about 11 million packages every year. So Databricks and Tableau are embedded in every department that they have. And they have instilled a culture of data literacy that goes all the way from the C-suite down to the folks working in the warehouse. It’s pretty incredible. The way they’ve done this is with a motto of “sharing is caring.” I love this. They’ve created user groups to explore how dashboards are created.
They create case studies within the business to show how work has been completed using Databricks and Tableau. And they even host these fun hackathons where you have about two hours to complete a dashboard with a given dataset, and whoever creates the most compelling story with the data wins the prize. I love that story. They also have this motto that there is end-to-end responsibility for folks working in both products. So you have folks curating datasets in Databricks, and those results manifest themselves in tables that are usually analyzed in Tableau. About 65% of dashboards in their Tableau environment are connected to Databricks. And while that may seem low to some people, having worked in sales engineering at Tableau two years prior to becoming a product manager, I can tell you this is pretty incredible. That’s one of the higher percentages I’ve heard of for a single data source that’s not a flat file. So that’s incredible adoption.
They’re also looking at billions of rows in some of these dashboards. When they demoed this to us, they shared an example where they were clicking through from high-level metrics, looking at their top-level categories, of which there are about seven (you can think of women’s fashion, men’s fashion, furniture, et cetera), and drilling all the way down to the 400,000 different articles within those categories. They do this not just because they can, but because they want to be able to answer questions directly as they come up in their meetings. They have these dashboards where they start at a high level and ask questions about anomalies that they’re seeing. Then they’re able to figure out the root cause of what might be going on by drilling down to really fine detail in real time, with great performance, right there in the meeting, so they don’t need to schedule another follow-up or go collect more data to answer that question.
So really, really great story with Wehkamp. The final one that I wanted to share today is from the U.S. Air Force. The problem that they originally ran into was that there were so many different types of technology. They had a multitude of hosting environments, different rules for access, and different levels of maturity throughout the organization. Data kept multiplying faster and faster, and they really couldn’t keep up with it. So they needed a combination of technologies that would help shine a light into corners of their organization that they had never explored before and, again, democratize that data so that more people could be empowered to ask and answer their own questions.
My first bullet here is mission VAULT. The mission came from the chief data officer, who wanted to empower the Department of the Air Force to harness data for competitive military advantage. So you can imagine this was a pretty mission-critical undertaking. VAULT is actually an acronym, and it stands for visible, accessible, understandable, linked, and trusted. What that all boils down to is that they wanted to create a data culture by increasing data use and literacy in order to make efficient and effective decisions. And they needed the data architecture to support that. The end result was that they were able to enable 40 different organizations within the Air Force and Space Force to do their data analytics. They ingest 65 million records per quarter from different sources, process them in Databricks, and visualize them in Tableau to inform leadership on a daily basis. So like I said, the way they’re using Tableau and Databricks to see and understand their data has become a mission-critical part of the Air Force.
I love sharing these customer stories, and I want to revisit the statistic that I started out with, which was that 98.6% of executives indicate that their firms aspire to a data-driven culture, while only 32.4% report success. I hope that through sharing how Databricks and Tableau can work together, and sharing some customer success stories, you’re now starting to see the light at the end of the tunnel for how to make that a reality in your own organization. So Tableau and Databricks are a great solution together. We’re continuing to work hard to bring good solutions to our customers. Thank you so much for having me, and have a great rest of your day.
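
The live versus extract distinction from the demo can be made concrete with a small Python sketch. It is only an analogy under stated assumptions: an in-memory SQLite database stands in for the Delta table behind a SQL endpoint, a second SQLite database stands in for the extract (a real Tableau extract is a .hyper file queried by the Tableau data engine), and the `transactions` table, its columns, and its rows are invented for illustration.

```python
import sqlite3

SCHEMA = "CREATE TABLE transactions (state TEXT, amount REAL, is_fraud INTEGER)"
FRAUD_BY_STATE = (
    "SELECT state, SUM(amount) FROM transactions "
    "WHERE is_fraud = 1 GROUP BY state ORDER BY state"
)


def make_source():
    """Stand-in for the source table; the rows are made-up sample data."""
    source = sqlite3.connect(":memory:")
    source.execute(SCHEMA)
    source.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [("TX", 120.0, 1), ("TX", 40.0, 0), ("CA", 75.0, 1)],
    )
    return source


def refresh_extract(source):
    """Extract: one large query copies the whole table into a local snapshot."""
    extract = sqlite3.connect(":memory:")
    extract.execute(SCHEMA)
    rows = source.execute("SELECT state, amount, is_fraud FROM transactions")
    extract.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows.fetchall())
    return extract


def fraud_by_state(connection):
    """Run the dashboard-style aggregate against whichever store is passed in."""
    return connection.execute(FRAUD_BY_STATE).fetchall()


if __name__ == "__main__":
    source = make_source()
    extract = refresh_extract(source)
    # New data arrives at the source after the extract was taken.
    source.execute("INSERT INTO transactions VALUES ('CA', 25.0, 1)")
    print("live:             ", fraud_by_state(source))
    print("stale extract:    ", fraud_by_state(extract))
    print("refreshed extract:", fraud_by_state(refresh_extract(source)))
```

A live connection sees the new California transaction immediately, while the extract keeps answering from its snapshot until the next refresh. That trade-off is why the hybrid approach in the talk puts reporting dashboards on extracts and granular exploration of fresh data on live connections.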

About Blair Hutchinson


Blair is a Product Manager at Tableau and works with strategic technology partners, like Databricks, to help delight customers looking to get the most out of both platforms. Blair’s been at Tableau for almost five years and has taken his Tableau product knowledge to Seattle non-profits and to the classrooms at the University of Washington. Away from Tableau, you can find Blair in the Cascade mountain range climbing and skiing in the backcountry with friends and family.