Speeding Time to Insight with a Modern ELT Approach


The availability of new tools in the modern data stack is changing the way data teams operate. Specifically, the modern data stack supports an “ELT” approach for managing data, rather than the traditional “ETL” approach. In an ELT approach, data sources are automatically loaded in a normalized state into Delta Lake, and opinionated transformations happen in the data destination using dbt. This workflow allows data analysts to move more quickly from raw data to insight, while creating repeatable data pipelines that are robust to changes in the source datasets. In this presentation, we’ll illustrate how easy it is for even a data analytics team of one to develop an end-to-end data pipeline. We’ll load data from GitHub into Delta Lake, then use pre-built dbt models to feed a daily Redash dashboard on sales performance by manager, and use the same transformed models to power the data science team’s predictions of future sales by segment.

Speakers: Sean Spediacci and Amy Deora

Transcript

– Hi everyone and thanks for joining us for this talk about speeding time to insights with a modern ELT approach. We’re going to be talking about how to combine a couple of tools together: Fivetran, Databricks’ new SQL service and dbt, the data build tool by Fishtown Analytics. My name is Sean Spediacci. I’m the senior product marketer here at Fivetran and I’m joined by Amy from Fishtown, the maintainers and keepers of dbt, and she is the Head of Alliances over there. So to get started, let’s talk about how we kinda got here. When we’re trying to speed time to insights, specifically with these modern approaches, we have to think about how the different systems have evolved over time. In 2010, the focus of many analytics teams was just to build systems that could keep up with the volume, variety and velocity of data, the three V’s that we’ve always heard about, and we always had to figure out how to control those storage costs and how well the data warehouse should be performing. We have to think about whether we’re following the Kimball model. There are a lot of things that we have to consider when building these systems, but ultimately we need to see how the landscape has shifted. Where we started was single-node databases where we had this constraint on compute. They were generally on premise and we had to pay for them in perpetual licensing. And because of their limitations on storage and compute, we had to do the ETL outside of the database. We had to buy whole new systems, paying through the nose for perpetual licensing, and we decided that eventually we needed to get to scalable database systems. How do we do this separation of storage and compute? You can think of this as the Hadoop-based world, where we’re able to scale up compute and storage independently, but we were still paying for them in subscription licensing, and we were still building complex pipelines in order to analyze these large data sets. But here come modern cloud destinations, where we’re able to auto-scale these systems and we’re paying only for what we’re using. So I’m no longer having to provision my systems throughout the year and figure out, do I have enough for the demand that I’m gonna see a year out from now? Can I figure out that my team is going to need more compute at a certain time? I can independently scale that at any point in time. So that separation of storage and compute allows for agile data pipelines, but what does that exactly mean? What does agile get you? Before, we had an ETL world; now we have ELT, where you’re extracting the data, you’re loading it into these modern cloud destinations, and then you have your separation of storage and compute where you can independently scale up your compute, and your storage is as cheap as it’s ever been. So because of that, we’re actually gonna start doing transformations inside of the warehouse, where before we had to do that outside. We’re no longer red-lining our box; we’re doing that inside. And we’re generally using the power of SQL to do that, to create these clean, usable data sets. Now, from an architecture standpoint, this is generally what we’re looking at.
You have Fivetran, who’s bringing in all of these disparate systems: the databases, the applications, the files, and we’re doing it in an automated way where you’re no longer having to worry about building and maintaining individual pipelines for each one of your systems or each one of your groups. And then we’re delivering that data to the destination in an ELT format, in these normalized schemas that you can start to analyze right away. And then we’re using the power of dbt to actually do the transformations inside of powerful destinations like Delta Lake, and we’re also enabling powerful new processes that borrow from DevOps. We’re bringing this analytics engineering mindset, version control, documentation, logging, everything that you would generally see in a DevOps environment, to the analytics environment with the power of dbt. And Fivetran is automating the first step of just loading that data in, so that you can eventually feed business intelligence and data science. And what we have here with Fivetran is that we’re taking advantage of the modern cloud data warehouses, and we’re doing a lot more than what you’ve typically done in an ETL environment. We’re taking these pre-built connectors for applications, databases, events, files and functions, and we’re normalizing that data and bringing it into the destination in an analysis-ready schema where you can do those SQL transformations. Now, along the way what we’re actually doing is automatic data updates, where we’re pushing the DML statements directly to the destination, and we’re also handling schema migrations, pushing any DDL statements on your behalf. There’s no user intervention required here. And in terms of how the system is built, it’s idempotent. So we’re able to recover from failure at any point in time, and if we run into an issue with the source system or even the destination, we know our point of last success and can restart from there. And that’s because we’re doing this in a micro-batched architecture where we can take our last point of success and then just sync whatever has been updated. And we’re doing this with a log-based change data capture approach whenever possible. And if we don’t have a pre-built connector, we can build new connections with extensible cloud functions. And ultimately, because we’re taking advantage of that ELT approach, we need to also consider the transformations and moving the transformations into the warehouse. By moving this process to the end, we’re no longer locking in our data set, where we’re building these very limited data sets and saying this is the exact data I need for this specific report. We actually have access to all of the raw data. So data transformation is more accessible to the whole data team, especially BI developers and data scientists, because we have all the data accessible in the destination. And we’re also saying that we can take advantage of the power of the compute in the destination, so we can transform data and build new reports, and we don’t have to wait for the ETL process that we generally had to work with other teams to do. And this is where I’m gonna pass off, and we’re gonna learn more about the transformation step with dbt.
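To make the “transform inside the destination” idea concrete, here is a minimal sketch of the kind of SQL that now runs directly on the warehouse’s own compute. The schema, table and column names are illustrative assumptions, not an actual Fivetran connector schema:

```sql
-- Illustrative only: assumes Fivetran has already loaded and normalized a
-- hypothetical salesforce.opportunity table, and that an `analytics` schema exists.
CREATE OR REPLACE VIEW analytics.daily_sales_by_owner AS
SELECT
  owner_id,                                    -- the manager/rep who owns the deal
  DATE_TRUNC('DAY', close_date) AS close_day,
  COUNT(*)                      AS deals_closed,
  SUM(amount)                   AS total_sales
FROM salesforce.opportunity                    -- kept up to date automatically by Fivetran
WHERE is_won = TRUE
GROUP BY owner_id, DATE_TRUNC('DAY', close_date);
```

Because the raw, normalized tables stay in the destination, a query like this can be rewritten or extended at any time without waiting on an upstream ETL job.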

– So as Sean mentioned, for the T step of our ELT data workflow, we utilize dbt. So what is dbt? dbt is an open source data transformation tool that lets anyone comfortable with SQL author their own data pipelines. And to understand dbt, it’s also important to understand what dbt is not. So dbt isn’t a data ingestion tool. It doesn’t pull data into your data warehouse. It doesn’t take data out of your data warehouse. It basically utilizes the compute and storage in your Delta Lake or your data warehouse to do data transformations. And dbt also isn’t just a drag-and-drop tool. It’s a tool made for folks to work in SQL and to construct data transformations in code. So dbt, as we talked about, is made to build data transformations in SQL, but it provides some tooling and some functionality around that SQL that helps data analysts and data engineers work more like software developers, which is what we call the analytics engineering workflow. So first off, it helps users write SQL with things like control structures, like loops and if/then statements, a lot of the things you might miss if you’re used to working in, say, Python or R. So it provides that kind of programming environment around writing SQL. It also wraps all of your SQL in the right DDL and DML so that you can materialize data models in your warehouse of choice without having to think about that piece. And probably one of the biggest value adds of dbt, to folks that are used to working in data engineering or data analysis, is that it infers a data lineage graph as you code. So you can include a reference function in one particular data model, and that reference function is enforced every time you run that model, meaning that you don’t have to tell dbt what order to run your models in. It infers that automatically and creates those edges in your DAG as you code. And dbt also supports best practices that support CI/CD, so multiple environments and git-based version control. So for most users, dbt is gonna help you set up a sandbox environment where you can materialize your models. You can pull down a branch from your github code and develop there. And then as you commit changes to a master branch, you can also run that code in different environments, let’s say a dev test environment or finally a production environment, working closely with your team. And dbt also integrates testing into your pipeline and automates documentation as well. So now I’m gonna talk a little bit about how dbt fits into this modern data architecture that Sean was describing. So, first off we have Fivetran that’s gonna pull in data from multiple sources in an automated way. Fivetran is gonna normalize that data, deduplicate that data, clean it and merge it into a consistent schema in a cleansed data bucket on Delta Lake. Then you can use dbt to connect to that cleansed data bucket and to transform your data, in that very raw but clean state, into a set of data models that are ready for your BI team or your data science team. And you can do that in terms of setting up data marts, or setting up data hubs if you want to use a data vault kind of approach. And although we’re gonna talk about a workflow today that’s made for a data analytics workflow that ends up in a dashboard, you can use this same approach to build clean data sets for your data science team as well.
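As a rough illustration of those ideas, here is a minimal, hypothetical dbt model. The model and column names are made up for this sketch, but the {{ ref() }} call and the Jinja loop are the standard dbt mechanisms being described:

```sql
-- models/pull_requests_by_repo.sql (hypothetical model, for illustration only)
-- The ref() call below is what builds the lineage graph: dbt now knows this model
-- depends on a (hypothetical) stg_pull_requests model and will build that one first.
{{ config(materialized='table') }}

select
    repository_id,
    {# a Jinja loop -- one of the control structures plain SQL lacks #}
    {% for pr_state in ['open', 'closed', 'merged'] %}
    sum(case when state = '{{ pr_state }}' then 1 else 0 end) as {{ pr_state }}_pull_requests
    {%- if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_pull_requests') }}
group by repository_id
```

At run time dbt renders the Jinja, wraps the resulting SQL in the DDL and DML for the chosen materialization, and executes it against the warehouse, so the analyst only ever writes the select statement.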
So setting up a new data architecture in any company is gonna take some time, but we wanted to show you a quick demo of how you can speed time to insight and set up a data pipeline to start working with a new ELT workflow. So we’re gonna do this by ingesting data through a Fivetran automated connector. We’re gonna connect dbt to Delta Lake to start transforming our data and put that into a set of tables that my data analysts can then work with on the new Databricks SQL service and build a Redash dashboard, so that I can get some insight. And to do this, we’re gonna use a Fivetran dbt package. So dbt packages are a set of open source tools that help you get started with building dbt data models. And Fivetran has published a wide range of packages that work with the different data schemas in which your data comes in from Fivetran. And Fivetran has committed to building about a package a week, with the goal of building out dbt packages for a really wide range of their connectors. So now I’m gonna show an example of how you can set up a data workflow using the ELT structure that Sean and I just described, and this uses Fivetran and dbt on Delta Lake and the new SQL analytics workspace in Databricks. So this isn’t meant by any means to be a full tutorial on any of these particular products. It’s just a quick tour to give you a sense of how the different pieces work together and an idea of how you can quickly jump-start the process of getting started with dbt plus Fivetran dbt packages, and we have links in the presentation to places where you can find more detailed information about both dbt and Fivetran. And my colleague Jeremy Cohen is also doing a deep dive specifically on dbt in another presentation at the conference. So at Fishtown our mission is to support the open source dbt project. So one way we look at engagement in the open source project, and at the workload of our team, is by monitoring activity in Fishtown’s github repos. So let’s say I wanted to track some metrics on trends in pull requests and issues submitted. I can do that by accessing my github data, and I wanted to visualize those trends in Databricks. So my first step is going to be to get my github data into Delta Lake. So to do this, I need to set up a github connector in Fivetran. I can do this through a really easy visual interface in Fivetran, and one thing I can say is I had actually never done this before I worked on this demo, and it was really easy to set up. So first I set up Databricks as my destination and then I need to set up my connector on github. I can tell Fivetran exactly which schemas I wanna sync and actually which tables I wanna sync. I can go ahead and set this up to replicate on any type of a schedule; I chose every six hours. And then I can actually see how this is progressing. So for example, I can see that over the past few days this has been replicating. I had a little bit of a booboo here. I see this log where I had a failure, and anybody who has worked with data pipelines knows that they do fail. What’s great about Fivetran, and really useful, is the workflow is idempotent, which basically means that when you have a failure, you don’t end up with kind of a half-baked mess in your destination. Fivetran is smart enough to quit and start over, and so you have fully synced, correct data every time. And I can see here that this takes about 14 minutes to fully replicate all of my data into Delta Lake each time.
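Once the sync has finished, the loaded tables are ordinary tables on Delta Lake, so a quick sanity check is just a query. The schema and table names below are assumptions based on a typical Fivetran github destination schema, not the demo’s exact layout:

```sql
-- Hypothetical sanity check on the Fivetran-synced github schema.
SELECT 'pull_request' AS source_table, COUNT(*) AS row_count FROM github.pull_request
UNION ALL
SELECT 'issue'        AS source_table, COUNT(*) AS row_count FROM github.issue;
```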
So now I can go and see my github schema on Delta Lake. And the product of this EL step in my ELT pipeline is basically a very complete but very raw set of tables. So the next step that I need to do is go through that T process so that I can transform these raw tables into a format that I can use for metrics and business intelligence. And I’m gonna do that through dbt. So the first thing that I wanna do is get a jump-start on my dbt modeling, and I wanna do that by downloading a dbt package. So dbt packages are here on the dbt hub. These are all open source packages, a set of tools that help you do common things that you need to do in your dbt project. There are two types of packages. Some are just utility packages that help you do things like work with dates and times more easily. And then there’s a set of packages that actually develop kind of sample dbt projects, with a set of models on top of a particular data type. So for example, the packages that we have from Fivetran here are of that second type. These are packages that basically work with data in the schema in which the data lands in your database from the Fivetran process and build a set of data models on top of that. So I have my github package here, and I can see I have some installation instructions. So I need to add a file called packages.yml to my dbt project. I have some documentation down here and a few things I also need to add to another configuration file. So now the next step is that I want to go to dbt and start modeling my data. So you can do this next step in either open source dbt or in dbt cloud. And in dbt cloud, you can either work in the command line or in the integrated development environment, or IDE. So I’m gonna develop here in the IDE. So if you’ve used dbt before, you might say, “Hey Amy, you’re using Databricks so you need to connect through Spark and you can’t do that through dbt cloud.” Until recently you were right, but we recently added support for Spark to dbt cloud, and we have a beta version of a connector using the new Databricks SQL service driver, so that we can set this up just like we’d set up any other connection in dbt. So to quickly orient you to the IDE, this is my develop screen and I have three parts of the screen. One of them is just the files in my dbt project, which are version controlled using git so that I can communicate with github here, commit new changes or even submit a pull request. Here I have my interactive development environment, so I can interactively run SQL, look at what my data models are doing and see their results down here. And then finally I have the place where I run my dbt commands, so I can execute my dbt commands down right here. So the first thing that I wanna do is install my package, and I’ve actually set that up already here. So I can see that I have a packages.yml file where I call my github Fivetran package and also a Spark package that has some useful utilities. I can add as many packages as I want here. And then I want to go and run a command called dbt deps, which I’ve already done here. So dbt deps basically just pulls everything down from the packages that I’ve installed so that they can run in my project. So now that I’ve got my packages installed, the next thing that I wanna do is execute a dbt run command. So what does dbt run do? dbt run compiles all of my dbt code into pure SQL, wraps all of that in the DDL and the DML and the other things that I need to run that SQL in Delta Lake or my database of choice, and then it runs that code as pure SQL.
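As a concrete, hypothetical example of that compile step, the sketch below shows a tiny model and, in the trailing comment, roughly the statement dbt would execute against Delta Lake after rendering the Jinja. The model, source and schema names are assumptions for illustration:

```sql
-- models/issue_counts.sql (hypothetical; assumes a `github` source is declared
-- in a sources .yml file)
{{ config(materialized='table') }}

select state, count(*) as issue_count
from {{ source('github', 'issue') }}
group by state

-- After `dbt run` renders the Jinja, the statement executed on Delta Lake is roughly:
--
--   create or replace table dbt_adeora.issue_counts
--   using delta
--   as select state, count(*) as issue_count
--      from github.issue
--      group by state
```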
So let’s go ahead and kind of cheat a little bit, go to the end, and see exactly what that package is doing when I run dbt run. So one of the great things about dbt is this lineage graph, and the DAG that it represents behind the scenes is auto-generated by the references between data models that I’ve incorporated in my code. So I get this just by running my project and my package, and the relationships represented here are all enforced every time I run my project. So in the green boxes here, I have the source tables that were pulled into Delta Lake by Fivetran. And then all of these blue boxes represent data models that have been made using the Fivetran package. So one cool thing about dbt is that I can use a wide range of materialization strategies. So for example, these staging models right here, these are a way station on the way to getting my metrics. These aren’t things that I’d want to show the rest of my team. So these are materialized as ephemeral models, meaning basically they’re just queries that I can model off of. So these won’t show up in Delta Lake. Then I have this set of models here, and this is where I actually get to providing my metrics. So these will be materialized as tables in Delta Lake, and they provide a set of daily, monthly, quarterly and weekly metrics. So I can also access some documentation from this screen. Our friends at Fivetran were kind enough to provide some documentation in the package that runs with the project. So I have a description of my table, I have column descriptions and I can even see the SQL down here that generated this model. So now I can go here to Databricks and I can see that I have a schema called dbt_adeora. This is my personal development schema that I’ve asked dbt to set up for me so that I can do some testing and development. And I have this set of tables here that are now accessible on Delta Lake and through all the tools that I have on Databricks. So now I’m gonna go back to dbt. So one thing that we mentioned about the ELT workflow is that while the EL part is very automatable, the T part is just fundamentally not as automatable. Every single company is going to have a really distinct set of measures and metrics and a way that they wanna transform their data, and that’s just gonna be hard for any kind of automated process to anticipate. So the package is a really good way to get started and to get to a really quick insight, but you’re almost always going to want to go beyond a package and build your own data models. So you can do this either by branching off of or building on top of the models in the package, or, most commonly, users end up creating their own custom models, just leveraging a lot of the code that’s in the package to give them a head start. So let’s walk through a really quick example of how we can actually build onto the package and add our own new data models on top of it. So going back to our documentation, we can see that we have a set of github daily metrics, and then that data model feeds a set of aggregations at the monthly, weekly and quarterly level. So let’s say that we wanted to add to our project a set of models where the data are aggregated annually. So let’s go back to dbt and do that. So the first thing I’m gonna do is create a new data model.
So I’m gonna add a file here in my models folder. I’m gonna call this github_annual_metrics, and I’m gonna put a .sql extension on it so dbt knows that it’s a data model. And then I already have the logic here from my model that makes my quarterly metrics, so I’m gonna leverage this SQL by just copying it, and I’m gonna go ahead and paste that here in my new data model. So all I need to do is change this SQL code so that I aggregate data annually instead of quarterly. And I’m gonna save that, and then I’m going to run my SQL to see what I get. So what’s really great is that now this is part of my dbt project. So every time that I run my dbt project, the models that I made in the package run and this model runs along with it. So this is just included in all of the work that I have. So I can see my results here. I have this aggregated by year, and this also will be materialized on Delta Lake the next time I type dbt run. So now that I’ve got my data transformed into a usable state and it’s sitting on Delta Lake, anyone in my business can really quickly use all of the tools in the new SQL workspace in Databricks to create visualizations and dashboards. So I’m just gonna go here, and I can see I have my quarterly github metrics. I can add a new visualization here, just using Redash. So I can name this chart whatever I want it to be, pull requests by quarter, and then I can choose how I want my x-axis and y-axis to show up. So for example, I have pull requests by quarter, and then just over the last month. And then I made this kind of fun visualization that just has the most commonly used words in our pull request and issue text. So this is just an example of some of the really easy visualizations that you can make in Redash. So now that I have my end-to-end workflow, I wanna automate it. So I’ve already automated the Fivetran ingestion step, and once I’m happy with my dbt project I can also set that to run on a schedule. So I’m gonna go back to dbt cloud and do that. So I’m gonna go here to dbt jobs, I’m gonna schedule a new job here, and I can set this up in the environment of my choice and run any commands that I need. And I can schedule this on specific days of the week, to run every hour, to run every Tuesday at 5:00 a.m., or on any schedule that I can write as a cron job. The nice thing about the workflows in both Fivetran and Databricks is that they’re idempotent, meaning that you get the same thing every time you run them; you don’t have to go and delete your tables after you finish or anything like that. And then you can set up your tables to auto-update in the SQL analytics space in Databricks, so your dashboards can update automatically. So set this up once and then you’re good to go.
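The demo’s actual SQL isn’t reproduced in the transcript, but the custom annual model described above would look roughly like the sketch below. The column names are assumptions about what the package’s github_daily_metrics model exposes, not the package’s real columns:

```sql
-- models/github_annual_metrics.sql (illustrative sketch, not the demo's exact code)
{{ config(materialized='table') }}

select
    year(day)                  as metric_year,    -- assumes a `day` date column
    sum(number_prs_opened)     as prs_opened,     -- assumed daily count columns
    sum(number_issues_opened)  as issues_opened
from {{ ref('github_daily_metrics') }}            -- the package's daily metrics model
group by year(day)
```

Once a model like this is materialized as a table by the scheduled dbt job, the Redash chart on the dashboard is just a plain SQL query against that table in the Databricks SQL workspace.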

– All right, that was fantastic Amy. Thanks so much for running us through that demo. I think it, you know, showed a pretty easy flow from one tool to the next and how easy it is to get up and running with a modern BI analytics platform using Fivetran, Databricks and dbt together. And if you wanna learn more, folks, please go to fivetran.com, or getdbt.com if you wanna learn more about dbt. Thank you very much.


 
About Amy Deora

Fishtown Analytics

Amy has more than 20 years of experience in data analytics and data science, as a hands-on practitioner and a leader of data science and analytics teams. As Head of Alliances at Fishtown Analytics, Amy builds partnerships between dbt and the other technologies in the modern data stack, and with the consulting firms implementing analytics engineering practices with dbt.

Prior to joining Fishtown, Amy spent ten years at Summit Consulting, where she led the Applied Statistics and Economics Practice, which provided risk analytics and data science consulting to financial institutions and Federal Agencies. Amy specialized in issues related to affordable housing finance, and statistical detection of lending discrimination. Amy then founded the Public Sector Analytics Practice at Civis Analytics, a data science software and services company, bringing behavioral data science practices from the private sector and political campaigns to innovative uses in government. Amy lives in Washington, DC, and holds a Master in City Planning from the Massachusetts Institute of Technology.