You Can Do It in SQL

May 26, 2021 03:50 PM (PT)

Download Slides

Just because you can, doesn’t mean you should. But in this case, you definitely should! Learn how this one weird trick (Jinja templating) will supercharge your analytics workflows and help you do more, better, faster with SQL.


 

Transcript

Speaker 1: Hi, everyone. Today I want to try to convince you that you can do it in SQL. My name is Drew Banin and I’m the chief product officer and one of the co-founders at a company called Fishtown Analytics. I’ve had the tagline Mo data Mo problems on my LinkedIn profile for seven or eight years now. And every year that goes by, I think it’s more and more true. If you want to talk to me about SQL or anything else, you can find me on Twitter @drewbanin. So really quick on Fishtown Analytics. We’re the company that maintains DBT, which is open core data modeling software that provides a SQL first workflow for data modeling, data testing, data documentation and more.
You can see the screenshot here is sort of data documentation that’s built into DBT. And so if you like the stock and you’re into the idea of SQL, you should definitely check us out and get dbt.com. Just a couple of confessions and disclaimers before we get started, I have never written a line of production scholar code in my life, but I have written a whole lot of SQL over the past few years. So there’s not going to be any sort of side-by-side Scala data frame versus SQL comparisons here. In general, I think the people who tolerate Python packaging, don’t get to complain about the JVM, so to each their own.
I just love SQL. I fell in love with SQL back in 2014. And up until that point, my experiences with SQL are pretty limited to application development with my SQL as a backing database. I didn’t really understand SQL and I was a little bit afraid of it and it felt like there was a lot that I didn’t know and I couldn’t understand why some of the queries that I ran were pretty fast and others never actually returned. So that all changed back in 2014 when I started doing more analytics, I hadn’t been building an event tracking pipeline and got into sort of AB testing and did a lot of SQL-based analytics using Amazon Redshift back then. So that changed everything for me. And it’s opened up the world of SQL to me. Since then it’s been a powerful tool in my tool belt, and I’ve seen a lot of other folks grow and do really impactful work via the entry point of SQL.
So here’s some of the things that I love most about SQL. For one it’s expressive. SQL reads more or less like a real sentence. And so it makes it really easy to get started with a select statement. It’s a very deep well that you can go down in SQL and there’s a lot to master. We’ll talk about some of those things later today, but at the highest level, if you can read one SQL query, you can read most SQL queries. And I think that’s a beautiful thing. SQL is just so accessible. Lots of people can learn it pretty quickly and I think that’s part of why it’s had this multi-decade long staying power in this sort of data and analytics space. The barrier entry is fairly low for SQL and because lots of people can write SQL and read SQL, it creates these integration points where people can collaborate around SQL for sort of data use cases.
SQL’s universal. Most databases speaks SQL with pretty consistent support across them. I think that if you’ve kind of got a handle on the SQL 99 spec, it’s going to cover 90-ish percent of what you’d see in most databases that speak SQL. And what that means is that whether you’re working with a newer data warehouse or some other tools, you’re going to be able to transfer that knowledge that you have about SQL into whatever database that you’re using. And I think that a lot of this expressiveness and accessibility and universality of SQL really comes from the fact that it’s decorative. So there’s no moving parts. There are kind of limited opportunities for the bad kind of magic that you sometimes see in programming languages and recall, I’ve written a lot of Python in my day, so I know magic when I see it.
Because it’s decorative, you get to just express the dataset that you want to see and you leave the heavy lifting and sort of the data access patterns to the database, which in most cases, these databases can really optimize your queries to get get the data back to you faster for whatever you want to use the data for. Finally, SQL is thought provoking. A lot of people have opinions on where the commas go or indentation or capitalization. Common table expressions versus subqueries. I like that there is this sort of vigorous debate around SQL. I think it makes for a fun environment in which to work and collaborate. So I just want to take a little bit of time and talk about some of the things that you can do in SQL that you might not be familiar with if you haven’t spent a lot of time there, or if you’ve only scratched the surface to SQL.
I can tell you that while SQL is a decorative and sort of straightforward language, it is very, very powerful and it’s pretty rare that I encounter a problem I can’t solve in SQL. At least in the analytics and BI space. So if you’ve written off SQL already, or if you haven’t dug in, super deep yet, hopefully you can take some things away from this next section, and you’ll give it another chance. So the first technique we’re going to talk about is a pivoting columns. So when we talk about pivoting columns, the big idea is that we want to take rows of data and translate them into columns data. We would do that to derive new metrics from the existing rows in the table. So we can look at an example, this is an events table, and so there’s a user who did an event at a certain time. That’s an events table. And if we look at it, the sign-up events and purchase events for a couple or really a few different users.
So we have users one, two, and three, and depending on the user, they’ve either signed up and purchased or only signed up. The big idea is we want to translate this table into a different table that has one record per user, and the date that that user signed up and their number of purchases. So we can do that in CQL using this sort of pivot design pattern. So here’s the code that would do that for you in SQL. It’s a really great design pattern. I employ it all the time. And the big idea is that you want a couple together an aggregate function with an expression. So in this case, the expression is a case when statement and the aggregate function is the minimum or min function. So what we’re doing here is we’re saying if the event is a signup event, then we want to take the happened at timestamp. Otherwise, give us a no value.
And so what this effectively means is if there are no events named signup, then you’re going to get a null value, and the minimum of a bunch of null values is going to be a null. So if you’ve never had a signup event, you will have a no sign up date. Versus if you have had a signup event, then we’ll take the minimum happened at timestamp for those events. We can see a similar example below for purchases. Here we’re doing kind of the same thing, but instead of a minimum, we’re going to do a sum. And so basically it’s just counting the number of events that [inaudible]. So this is one of those examples of SQL being really composable. Maybe you’ve seen mid or some functions before. Maybe you’ve seen case when expressions, but by combining them together, we can kind of implement this new pattern and do some pretty cool things with SQL.
So the second technique we’re going to talk about is appending datasets. So in this example, we’re going to think about unioning a couple of tables together. And so the reason you want to do this is to combine different tables with similar Schemas into a single table that’s sort of taller and has more rows in it. So it’s a bit of a concatenation that we want to do. So here’s an example in which you have two tables, one for web events and one for mobile events. And maybe if the person tracking these events would have asked you, they would have been a single table, but maybe they didn’t ask you and so you have two different tables that you want to concatenate together that would look like all events. So for the web events we see sign up and purchase. For the mobile events we see app install and screen view, just as an example.
So if you want to concatenate these two different tables into a single table that looks kind of more like this, one of the things that would be really helpful is if we can add another column called source that kind of indicates the provenance of these events. So in this case, the yellow rows are the sort of mobile events that came from the mobile events table. The blue records are the web events that came from the web events table. So here’s how you do that in SQL. There’s a really simple operator, it’s called union all, and you can use it to concatenate two different queries together that share a schema. And so the thing that kind of makes this feel like a design pattern is this idea of adding new columns to the expression to indicate the provenance for each part of the union.
And so you don’t have to scratch your head and think about, “Was this a mobile event or was it a web event?” You can just, if it was from the web or from the mobile app. This is also a really powerful way if you have two tables with similar, but not congruent schemas to sort of create new columns or Mitt columns to make sure that the table schemas are congruent and that they could be unioned together. In practice, SQL has a couple other set operators for set differences or set intersections that can be really useful for different kinds of use cases. So the last technique that we’re going to talk about is window functions. If you haven’t seen window functions before, the big idea is that they apply transformations over just kind of a part of a table.
Why would you want to do that? Well, there’s a million reasons why window functions are very helpful. They’re kind of aggregates, but you get to kind of group by different slices of the table or what we’ll call partitions, and turns out there’s a lot of reasons why that can be really helpful. So we’ll take an example. In this case, we’ve got two different users with kind of two different sets of page views. One looked at the product page and signup page. One only looked at, we say, the app page. And there’s also a user touches table that we want to kind of create. So in this user touches table, we want to show the first page path that the user visited, and this is going to be sort of where the user landed in your app.
And so this could be helpful if you’re doing marketing attribution or sort of a landing page analysis. You want to understand where did the user come from on their first visit to the site, like first touch analytics, for instance. So we can do this very, very easily with SQL. And if you don’t have window functions at your disposal, it could actually be pretty hard to do a SQL. So here’s what this looks like. There’s a special window function called first value. And anytime you use a window function, you have the option to specify a partition clause and an order by clause. So our partition clause is going to say, give me the first value for each page path, but only within each group of users. So you don’t want the first page path ever viewed by any user, you just want the first page path viewed for each individual user.
And then how do we know what the first value is? We have to tell SQL how to order the events and so we do that with this border by clause. So in this query, if you run it, we’ll give you this table on the right hand side. It’s a very, very quick way to get the first value out of an event stream. And as we see, there’s also other window functions like the last value or lag and lead, we’ll give you the previous next value. So a lot of these window functions are really, really helpful. The one that I’d want to call out is this row number window function, which helps you kind of assign an index to a bunch of rows. It’s a very good way to quickly deduplicate large data sets. So instead of doing a distinct over an entire dataset, you can kind of give a row number to the different rows partitions, say by like an ID, and then you can select out only the rows with the row number one.
So in practice, that’s a very, very quick and efficient way to deduplicate large datasets, just one of the very many use cases for window functions out there. So I’d be doing you a disservice if I didn’t tell you that there can be some downsides to writing a lot of SQL. SQL is certainly a straightforward language. It makes it easy to pick up. It has a lot of depth to it, but it can lead to scaling challenges if you’re not careful. So the bad parts of SQL, well, as a simple language, it doesn’t really have any notion of functions for code reuse. And so that might either lead to you doing a lot of typing, which could lead to some typos, some sort of syntax errors, things like that, logic errors even. And it might also lead to a lot of copying and pasting where if maybe you’re less familiar with SQL or your colleagues are less familiar with SQL, you just kind of grab some code that worked from one place you copy and paste it over to another place.
This presents a real challenge because anytime that you’re copying and pasting logic or writing out a lot of individual lines of code, there exist the opportunity for business logic to diverge between different parts of the code base. So this is something that SQL actually makes it a little bit challenging to manage, but there is fortunately a really good answer, which is that we can add template. So what does that look like? Here’s an example of some code. This is not standard SQL. This is something you would layer on on top of people. And in this particular example, we’re going to take a page out of a playbook from kind of web development in which you might template out HTML web pages. So what we’re looking at here is a language called Jinja, and it’s a templating language. And this code that you’re looking at would actually run in DBT if that’s something that you’re interested in.
So the big idea is that you can enumerate a list of events. So in this case, page view, sign up and purchase, and then you can loop over those events and generate a valid SQL query. So this is what that generated query might look like. And you can see a totally valid, this actually implements the sort of pivot example that we were looking at a little bit earlier. And so if we go back to the source code, you can see it’s pretty easy to add a new event to the list online too, and you don’t have to touch any of the other code. It all just kind of works. So really cool. One of the ways to help with scaling SQL is to consider templating your SQL. And if you want to go farther down this path, you can actually use things called macros that are kind of reusable, almost functions that return SQL and that’s a great way to abstract out common logic and prevent the need for you to sort of repeat the same code over and over again.
So I would argue that for most problems you might want to solve in data, you can do it in SQL. But just because you can, should you? Well, I’d say yes, I think you should do it in SQL if you can. One of the big reasons is that SQL is just so dang accessible and by writing code in accessible way, you can bring more people into the analytics process. So whether you’re a data engineer or a data scientist or an analyst, you can all collaborate on code together, it helps kind of bring other people into the process. And as a really big added benefit, non-engineers are empowered to understand this business logic in a way that isn’t so true if the code was written in Python or Scala or something like that. It turns out because SQL so accessible, you can get other folks involved in sort of self-serving the answers to their questions. Like how is this column defined or where does this data come from? That’s a really powerful thing.
It sort of removes us as the bottlenecks from the communication process by kind of giving the access to the data or really the business logic to all of the different stakeholders that we work with. So ultimately more people can contribute in more varied and interesting ways and that tends to be a really good thing. So casting our minds to the future, what do we think the future holds around SQL? Well, I believe that more data workloads are going to converge around SQL. I think that we’re already seeing that this is the case. Databases are steadily improving their support for more sophisticated SQL operations like AI and ML type tasks and streaming analytics and maybe some other things that we’re not so attuned to yet that I think might be on the horizon. I think as a result of these workloads, moving to SQL and improved support from databases, we’re going to see that more roles will require SQL knowledge and aptitude for people operating at their most efficient and their most capable.
And ultimately, if we’re going to have more workloads and SQL and more sophisticated SQL operations and more people using SQL, it’s also going to necessitate that there are more and better tools to aid in the writing and managing the SQL. I think that’s really exciting. I think SQL is a great language and I’ve seen it sort of transform my own life and career and I’ve seen it do the same to many others. So here’s to SQL. And at this point, I’m kind of curious what you think. Do you love SQL? Do you hate SQL you think you want to give it another shot? Are there any of your favorite tips or tricks that I might’ve missed? I’d love to hear about them. Thank you so much for sticking around and learning about SQL with me. Hope you enjoy the rest of the conference.

Drew Banin

Drew Banin

Drew is a co-founder at Fishtown Analytics and an open source maintainer of dbt (data build tool). He’s built event collection systems that scaled to billions of events per month, implemented Markov...
Read more