Advanced SQL For Data Scientists

May 27, 2021 04:25 PM (PT)


I will begin with a brief overview of SQL. Then I will cover the five major topics a data scientist should understand when working with relational databases: basic statistics in SQL, data preparation in SQL, advanced filtering and data aggregation, window functions, and preparing data for use with analytics tools.

In this session watch:
Jean Joseph, Data Engineer, Allied World


Transcript

Speaker 1: Welcome to Data + AI Summit 2021. Our topic for today is Advanced SQL For Data Scientists. By the way, I am Jean Joseph. I’ve been in IT for over 18 years. I started as a network administrator, then I became a database administrator and developer, and after that I moved into data engineering. This is my blog, this is my email, and I’m on Twitter. Feel free to follow me and connect with me on LinkedIn. I am located in New Jersey, originally from Haiti.
Our overview for today is a brief introduction to SQL, data preparation in SQL, preparing data for use, the five major things to know in an RDBMS, advanced filtering in SQL, and key notes.
First of all, we cannot talk about SQL without introducing the RDBMS, which stands for Relational Database Management System, where you will be dealing with tabular data: you will see rows and columns. You will be able to create tons of objects like tables, views, synonyms, functions, and procedures. There are two design approaches: normalization and denormalization. Normalization is where you will find less duplicate data; that is what you will typically see in OLTP systems, where OLTP stands for Online Transactional Processing. As for denormalization, a data warehouse is what is normally denormalized, and there you will be dealing with a lot of duplicate data. One thing you need to know is that RDBMS systems maintain consistency using ACID (Atomicity, Consistency, Isolation, Durability).
Okay. What is SQL, guys? SQL is the language that allows us to communicate with the database. It stands for Structured Query Language. It allows you to control, create, and modify objects, and to manipulate data. But what can we really do with SQL? In order to master SQL, you need to understand the data sublanguages. The data sublanguages are what will help you master SQL.
The first data sublanguage you need to understand is DDL, which stands for Data Definition Language. It allows you to create, drop, truncate, alter, comment on, and rename objects. The second one is DQL, which stands for Data Query Language. It allows you to retrieve data from the database. The third one is DML, which stands for Data Manipulation Language, where you will be able to insert, update, and delete. The fourth one is DCL, which stands for Data Control Language. It allows you to grant or revoke access. And the fifth one is TCL, which stands for Transaction Control Language. It allows you to control transaction behavior yourself: instead of using SQL Server’s default behavior, you can override it, because sometimes one transaction may depend on other transactions. You can even set savepoints.
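To make that concrete, here is a minimal TCL sketch on SQL Server; the Accounts table and the amounts are hypothetical:

-- Hypothetical example: two updates in one transaction, with a savepoint.
BEGIN TRANSACTION;

UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;  -- debit

SAVE TRANSACTION AfterDebit;  -- a savepoint we can roll back to

UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2;  -- credit

-- If the credit step went wrong, undo only that step:
-- ROLLBACK TRANSACTION AfterDebit;

COMMIT TRANSACTION;  -- make both changes permanent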
However, when you add DQL together with DML, you get what’s called CRUD. C is for create, which comes from DML: inserting records. R stands for read, which comes from the Data Query Language: SELECT. U is update and D is delete, and both of them belong to the Data Manipulation Language. So as data scientists, should we worry about the definition and control languages? No. Where should we focus? You should focus on CRUD, and you have to master it. You also need a deep understanding of TCL. Why SQL for data scientists? Data is the most crucial part of data science, even of business, correct? You can’t have data science without data. And SQL is what allows you to communicate with the database, so you need to be able to master it, right?
This is just a recap. How can you join different tables? You have to think of joins: there’s Inner Join, Left Join, Right Join, and then Full Outer Join. Inner Join will return data only where your condition happens to be true. Left Join will return all the data from the first (left) table, whether the condition is true or not, plus the data from the second table where the condition happens to be true. Right Join works similarly to Left Join, but goes the opposite way: it will return all the data from the second (right) table, plus the data from the first (left) table where the condition happens to be true. And Full Outer Join will return data from both data sets: wherever the condition is true the rows are matched up, and wherever it is false, the missing side comes back as NULL.
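Sketched out against two hypothetical tables, Customers and Orders, the join variants look like this:

-- INNER JOIN: only customers that have at least one matching order.
SELECT c.CustomerID, c.Name, o.OrderID
FROM Customers AS c
INNER JOIN Orders AS o ON o.CustomerID = c.CustomerID;

-- LEFT JOIN: every customer; order columns come back NULL where no match exists.
SELECT c.CustomerID, c.Name, o.OrderID
FROM Customers AS c
LEFT JOIN Orders AS o ON o.CustomerID = c.CustomerID;

-- FULL OUTER JOIN: all rows from both sides, matched up where possible.
SELECT c.CustomerID, o.OrderID
FROM Customers AS c
FULL OUTER JOIN Orders AS o ON o.CustomerID = c.CustomerID;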
There’s Cross Join where, say you have two tables, Table A and Table B, every row of the first table is joined to every row of the second table: a Cartesian product. Sometimes you can use it to replace other constructs. Besides that, since the demo will be on SQL Server, you need to understand the APPLY operator: CROSS APPLY works like an Inner Join, and OUTER APPLY works like a Left Join. And there’s also PIVOT, which allows you to change the format of the data. Things you need to know: if you want to stack two data sets on top of each other, there are the set operators. UNION ALL is great for that, and UNION ALL will keep duplicates; UNION will not give you duplicates. Then there is EXCEPT: say you have two tables, A and B; EXCEPT will return the rows of A that have no match in B. And INTERSECT will return the overlap between the two tables, right?
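A compact sketch of those set operators and of APPLY; tables A, B, Customers, and Orders are hypothetical:

-- Set operators over two compatible queries:
SELECT x FROM A UNION ALL SELECT x FROM B;  -- stacks both sets, keeps duplicates
SELECT x FROM A UNION     SELECT x FROM B;  -- stacks both sets, removes duplicates
SELECT x FROM A EXCEPT    SELECT x FROM B;  -- rows in A with no match in B
SELECT x FROM A INTERSECT SELECT x FROM B;  -- rows that appear in both A and B

-- CROSS APPLY evaluates a subquery per row, inner-join style;
-- OUTER APPLY would also keep customers with no orders, left-join style.
SELECT c.CustomerID, t.OrderID
FROM Customers AS c
CROSS APPLY (SELECT TOP (3) o.OrderID
             FROM Orders AS o
             WHERE o.CustomerID = c.CustomerID
             ORDER BY o.OrderDate DESC) AS t;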
And then for data preparation, the thing I want you to understand is that data preparation is a pre-processing step, right? One in which data from one or more sources is cleaned and transformed to improve data analysis, correct? But there are stages. The first stage in data preparation is collecting data. Data collection is the technique by which you gather information; it can come from a range of sources like observation, interviews, surveys, questionnaires, or case studies. In our case, it will be a database. During data collection, you have to determine your objective: what exactly are you looking to achieve? You have to determine what your samples are, and what exactly the stakeholders are expecting from you when you collect the data for them. One thing you need to understand: the more effort you put into data collection, the better the data quality you end up with.
The second stage is cleaning data. Cleaning data is where you’re going to spend most of your time. You will need to find all your outliers, missing data, and malicious data; you will need to find inconsistent data and fix formatting. And you want to make sure you bring in clean data, because if you bring in bad data, what you really need to understand is: garbage in, garbage out. It’s always best to look at your data, examine it, and question your results. You need to do that, right?
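As a minimal sketch of what those checks can look like in SQL; the Staging table and its columns are hypothetical:

-- How many rows are missing an email? COUNT(col) skips NULLs.
SELECT COUNT(*) - COUNT(Email) AS MissingEmails
FROM Staging;

-- A crude outlier scan on a numeric column:
SELECT *
FROM Staging
WHERE Amount < 0 OR Amount > 1000000;

-- Duplicate detection:
SELECT Email, COUNT(*) AS DuplicateCount
FROM Staging
GROUP BY Email
HAVING COUNT(*) > 1;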
And then, the third stage is restructuring data. Restructuring data can be defined as organizing data into groupings that provide an efficient way of storing and organizing it in the computer so that it can be used efficiently.
There are a few functions that I want you to understand when it comes to data preparation. There are position functions, transformation functions, character set functions, and there is also the SOUNDEX function. These are all the functions, guys; you can go and do some research on them. This is a two-hour session that I have to bring down to 30 minutes, so I hope you understand. Okay. The reason I put this up very early is that I want you to understand it, because everything else will be demo.
Now, the difference between aggregate and windowing functions. Aggregate functions operate on an entire data set, or table if you will, and you have to use GROUP BY; if you don’t use GROUP BY, it won’t work out. Whereas windowing functions do not cause the rows to become grouped into a single output row; instead, the aggregate value is computed over the window and added to each row. So this is the syntax for windowing functions, right? And these are the types of windowing functions: we have aggregate windowing functions, we have ranking functions, and we have value functions.
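To see the difference side by side, here is a minimal sketch against a hypothetical Sales table:

-- An aggregate collapses the rows: one output row per customer.
SELECT CustomerID, SUM(TotalDue) AS TotalSpent
FROM Sales
GROUP BY CustomerID;

-- A windowing function keeps every row and attaches the aggregate to each.
SELECT SalesOrderID,
       CustomerID,
       TotalDue,
       SUM(TotalDue) OVER (PARTITION BY CustomerID) AS TotalSpent
FROM Sales;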
Let’s go to some demo. Let’s say, for example, you have this data; in a real case, we could have tons of rows. To keep it simple, say the string is “this is a great event” and the goal is to extract the word “great”. The first thing you need to know is where that word is located. This is where CHARINDEX will help you. So, keeping it simple: the expression to find is “great”, and the expression to search is “this is a great event”.
We were able to get the location of “great”, right? Now let’s say the goal is to extract it. So what happens: with the help of SUBSTRING (we don’t have time to go over every parameter), what I’m doing here is passing, first of all, the string, then the starting position, which CHARINDEX gives us, and then the length. We know “great” is five characters, so there’s no need to compute it dynamically; we put five. And if you look here, we got “great”. So let’s go a little more complex. What happens if you have a data set where the business wants you to find any keyword that starts with “big” and ends with “driven.org”? How would you do that? You cannot use CHARINDEX, but instead you can use PATINDEX.
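Here is what that first exercise looks like, sketched out:

-- Where does 'great' start? CHARINDEX returns a 1-based position (11 here).
SELECT CHARINDEX('great', 'this is a great event') AS Position;

-- Extract it: start at that position and take 5 characters ('great' is 5 long).
SELECT SUBSTRING('this is a great event',
                 CHARINDEX('great', 'this is a great event'),
                 5) AS Extracted;  -- returns 'great'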
And then if you look here, “big data” will not be displayed, correct? So this is how you can use PATINDEX: it uses LIKE-style wildcard patterns. What I’m doing here is dumping the data into a table so we can see that data set. I’m using SUBSTRING together with PATINDEX, and then using CHARINDEX and PATINDEX to dynamically locate where the string is. And when I execute this, sure enough, you will see in the second result set that we didn’t bring back “big data” at all, because there’s no match, right? And then let’s say the business has data in this format but wants it in that format; STRING_AGG should be your best friend. As you can see, I use STRING_AGG here: I pass the delimiter, and I was able to do it, right?
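Roughly, those two pieces look like this; the inline VALUES rows stand in for the demo tables:

-- PATINDEX takes LIKE-style wildcards, so it can match a shape, not just a literal.
SELECT s.Phrase,
       PATINDEX('%big%driven.org%', s.Phrase) AS Position  -- 0 means no match
FROM (VALUES ('visit bigdatadriven.org today'),
             ('big data')) AS s(Phrase)
WHERE PATINDEX('%big%driven.org%', s.Phrase) > 0;  -- 'big data' is filtered out

-- STRING_AGG (SQL Server 2017+) goes the other way:
-- it collapses many rows into one delimited string.
SELECT STRING_AGG(v.Name, ', ') AS NameList          -- 'Ann, Ben, Cara'
FROM (VALUES ('Ann'), ('Ben'), ('Cara')) AS v(Name);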
Let’s say, for example, you have this data, and now the business wants it this way. How would you do that? This is where STRING_SPLIT will help you, and it is very simple to use. So what I’m doing here: I’m generating the data into a temp table, then I’m running STRING_SPLIT against that data, and I was able to bring out the values. So if you look at the format here, this is what I was able to provide. This is the thing I want you to know: how easy it is. With very little code, you are able to achieve this.
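A minimal STRING_SPLIT sketch (SQL Server 2016+); the delimited values are made up:

-- One delimited string becomes one row per item (the output column is 'value').
SELECT value AS Item
FROM STRING_SPLIT('red,green,blue', ',');

-- Against a table, pair it with CROSS APPLY to split every row:
SELECT t.RowID, s.value AS Tag
FROM (VALUES (1, 'sql,python'), (2, 'spark')) AS t(RowID, Tags)
CROSS APPLY STRING_SPLIT(t.Tags, ',') AS s;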
Now, when it comes to SQL functions, there are tons of them: there are aggregates, there are analytic functions, there are windowing functions. One thing you need to know about windowing functions is that the default frame is RANGE. RANGE operates through an on-disk worktable in tempdb, so there may be a performance impact, whereas ROWS is processed in memory. I highly suggest you explicitly specify ROWS instead of using the default, right?
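The two frames can be written out explicitly like this, against the same hypothetical Sales table:

-- With a unique ORDER BY key the results match, but RANGE (the default when
-- only ORDER BY is given) spools to a tempdb worktable and treats ties as one
-- group, while ROWS is evaluated in memory row by row.
SELECT SalesOrderID,
       TotalDue,
       SUM(TotalDue) OVER (ORDER BY SalesOrderID
                           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RangeTotal,
       SUM(TotalDue) OVER (ORDER BY SalesOrderID
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RowsTotal
FROM Sales;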
So let’s go and do some demos. Let’s say the business requirement, for example, is to provide the total sales: the business wants to know all sales orders for each year. As you can see, this is the year; I’m using the YEAR function to get that. Then, to calculate the total of the sales orders, I’m using SUM; for the average sales order value, I’m using AVG per year; and for the total number of sales per year, I’m using COUNT. And we also want to know how well the products are selling relative to other years, so I’m using the RANK function, so that the year in which we made the most sales happens to be ranked one, correct? And I have to GROUP BY year and order it by sales.
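Sketched out, assuming an AdventureWorks-style Sales.SalesOrderHeader table, the query looks roughly like this:

SELECT YEAR(OrderDate)                           AS OrderYear,
       SUM(TotalDue)                             AS TotalSales,     -- total per year
       AVG(TotalDue)                             AS AvgOrderValue,  -- average per year
       COUNT(*)                                  AS OrderCount,     -- orders per year
       RANK() OVER (ORDER BY SUM(TotalDue) DESC) AS SalesRank       -- best year = 1
FROM Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;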
When I execute this, this is what I got. The marketing team can quickly see that 2013 was the year we made huge sales. They can quickly re-analyze and ask: do we have fewer customers now? Do we still have those kinds of customers? There are so many things they can go and use that data for. All right. Now let’s say you have to run a daily running total, and the business wants it from a specific date onward; in our case, June 1st, 2014. And we have to order that data. How would you do that? With SQL, even though this is an advanced session, when you look at it, it is very simple.
You see, I’m using an aggregate function: SUM of the total due. I order by Sales Order ID and use UNBOUNDED PRECEDING, and this gives me the running total in ascending order. As you can see, the first sale was $8.78. If you look at the second row, on that same day we sold another item for $33.12; add them together and you have the running total of $41.90, right? So this is how easily you can compute a running total in SQL Server.
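The running-total query, sketched against the same assumed Sales.SalesOrderHeader table:

SELECT SalesOrderID,
       OrderDate,
       TotalDue,
       SUM(TotalDue) OVER (ORDER BY SalesOrderID
                           ROWS UNBOUNDED PRECEDING) AS RunningTotal  -- cumulative sum
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '2014-06-01'   -- the requested start date
ORDER BY SalesOrderID;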
Now let’s say the marketing team, for example, asks you to return the first three orders, with the total order due and the total order count, for every customer that purchased more than 15 times. So the goal is to find all those details. I’m selecting the Customer ID and using COUNT partitioned by Customer ID to give me the total orders per customer, and I’m using ROW_NUMBER partitioned by Customer ID and ordered by Sales Order ID to give me the row number. Because in a windowing function I cannot filter on the result directly in the WHERE clause, I have to use a subquery; in my case, a CTE, to bring that result. So what I’m doing here: I’m selecting from that CTE where the row number is less than or equal to three and the total orders are greater than 15. And sure enough, when I execute it, you will see the first result set: one customer purchased 17 times from us, and we were able to bring back only the first three of their purchases. And if you look at the third example, that customer purchased 28 times from us, and we were able to bring back the first three times they purchased from us.
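A sketch of that query (window functions can’t appear in WHERE, hence the CTE), again over the assumed Sales.SalesOrderHeader table:

WITH NumberedOrders AS (
    SELECT CustomerID,
           SalesOrderID,
           TotalDue,
           COUNT(*)     OVER (PARTITION BY CustomerID)                       AS TotalOrders,
           ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY SalesOrderID) AS RowNum
    FROM Sales.SalesOrderHeader
)
SELECT CustomerID, SalesOrderID, TotalDue, TotalOrders
FROM NumberedOrders
WHERE RowNum <= 3          -- only each customer's first three orders
  AND TotalOrders > 15;    -- only customers with more than 15 orders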
Let’s say there is another requirement where the business asks you to return all orders from the SalesOrderHeader table for any customer who had over 10,000 in purchases across their first three orders, right? This is kind of similar, but the thing is, they want me to return all the order details. So, looking at this, we have to find the first three orders per customer, because again I cannot filter on the window function directly, and then we have to aggregate those first three orders. You see, I declare the first CTE to give me the row number, and then, using that CTE, I declare another CTE within the same query and filter it where the row number is less than or equal to three. And because they want me to include all the order details, I need to join this result back to the original table to bring in the full order transactions, right?
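Sketched out with chained CTEs over the same assumed table:

WITH NumberedOrders AS (
    SELECT CustomerID,
           SalesOrderID,
           TotalDue,
           ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY SalesOrderID) AS RowNum
    FROM Sales.SalesOrderHeader
),
BigSpenders AS (
    SELECT CustomerID
    FROM NumberedOrders
    WHERE RowNum <= 3              -- each customer's first three orders...
    GROUP BY CustomerID
    HAVING SUM(TotalDue) > 10000   -- ...totaling over 10,000
)
SELECT h.*                         -- join back for the full order details
FROM Sales.SalesOrderHeader AS h
INNER JOIN BigSpenders AS b ON b.CustomerID = h.CustomerID;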
So when I execute this, if you look at this one, you will see that if you add up the first three purchases from this result set, it comes to $10,593, right? And if you go to the second customer, it’s still the same story. The marketing team can use that data to go and run promotions or more complex operations. And with SQL, that’s easy: you’re dealing with less code, but you get better performance as well. Now let’s say the business wants you to find the first and the last close price for each month of every year, including every single detail row. How would you do that? This is where you have to think of FIRST_VALUE and LAST_VALUE, right?
What I’m doing here is partitioning by month and year and ordering by date, and I’m doing the same thing for the last value. When I execute this query, I was able to provide the first close price and the last close price. If you analyze the data for every date within a month, you will see they stay the same, right? We were able to bring back the first and the last close price.
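Sketched out against a hypothetical StockPrices table with TradeDate and ClosePrice columns:

-- LAST_VALUE needs the full frame spelled out, or it stops at the current row.
SELECT TradeDate,
       ClosePrice,
       FIRST_VALUE(ClosePrice) OVER (
           PARTITION BY YEAR(TradeDate), MONTH(TradeDate)
           ORDER BY TradeDate
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS FirstClose,
       LAST_VALUE(ClosePrice) OVER (
           PARTITION BY YEAR(TradeDate), MONTH(TradeDate)
           ORDER BY TradeDate
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastClose
FROM StockPrices;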
So let’s say you have this data, and the business goal is for it to look this way, right? It needs to be calculated for each month, with the subtotal for each year, by region and territory. How would you do that? This is where you have to think in a different way, right? We will use a CASE statement to make the magic happen. So what I’m doing here, as you can see; there is no time, guys. Since I already shared the slides, you can go and study the script. But the result set of this query gives you this: for each month we get the month’s total, and for each year we get the total of how much every customer purchased from us; the total for every year and the total for every month, for every customer.
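A condensed sketch of the CASE pivot; the table and column names are made up:

SELECT Region,
       Territory,
       YEAR(OrderDate) AS OrderYear,
       -- one CASE per month turns rows into columns:
       SUM(CASE WHEN MONTH(OrderDate) = 1 THEN TotalDue ELSE 0 END) AS Jan,
       SUM(CASE WHEN MONTH(OrderDate) = 2 THEN TotalDue ELSE 0 END) AS Feb,
       SUM(CASE WHEN MONTH(OrderDate) = 3 THEN TotalDue ELSE 0 END) AS Mar,
       -- ...repeat for the remaining months...
       SUM(TotalDue) AS YearTotal   -- the yearly subtotal
FROM Sales
GROUP BY Region, Territory, YEAR(OrderDate);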
As you can see, the script was not that complex. If you had to use Python or any other language, you can imagine how complex it would be. So, things you need to know: as a data engineer, you have to understand CRUD, understand ACID, understand TCL, and also understand the query optimizer and when to use an index. One thing I always like people to take away: always have a way to handle your exceptions. The key takeaways are that you need a deep understanding of ACID, you need to master CRUD, that is, DQL and DML, and you need a good understanding of when to use these SQL commands. We appreciate your feedback. Okay, don’t forget to rate and review this session. Again, thank you so much for your participation. If you have any questions, feel free to go ahead and ask; I’ll be happy to provide an answer if I can. Thank you.

Jean Joseph

Highly experienced in Big Data and traditional data technologies with a history of successes.

