Scale and Optimize Data Engineering Pipelines with Software Engineering Best Practices: Modularity and Automated Testing


In rapidly changing conditions, many companies build ETL pipelines with an ad-hoc strategy. Such an approach makes automated testing for data reliability almost impossible and leads to ineffective, time-consuming manual ETL monitoring.

Software engineering best practices decouple code dependencies, enable automated testing, and empower engineers to design, deploy, and serve reliable data in a modular manner. As a consequence, the organization can easily reuse and maintain its ETL code base.

In this presentation, we discuss the challenges data engineers face when it comes to data reliability. Furthermore, we demonstrate how software engineering best practices help build code modularity and automated testing for modern data engineering pipelines.

What you’ll learn:
– Software engineering best practices in ETL projects: design patterns and modularity
– Automating ETL tests for data reliability: unit tests, functional tests, and end-to-end tests

Speaker: Qiang Meng

Transcript

– Hello ladies and gentlemen, welcome to this talk on scaling and optimizing data engineering pipelines. This time we will focus more on modularity and automated testing, using software engineering best practices. By the end of the talk, you will have learned how to take your current Airflow project to the next, professional level. This is a high-level session without any demo.

Alright, so my name is Qiang. As you can see from my outfit, I'm a big fan of the fashion industry. Right now I'm a Senior Data Engineer at Levi's, an amazing fashion company with a long history. Before that, I also worked at another great fashion company, the H&M Group in Sweden. So how much do you know about Levi's? In 1873, Levi's produced the world's first blue jeans to meet miners' needs for harder-wearing gear, a design that lives on as the 501. Today Levi's is a global company present in 110 countries. We have a rich heritage of 167 years of creating some of the world's most loved, iconic apparel, and we look to our future by putting our customers' needs firmly at the center of everything we do. But we are more than just a fashion company. Despite the challenges this year, Levi's data and AI acceleration has contributed to a profitable fiscal quarter. We have increased the footprint of our online business by more than 50%, and it now accounts for 25% of the whole company's revenue. Our data now spans the entire fashion chain, from product design to the customer experience. More than ten AI products have been released into production in a very short time. So how did we achieve this, especially in data engineering?

Today my talk has three parts. First, we will set up a little hypothesis: say you already have an Airflow project running well. Then, what can we do to optimize and scale it to the next level? And finally, we will talk a little about automated testing for your pipelines.

All right, so now let's move to the first part, the hypothesis. Let's say you already have a great data engineering project running for your company or your team. It is built in the cloud on Airflow and processes TB-level data every day, in batch. You fetch your data from different sources, such as cloud services and APIs, or even an on-premise data warehouse your company built decades ago. Your Airflow runs on some cloud provider service such as Azure, VMware, AWS EC2, and so on. In the end you load your data into a data warehouse or data lake, depending on your clients' needs for the data.

As part of the hypothesis, your project already has a basic setup to make sure performance, code quality, and data quality are good. You have a proper set of development environments: local, dev, pre-production, and production. You also have sample data in local and dev, so you don't need to run the ETL against a whole database every time just to check whether your work is correct. You have GitHub set up for your code, good naming conventions for your pipelines, your data, and your schemas, and Docker and Kubernetes running wherever necessary. Your Spark cluster is running well, whether it is an AWS EMR cluster, Databricks, and so on.
You also have a great CI/CD process, and you have probably already set up data versioning, so if by chance your team makes a mistake with some data, you will always be able to load it back; nothing is lost. But what we are focusing on today is more about the code: how is your code quality, and how do you maintain, test, and extend your code?

Many teams, in the beginning, set up their Airflow project with a somewhat complex structure. For example, you have your airflow folder that contains your DAGs and operators; that's fine. Then you probably have some utils or modules folders for functionality that is common across the project. Some of your DAGs do the same things all the time, some of your operators do the same things all the time, and you don't want that code repeated everywhere, so you extract it and put it into those folders as functions, classes, and methods. That's really good. You might also have a documentation folder holding a data catalog or technical documentation for your functions and operators, and a tests folder as well. The goal for us today is that, in the end, you keep it as simple as just the airflow folder and a couple of others; everything shown here in red should disappear. Disappearing here means decoupling: slimming down your Airflow project and finding a smarter way to automate the process and test it well. So how can we do that?

Well, now let's have some fun with the optimization part. Before we go into optimization, I want to do a quick review of the professional ETL procedure that most of the data industry is using right now. Suppose you have a data lake with different zones: a Bronze zone, a Silver zone, and a Gold zone. You have two groups of important clients: data scientists and business analysts. They use your data in different ways, because data scientists, most of the time, just take some data and train on the same dataset again and again, while business analysts query the system very regularly.

So you have some data source over there, and you want to have the data here. The first step is to copy the source data as-is: you just take the raw data and back it up in your own system. Next time, if you want to reprocess the same data, you don't need to hit the source system anymore, because the source system is often very sensitive; if everybody pulls data from it, it can easily crash. Once you have your data copied into the Bronze zone, you can move on to the Silver zone. What does the Silver zone do? It cleans up the data a little. Source data can have a lot of issues: missing primary keys, a lot of duplication, no partition columns that could speed up your query performance. You also need to do some transformation: maybe the source table has ten columns but you only need five, and some of those five are derived from other columns. All of this can be done in the Silver zone.
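For illustration, a minimal PySpark sketch of such a Bronze-to-Silver cleaning step might look like this. The paths, column names, and partition key are assumptions for the example, not the exact pipeline described in the talk.

```python
from pyspark.sql import SparkSession, functions as F

# Sketch of a Bronze-to-Silver cleaning step; paths and columns are illustrative.
spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

bronze = spark.read.parquet("s3://datalake/bronze/sales/")

silver = (
    bronze
    .dropDuplicates(["order_id"])                      # remove duplicated rows
    .filter(F.col("order_id").isNotNull())             # drop rows missing the primary key
    .select("order_id", "customer_id", "product_id",
            "quantity", "order_ts")                    # keep only the columns clients need
    .withColumn("order_date", F.to_date("order_ts"))   # derive a partition column
)

# Write partitioned by date so downstream queries can prune partitions.
silver.write.mode("overwrite").partitionBy("order_date") \
      .parquet("s3://datalake/silver/sales/")
```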
Once you have your Silver data, in the retail industry that will mostly be dimension tables and fact tables, and then your clients will say: "I want a table that already joins my customer information, my product information, and my sales information, so I don't need to join it myself; I just load the data and train my models." In that case you will probably need to aggregate tables from your Silver zone and load the results into the Gold zone. There are of course different ways of doing professional ETL nowadays, but this is a typical pattern that many of the big companies are using.

After reviewing the standard way of doing ETL, we can take a closer look at the next topic: building Airflow DAGs. Why is this important? If you don't know the professional way of doing ETL, junior data engineering teams might create crazy Airflow DAGs like the one I show here. That is not the way to go. So what is an Airflow DAG? A DAG is just a template of tasks for your pipeline to follow, and it should be built around data feeds, which means that each data feed you want to fetch gets its own DAG. You don't want one DAG that triggers the whole pipeline for every dataset and runs everything at the same time. That is crazy; it is not maintainable and it is not testable.

If you follow the standard way of doing ETL, your Airflow DAG can be as simple as four to five tasks. The first task fetches some raw data; the second cleans and transforms your data; after that you validate your data before loading it into your data lake or data warehouse. And in the end, please don't forget to collect some metadata. What is metadata? Every day you process data, but how many rows did you collect, and at what time? Why is this important? It is a very important KPI for the team and the company to guarantee data quality. At the beginning you might have one more task, a skip task, which lets you control parts of your pipeline. For example, your pipeline runs twice a day: in the morning you need to fetch data from the USA and China, and in the afternoon you only need to fetch the data from China. In that case a skip task helps you say: in the afternoon I don't need to go to the USA anymore, I just get the data for China.

Do not do any actual ETL work in the Airflow DAG itself. What does that mean? The DAG only builds the template of what your pipeline should do, which means the DAG code contains nothing like extract, transform, or load logic. So where should the extract, transform, and load code go? In the Airflow operators. Operators are the functions and classes that do the real work. For example, I might launch a Spark cluster: that's an operator. I might need to remove some data: that's an operator. In the operator classes you spell out, step by step, how to load my data, how to clean my data, how to transform my data. Operators are also a good place to collect common behavior across your DAGs, so you don't need to repeat your code everywhere.
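As a rough sketch of such a feed-level DAG: the DAG file only wires tasks together and contains no ETL logic. The custom operator classes and the my_etl_operators package are assumptions for illustration.

```python
from datetime import datetime
from airflow import DAG

# Hypothetical custom operators: they live in a separate package and hold all of
# the real extract/transform/load logic, so the DAG file stays a pure template.
from my_etl_operators import (
    SkipOperator, FetchOperator, TransformOperator,
    ValidateOperator, LoadOperator, CollectMetadataOperator,
)

with DAG(
    dag_id="sales_feed",                 # one DAG per data feed
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6,14 * * *",    # this feed runs twice a day
    catchup=False,
) as dag:
    skip = SkipOperator(task_id="skip_regions_not_needed")
    fetch = FetchOperator(task_id="fetch_raw_data")
    transform = TransformOperator(task_id="clean_and_transform")
    validate = ValidateOperator(task_id="validate_data")
    load = LoadOperator(task_id="load_to_lake")
    metadata = CollectMetadataOperator(task_id="collect_metadata")

    skip >> fetch >> transform >> validate >> load >> metadata
```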
Say I get two datasets from different source systems, but the transformation is more or less the same: then the two DAGs can just call the same operator, and I don't need to repeat anything. That is the benefit of operators. But here is a pitfall that many junior data engineering teams run into: Airflow ships with some default operators, and a very famous one is the PythonOperator. Some teams open the source of that default operator, as provided by Airflow, and start customizing it directly. Please never do that. Why? Because Airflow will release new versions later on, and if you have customized the source code inside the package, you carry that dependency every time you need to upgrade your project. What we strongly recommend is: don't touch the default operators. Whenever you want to enrich or customize an operator, create something new of your own and give it a good name, as sketched below. Then, when a new Airflow version is released, you can upgrade very quickly, because you didn't touch anything inside the packages; you only added something extra of your own on top.

Great. So what is the next thing we can optimize? Decoupling and modularity are so important. As I showed you before, you have DAGs and operators, with operators as classes, and many teams already follow a really good practice: extract the common functions and classes that are repeated across DAGs and operators, and put them into utils and modules folders. That is really good. Why does it matter for decoupling? Because you will then be able to split those modules and utils out into a separate project. Why would you want to split them into a separate project? Because then you can publish them as a separate package with its own release pipeline. Once it becomes a separate project, you have 100% freedom to test and release as much as you want without touching your Airflow pipelines. No one wants to touch the Airflow pipelines too much, because they deal with data, and anything that deals with data has an impact on the client side: if you touch something and it crashes, your clients don't receive their data in time and they will be yelling at you. But if you keep your packages as separate products, you can test whatever you want and release as many versions as you want. Then what do you need to do? Go back to Airflow, pip install the updated version of the package, and everything keeps working smoothly as before.

Another thing many people forget when building Airflow pipelines for data engineering is this: the business team gives me a request, "I want this data tomorrow", so I just launch a Jupyter Notebook, explore step by step, and once it settles down I copy-paste my code into an Airflow DAG and call it done. That is not even functional programming, and it is very far from object-oriented programming. It's not entirely your fault, because we explore and build proofs of concept very often with Jupyter Notebooks, and that really is great for working step by step, since the notebook caches results and we don't need to rerun everything from the beginning to the end and reload all the data.
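The custom-operator idea mentioned above might look roughly like this; the class, module, and helper names are assumptions for illustration, not Levi's actual code.

```python
from airflow.models import BaseOperator


class CleanAndTransformOperator(BaseOperator):
    """Hypothetical custom operator: all cleaning and transform logic lives here,
    not in the DAG file and not inside Airflow's built-in operators."""

    def __init__(self, source_path: str, target_path: str, **kwargs):
        super().__init__(**kwargs)
        self.source_path = source_path
        self.target_path = target_path

    def execute(self, context):
        # Step-by-step logic: load, clean, transform, write.
        # These helpers are assumed to live in a separate, pip-installed package.
        from my_etl_lib.transform import load_raw, clean, transform, write_silver

        df = load_raw(self.source_path)
        df = clean(df)
        df = transform(df)
        write_silver(df, self.target_path)
```

Because the operator only calls into the shared package, the package can be tested and released on its own, and the Airflow project simply pip installs the new version.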
Rerunning everything from scratch takes time. Okay, but why bring object-oriented programming in here? Again, for decoupling and modularity, and because later on, when you write your unit tests, you will be able to maintain your code for the long term. So, before just making it work, what we need to say to our business colleagues is: give us some time. Why? Because we want to take the time not only to make it work, but to make it work smartly, for the longer term. And trust me, refactoring your code later always takes longer than doing a better job at the beginning.

So what can you do for your pipelines? At the very least, write proper classes and functions; you don't want freestyle code everywhere. You want to create abstract classes for the groups of classes that always share the same actions; a minimal sketch follows below. Also, in your classes you want type hints, a cool feature starting from Python 3.6: before you run anything, your IDE may already tell you that you passed the wrong data type for a variable, so you don't need to wait two hours for your pipeline to run and then watch it crash because of this kind of tiny mistake. You can also use Python features like decorators to make your code look much simpler and more professional.

All right, so we have seen that object-oriented programming is good. Then you also need to think about design patterns. I've been involved in a lot of recruiting processes for data engineers at our company, and every time I ask questions about design patterns, people tell me: "Oh my God, that's a long story; the last time I used a design pattern was almost eight or ten years ago." But it is very important. Our pipeline is not a one-off piece of work; we want to maintain our code over time. A similar data feed might come to us the next day, and we don't want to rebuild everything from scratch; we can simply reuse the classes we have already built, with the only difference being that the new data feed points to a different location. Design patterns are also what make the code unit-testable: each function and each method of a class should do only one thing. Did you follow the single responsibility principle? Did you follow the open-closed principle? These are things that always help the project in the middle and long term. And the good thing about object-oriented programming and design patterns nowadays is that IDEs like PyCharm or Visual Studio Code already offer very good automated refactoring: you just select your code, ask the IDE to refactor it, turn it into methods and classes, and it's done in a second.

Another thing people don't pay enough attention to nowadays in the data domain is readability. The attitude is often: "I just want to write code that I can read; I don't care about other people, because I'm not even sure they will read my code, and the business is pushing me to deliver results tomorrow, so I just make it work." But no: most of the time your code will be read by other teams, by other colleagues, maybe even by other companies. So readability is so important; we don't want to spend too much time reading documentation by hand just to understand the code.
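The abstract-class-plus-type-hints idea could be sketched like this; the class name and method split are assumptions, not the exact classes used at Levi's.

```python
from abc import ABC, abstractmethod

from pyspark.sql import DataFrame


class BaseFeed(ABC):
    """Hypothetical abstract base class: every data feed shares the same
    actions (extract, transform, load), each doing only one thing."""

    def __init__(self, source_path: str, target_path: str) -> None:
        self.source_path = source_path
        self.target_path = target_path

    @abstractmethod
    def extract(self) -> DataFrame:
        ...

    @abstractmethod
    def transform(self, df: DataFrame) -> DataFrame:
        ...

    @abstractmethod
    def load(self, df: DataFrame) -> None:
        ...

    def run(self) -> None:
        # Template method: the same flow for every feed.
        self.load(self.transform(self.extract()))
```

A new, similar feed then only needs to subclass this and point at a different location.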
If you read good code, with good naming conventions for variable names and class names, it reads like an article. Reading code should put you in a good mood; whenever I open code that is pretty, it makes my day really cheerful. Python already has community standards and tools such as Pylint, and so on, which help a lot with checking and formatting the code. You can also take tools like autopep8, which restyles the code, and Black, which formats your code in a second. Keep in mind that these tools help a lot, but they are not perfect: roughly 80% of the work will be done for you, and for the remaining 20% you might still need to adjust or double-check things yourself.

Then, docstrings. Many people build their documentation by hand; please don't do that. You can use Sphinx, for example: once you have docstrings in your functions and classes, you just run one command line and Sphinx builds professional documentation for you, like the big open-source projects have. Nobody writes that documentation by hand; it is generated automatically, and a data catalog works the same way. You have some Markdown files for your data catalog, you just write some basic structure, and the documentation gets built as a static website like the one I show here. It works like Google: on the left side there is a search bar. For your tables: when were they produced? What is the source table they are built on top of? What is the meaning of the columns? Are there any examples? Is this column a primary key? You just put your column name in the search bar and search. This can all be linked into the website you build with Sphinx, and you can share that website with all your business clients or even some external clients.

Cool, so far we have seen what, why, and how we should optimize our Airflow projects with decoupling and modularity. Then how about testing? Testing can be really fun. Testing covers both code quality and data quality, but not that many companies and teams have already set up the kind of great testing we did at Levi's. At Levi's, once we finished the decoupling process and ended up with proper classes and functions, we were able to do unit tests, integration tests, end-to-end tests, and smoke tests.

So what does a unit test mean? A unit test tests a single unit. It is not about feeding input data through the pipeline and then comparing the output data against some business expectation. Instead, you look at the pipeline, take only one task, go into its operator, pick one method of that operator, and test only that method. We might need to build some small fixtures to simulate the input and output and get it done. This way, you can test your code locally, focused on that one unit, and you don't need to wait for your pipeline to finish. Your pipeline might run for two hours or seven hours, but you just want to make sure that the new function you developed for your operator actually works, and you want that result in five seconds. So unit tests are the way to go, and there is a great tool for this, pytest.
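A minimal pytest sketch of one such unit test; the helper under test (drop_rows_missing_key) and its package are assumptions standing in for a real operator method.

```python
# test_transform.py -- a minimal pytest sketch; drop_rows_missing_key is a
# hypothetical helper from the shared, pip-installed package.
from my_etl_lib.transform import drop_rows_missing_key


def test_drop_rows_missing_key_removes_nulls():
    rows = [
        {"order_id": 1, "quantity": 2},
        {"order_id": None, "quantity": 5},   # should be dropped
    ]
    result = drop_rows_missing_key(rows, key="order_id")
    assert result == [{"order_id": 1, "quantity": 2}]


def test_drop_rows_missing_key_keeps_valid_rows_untouched():
    rows = [{"order_id": 42, "quantity": 1}]
    assert drop_rows_missing_key(rows, key="order_id") == rows
```

With the pytest-cov plugin, running pytest --cov=my_etl_lib produces the kind of coverage report described next.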
pytest, an open-source tool, also helps you quickly build a test coverage report that shows you what is covered, where the problems are, and so on. Once you have tested your units, you are ready to move towards a production release. First you release to pre-production, where you want to run your whole pipeline: my own DAG may be working fine, but with all the teams' updates and all the DAGs running together, will they conflict with each other? That is why the integration test is so important to run in the pre-production environment before you release to production, to make sure the whole project works fine in the bigger picture.

And then, in the end, the end-to-end test, which is so important for data projects. How do you verify your schema? Very often the source system gives us a table with ten columns, and we create our schema around those ten columns. But one day they suddenly have eleven columns; they added one more column without informing us. Then our pipeline might crash, because we defined our schema for only ten columns. How do you prevent this? You have to have a check in your CI/CD process that tests your Bronze level against your Silver and Gold levels: fetch some Bronze-level data and automatically compare it with your Silver and Gold levels, as sketched below. Is the total number of columns the same? Is the data type of each column the same? If there is any change, send an alert to a dashboard or by email. That way, before your pipeline runs for two hours and crashes, you immediately notice what is going on and can fix it as soon as possible.

Once you have all the unit tests and integration tests to make sure everything runs well in pre-production, and you have checked your schema from source to destination, you might also want smoke tests. A smoke test uses a bit more data than your production volume, to make sure that when you release to production, your products, services, and architecture are strong enough to handle that much data running at the same time or at different times. All of that is automated for you, to make sure you build professional data engineering pipelines.

So, we went through some optimization steps for our projects, I explained a little why modularity and software engineering best practices matter, and I also explained what kinds of automated testing you can do and how to automate them. That covers the points I needed to present today. Due to the time limit this is quite a high-level session, but if you are interested in how to set it up step by step, please reach out to me on LinkedIn; it would be my pleasure to share and discuss more details with you. At the same time, I'm glad to announce that Levi's data team is hiring for all our offices in the USA, Belgium, France, the UK, and China. If you are interested in the fantastic data and AI journey with us for the next 167 years, please scan the QR code and check our open positions in detail. Cool, so that concludes my talk for today; thank you for coming. Now I'd be very interested to hear your concerns and questions.
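The Bronze-to-Silver schema check mentioned above could be sketched roughly like this; the table paths and the send_alert() hook are assumptions, and the sketch assumes the Silver table mirrors the Bronze schema (in practice each zone would be compared against its own expected schema).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema_check").getOrCreate()

# Paths are illustrative; in practice they would come from the feed's config.
bronze = spark.read.parquet("s3://datalake/bronze/sales/")
silver = spark.read.parquet("s3://datalake/silver/sales/")

bronze_schema = {f.name: f.dataType.simpleString() for f in bronze.schema.fields}
silver_schema = {f.name: f.dataType.simpleString() for f in silver.schema.fields}

problems = []
if len(bronze_schema) != len(silver_schema):
    problems.append(
        f"column count changed: bronze={len(bronze_schema)}, silver={len(silver_schema)}"
    )
for name, dtype in silver_schema.items():
    if name not in bronze_schema:
        problems.append(f"column '{name}' is missing upstream")
    elif bronze_schema[name] != dtype:
        problems.append(f"type changed for '{name}': {bronze_schema[name]} -> {dtype}")

if problems:
    # send_alert() is a hypothetical hook: e-mail, Slack, or a dashboard metric.
    from my_etl_lib.alerts import send_alert
    send_alert("Schema drift detected in sales feed", problems)
```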


 
About Qiang Meng

Levi Strauss & Co.

Qiang Meng is a data engineer at Levi Strauss & Co. leading a team involved in building future data platforms across the entire fashion value chain, from design and production to customer experience.
Qiang has over 8 years of experience in creating enterprise analytics products applying Apache Spark, Hadoop, YARN, and Cloud Services. In the past, Qiang worked on the Big Data Platform at H&M Group and Softbank Robotics Europe.
Qiang gained his MSc. in CS from Telecom Paristech and MSc. in Applied Maths from the University of Paris XI. In addition, Qiang is a fashion designer and a proud member of the LGBTQI community.