In this demo, we give you a first look at Delta Live Tables, a cloud service that makes reliable ETL (extract, transform, and load) easy on Delta Lake. It helps data engineering teams simplify ETL development with a simple UI and declarative tooling, improve data reliability through defined data quality rules and bad-data monitoring, and scale operations with deep visibility through an event log.
Delta Live Tables is a new managed service for ETL from Databricks that builds upon our existing Delta Lake and Databricks technologies. It makes it a great deal easier for engineers to focus on the data SLAs they want to meet, without having to manually engineer significant amounts of the pipeline. Delta Live Tables is going to manage your data flow for you, and abstract away a lot of the complicated bits of getting streaming right. It’s going to make it a lot simpler to switch back and forth between batch and streaming, as well as make it easier to take advantage of the incremental processing that streaming provides, so you get the benefits of both improved performance and reduced processing costs by minimizing the amount of data that you actually need to process. It also makes it a great deal simpler to manage your pipelines, by providing a new user interface that allows you to start and stop pipelines with the click of a button, as well as dictate whether you want to restart the pipeline or refresh certain tables in it. It makes it a great deal easier to come in and deploy updates to your pipelines, and to maintain the health of your pipelines overall.
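The declarative idea described above, where you define tables and their dependencies and the service figures out the data flow, can be sketched in plain Python. This is a toy illustration of the concept, not the actual Delta Live Tables API; the names `table` and `run_pipeline` and the sample tables are made up for this sketch.

```python
# Toy sketch of declarative table definitions with dependency resolution.
# Not the real DLT API: `table`, `run_pipeline`, and the table names are
# illustrative only.

TABLES = {}

def table(*, depends_on=()):
    """Register a table-producing function and the tables it reads from."""
    def register(fn):
        TABLES[fn.__name__] = (fn, tuple(depends_on))
        return fn
    return register

@table()
def raw_events():
    # A raw (bronze-style) input table.
    return [{"id": 1, "value": 10}, {"id": None, "value": -5}]

@table(depends_on=("raw_events",))
def clean_events(raw_events):
    # A downstream table declared in terms of its upstream table.
    return [row for row in raw_events if row["id"] is not None]

def run_pipeline():
    """Resolve the declared dependencies and materialize each table once."""
    results = {}
    def materialize(name):
        if name not in results:
            fn, deps = TABLES[name]
            results[name] = fn(*(materialize(d) for d in deps))
        return results[name]
    for name in TABLES:
        materialize(name)
    return results
```

The point of the sketch is the inversion of control: you declare what each table is, and the runner, like the managed service, decides execution order from the dependency graph.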
We’re also going to aggregate up a lot of information about your pipeline, and make it simpler for you to get access to analytics that tell you what is actually happening inside your pipeline. We’re providing an event log where you’ll get information about what’s going on inside the pipeline, such as who is starting it up and what’s happening inside it. In addition to surfacing this information in the new user interface, we’ll also be providing it in a centralized Delta table, deployed per pipeline, called the event log table. The event log table is going to contain meta-information about your pipeline, such as the duration of various batches’ processing, as well as the results of data quality rules. Delta Live Tables also treats data quality as a first-class citizen: you’re able to define data quality rules over your pipeline, and make sure that bad data never makes it into your downstream tables.
At the beginning, it’ll start up your cluster just like a normal Databricks pipeline. Later this year, we’re going to make the pipeline startup time much faster. Once this pipeline has actually started up and gotten running, there will be information surfaced in the user interface showing you what’s actually happening in the current run of the pipeline. We also will be aggregating up information about all the different processes that are running in your particular pipeline, and giving you a centralized single pane of glass where you can come in and quickly assess the status of everything in your pipeline.
The graph showing your data flow is already displayed here, and you can move around and visually assess the current data flow graph of your pipeline. Once the cluster has actually started up and we start processing data, you get this centralized view showing you the current status of every stream in your pipeline. And if we go over and look at the individual tables in our pipeline, we have the same information displayed per table, as well as information about the batch duration and the results of our data quality rules. These data quality rules can be as simple as a NULL check on a column, or they can be as complex as a user-defined function. Anything that works as a DataFrame filter or a SQL “WHERE” clause will also work here as a data quality rule, so they can be arbitrarily complex. At the bronze layer, I just want to get an idea of how bad the data quality is for my raw data. I’m not going to drop any of this bad data; I’m simply going to allow it to pass through. The types of actions you can take for these data quality rules include things like dropping the bad data, and later this year we’ll also be adding alerting and quarantining for bad data.
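The bronze-layer behavior described here, tracking rule results while letting every row through, can be sketched with rules as plain row predicates. This is an illustration of the semantics, not the Delta Live Tables expectation API; the rule names and sample rows are invented for the example.

```python
# Sketch: data quality rules as arbitrary row predicates, tracked but not
# enforced, as at the bronze layer. Rule names and data are illustrative.

rules = {
    "id_not_null": lambda row: row["id"] is not None,
    "amount_positive": lambda row: row["amount"] > 0,
}

def apply_expectations(rows, rules):
    """Count passes/failures per rule, but let every row through unchanged."""
    metrics = {name: {"passed": 0, "failed": 0} for name in rules}
    for row in rows:
        for name, predicate in rules.items():
            key = "passed" if predicate(row) else "failed"
            metrics[name][key] += 1
    return rows, metrics

bronze = [
    {"id": 1, "amount": 25.0},
    {"id": None, "amount": 13.5},   # violates id_not_null
    {"id": 3, "amount": -4.0},      # violates amount_positive
]
kept, metrics = apply_expectations(bronze, rules)
```

Because a rule is just a predicate over a row, it can be as simple as a NULL check or as complex as any function you can write, which mirrors the “anything that works as a filter” point above.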
Then at my silver layer, I don’t want to include any of my bad data, so I’m going to start dropping the data that’s violating my rules. And I also have a quarantine table where I’ve negated all the rules, so I’m able to capture all the data that is failing one of my data quality rules. I can come back and access this bad data later on, and potentially repair it. If I repair that data, I can simply put it back into my silver table of good data, and the pipeline will automatically pick it up and process it into any downstream tables I have. So if you are building reports or doing further analysis after this, you can simply put in the data that you’ve cleaned, and the pipeline will immediately pick it up.
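The silver-plus-quarantine pattern, drop violating rows from the good table and capture them with the negated rules, can be sketched as a simple split. Again a plain-Python illustration, not the actual API; rules and rows are invented for the example.

```python
# Sketch: the silver table keeps only rows passing every rule; the quarantine
# table is the negation, capturing rows that fail at least one rule.

rules = {
    "id_not_null": lambda row: row["id"] is not None,
    "amount_positive": lambda row: row["amount"] > 0,
}

def split_silver_quarantine(rows, rules):
    """Split rows into (silver, quarantine) using the rules and their negation."""
    def passes_all(row):
        return all(predicate(row) for predicate in rules.values())
    silver = [r for r in rows if passes_all(r)]
    quarantine = [r for r in rows if not passes_all(r)]  # negated rules
    return silver, quarantine

rows = [
    {"id": 1, "amount": 25.0},
    {"id": None, "amount": 13.5},
    {"id": 3, "amount": -4.0},
]
silver, quarantine = split_silver_quarantine(rows, rules)
```

Repairing a quarantined row and appending it back to the silver table is then just a write to the good table, which is why downstream tables pick it up automatically.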
And in addition to running your actual data pipeline for you and applying the data quality rules, we’re also capturing all this information about everything that’s happening in the pipeline, and writing it into a centralized Delta table that we call the event log table. In the event log table, you’re going to get information about where this pipeline is running, audit information about what region, what cloud, and what users are running this pipeline, and you’ll also get more interesting information about the actual events that are happening in your pipeline.
At the beginning, this is relatively mundane: it’s simply waiting for the cluster to start and setting up the tables. But once we actually get going and start processing data, there’s information that is more useful to us as data engineers. In particular, there are two types of information that are most interesting. First, we’re capturing lineage information about your pipeline. What this means is that as you update your pipeline over time, you’ll be able to go back and see what the lineage of the pipeline was at a particular point in time. This makes it much easier to answer questions like, “three months ago, what was the actual data pipeline that was running here, and where was data being written to?” And if you find problems with your data later on, you can go back and audit these pipelines to determine exactly what transformations were running at that time. Today, we have table-level lineage, and later this year, we’ll also be building column-level lineage. So here’s an example of the table-level lineage, where I have my output tables and whatever tables are upstream of each output.
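The point-in-time lineage question, “what did this pipeline look like three months ago?”, can be sketched as replaying lineage records up to a cutoff date. The record shape (`ts`, `output`, `inputs`) is invented for this illustration and is not the real event log schema.

```python
# Sketch: reconstructing the table-level lineage graph as of a given date
# from time-stamped lineage records. Field names are illustrative only.

lineage_events = [
    {"ts": "2021-01-15", "output": "silver_orders",
     "inputs": ["bronze_orders"]},
    # A later pipeline update added a second upstream table.
    {"ts": "2021-04-01", "output": "silver_orders",
     "inputs": ["bronze_orders", "bronze_refunds"]},
]

def lineage_as_of(events, as_of):
    """Latest recorded upstream tables for each output, as of a given date."""
    graph = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["ts"] <= as_of:
            graph[event["output"]] = event["inputs"]
    return graph
```

Auditing then reduces to querying the log with a date filter: the same table can show different upstream inputs depending on when you ask.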
Then, in addition to the lineage, we’re also capturing the result of every batch that is processed into your Delta table, and so we’re capturing the result of each individual rule that is applied to the data as well. So I have here the high-level number of output rows (in this particular batch there are 10,000 output rows), and then I also have the result of each individual expectation that was applied to this particular batch. I can go through and look at this, and quickly assess both the result of a particular batch writing to a particular table in this pipeline, and the result of rules that may be applied to many tables across many different pipelines. If you’re working with data that has legal regulations on it, oftentimes you will want to apply the same rules to every place in your data pipeline and make sure that your data is not violating those regulations.
And I can use this to then split out this information per batch of data that’s processed, and analyze it at the batch level. So I can come in and identify exactly which rule this data is violating, and that informs what I should do in order to correct the data as well. Then, since this event log table is itself a Delta table, you can also set up dashboards and visualizations that query the event log table to answer questions like: how has the data quality of my pipeline changed over time? Which rules are violated the most? Where am I getting the most bad data from? If you are ingesting data from many different sources, perhaps that bad data is actually all coming from a broken system somewhere. So here are some examples of various visualizations that you may be interested in building, things like the ratio of bad data to good data at a particular table. Or maybe you want to assess the result of individual rules across all your different tables.
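Because the event log is itself a queryable table, a dashboard metric like the bad-to-good data ratio reduces to a simple aggregation over its records. The record shape below (`table`, `batch`, `passed`, `failed`) is invented for this sketch and does not reflect the actual event log schema.

```python
# Sketch: aggregating per-batch expectation results from an event-log-style
# table into a bad-data ratio per table. Record shape is illustrative.

event_log = [
    {"table": "silver_orders",  "batch": 0, "passed": 980, "failed": 20},
    {"table": "silver_orders",  "batch": 1, "passed": 940, "failed": 60},
    {"table": "silver_refunds", "batch": 0, "passed": 500, "failed": 2},
]

def bad_data_ratio(events):
    """Ratio of failing rows to total rows, per table, across all batches."""
    totals = {}
    for e in events:
        passed, failed = totals.get(e["table"], (0, 0))
        totals[e["table"]] = (passed + e["passed"], failed + e["failed"])
    return {table: failed / (passed + failed)
            for table, (passed, failed) in totals.items()}

ratios = bad_data_ratio(event_log)
```

In practice the same aggregation would be a SQL group-by over the event log table, feeding whatever dashboarding tool you already use.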
I can also look at a high level and see which tables are capturing the most bad data. Here, things are going right, because most of my bad data is going into the quarantine table. And then we also have the ordered list of batches that were processed by my stream, so I can look over time at the results of data quality rules throughout the history of a particular table, and answer questions like: has the data quality been getting worse for this pipeline over time?
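The “is quality getting worse over time?” question is just the ordered list of batch results turned into a trend. As with the previous sketch, the record shape here is invented for illustration, not the real event log schema.

```python
# Sketch: trending a table's data quality over its ordered batch history.
# Record shape is illustrative only.

batch_events = [
    {"table": "silver_orders", "batch": 0, "passed": 990, "failed": 10},
    {"table": "silver_orders", "batch": 1, "passed": 970, "failed": 30},
    {"table": "silver_orders", "batch": 2, "passed": 940, "failed": 60},
]

def failure_rate_by_batch(events, table):
    """Per-batch failure rate for one table, in processing order."""
    mine = sorted((e for e in events if e["table"] == table),
                  key=lambda e: e["batch"])
    return [e["failed"] / (e["passed"] + e["failed"]) for e in mine]

trend = failure_rate_by_batch(batch_events, "silver_orders")
# A steadily rising failure rate suggests data quality is degrading over time.
```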
And as we build out this product, we’re also going to start incorporating these kinds of visualizations directly into the product, so that we can pre-render the visualizations that are useful and help you make decisions about your pipelines. We have lots of exciting developments planned this year for Delta Live Tables, and we’re very much looking forward to our customers interacting with this new product from Databricks. Thanks.