Diving Into Delta Lake

Enforcing and Evolving the Schema

Denny Lee. Developer Advocate at Databricks
Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data science engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master's in Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers.
Andreas Neumann. Staff Software Engineer at Databricks
Andreas Neumann is a Staff Software Engineer at Databricks where he focuses on Structured Streaming and Delta Lake. He has previously built big data systems at Google, Cask Data, Yahoo! and IBM. Andreas holds a PhD in computer science from the University of Trier, Germany.

Series Details

This session is part of the Diving Deep into Delta Lake series with Denny Lee and the Delta Lake team.

Session Abstract

Data, like our experiences, is always evolving and accumulating. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions – new ways of seeing things we had no conception of before. These mental models are not unlike a table’s schema, defining how we categorize and process new information.

This brings us to schema management. As business problems and requirements evolve over time, so too does the structure of your data. With Delta Lake, as the data changes, incorporating new dimensions is easy. Users have access to simple semantics to control the schema of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, as well as schema evolution, which enables them to automatically add new columns of rich data when those columns belong. In this tech talk, we’ll dive into the use of these tools.

In this tech talk, you will learn about:

  • Understanding table schemas and schema enforcement
  • How does schema enforcement work?
  • How is schema enforcement useful?
  • Preventing data dilution
  • How does schema evolution work?
  • How is schema evolution useful?

What you need:
Sign up for Community Edition here and access the workshop presentation materials and sample notebooks on GitHub.

Video Transcript

– So today, we’re going to be talking about Enforcing and Evolving the Schema with Delta Lake. This is part two of our Diving Into Delta Lake series, and my name is Andreas Neumann, and my co-host is Denny Lee. So, a very quick introduction about myself: I’m a software engineer at Databricks. I dedicate almost all of my time to Delta Lake and also to how you can build data pipelines with Delta. I have been doing this for quite a while. I’ve been building pipelines on Spark since 2014, and I’ve formerly done that at other places like Google Cloud, the startup Cask Data, Yahoo! and IBM. I’m originally from Germany, and so that’s why my computer science degrees are also from German universities that you’ve probably never heard of. But with this, I’ll pass it to Denny to introduce himself.

– Hey, thanks Andreas. Hi everybody, my name is Denny Lee. I’m a developer advocate at Databricks. I’ve been working with Apache Spark since the 0.5, 0.6 days. Before this, I was the senior director of Data Science Engineering at Concur, and I’m also a former Microsoftie, being up here in the Seattle area. Ooh, go Hawks, that’s right, I called it out everybody. So, as a Microsoftie, I was involved with the Cosmos DB team, HDInsight (Isotope), and also with the SQL Server team, so I’m a longtime database guy. So hence the reason why Delta Lake is awesome. And then, I have a master’s in Biomedical Informatics from OHSU and, I’m not as interesting as Andreas, I just have a BS in Physiology from McGill. Basically, Asian parents, I was supposed to be a doctor; turned out that was already a bad idea, so there you go. (Denny chuckles) – I think that’s very interesting though (laughing). Okay, so– – We have the time my friend (laughing softly). – (laughing) All right, so just a very quick recap: last week, we discussed the transaction log of Delta, and just in case you missed this, I highly recommend that you catch up on it. The video is on YouTube and we’ll post the link later on.

Overview

So let’s go to, the topic of today’s talk.

What we’re gonna talk about is that data is constantly evolving and constantly changing. All right? Why is that? Because it reflects all the things that we experience. It reflects the business problems and the business requirements that we have. So as those change, the structure of our data changes, right? Now, when that happens, we wanna have predictability, right? We wanna be in control of how that happens. And it turns out that Delta Lake makes that very easy. Delta Lake has good ways to control how a schema changes and also very good ways of enforcing a schema. So, what exactly is schema enforcement? Enforcement, which is also often called validation, simply means that it prevents us from accidentally writing bad data to our table, writing data to our table that is not compatible with its schema or with its structure.

But when we actually want to change the schema of the table, then we can use schema evolution. And that allows us to change the schema going forward in a very controlled way.

Understanding Table Schemas

So, what is a table schema?

The table schema describes the structure of our data, right? In Apache Spark, for example, every DataFrame has a schema. And if you use Delta Lake as your storage format, then the schema of that data becomes the schema of the table, and it is saved in a JSON format inside the transaction log. We saw the transaction log last week, right? And what does this schema look like? So it’s basically a list of fields, right? Each field has a name.


Each field has a type and it’ll also say whether it’s nullable, right? If it’s nullable, that means the field doesn’t have to be present. But if nullable is false, then every record that’s written to the table is required to have this column or this field.
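
As a rough illustration, here is a minimal PySpark sketch of what those fields look like in code. It assumes a SparkSession named spark with the Delta Lake package available; the path and column names are made up for the example and are not the demo notebook’s exact code.

    # Minimal sketch: define a schema with nullable flags and let Delta record it.
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    loan_schema = StructType([
        StructField("loan_id", StringType(), nullable=False),     # must appear in every record
        StructField("funded_amnt", DoubleType(), nullable=True),  # may be null
        StructField("addr_state", StringType(), nullable=True),
    ])

    df = spark.createDataFrame([("l-001", 1000.0, "WA")], schema=loan_schema)
    df.write.format("delta").save("/tmp/loans_delta")

    # Delta stores this schema as a JSON string inside the transaction log (_delta_log).
    spark.read.format("delta").load("/tmp/loans_delta").printSchema()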

What is Schema Enforcement?

Okay. So, knowing what the schema is, let’s talk about what schema enforcement is. Schema enforcement, also known as schema validation, rejects any writes to a table that do not match the table schema. So, what does that mean? Well, it happens on write, all right? Any time I write, whether I overwrite or append to a table, schema enforcement is applied. And if the schema of the data that I’m writing is not compatible with the schema of the table, then Delta Lake will cancel the transaction, right?

That means no data is written and this is atomic, right?

You never have a case where some data gets written. Either your entire transaction goes through or it is canceled. And also, Delta Lake will throw an exception and basically your job will fail. That’s how you know that there was a mismatch.

Now, what are the rules of schema enforcement?

Schema Enforcement Rules

So, the data that you’re writing to the table cannot contain any additional columns, right? If I have any column that isn’t present in the table’s schema, then the table will not accept this data.

It’s okay, on the other hand, if the data that I’m writing does not contain all the columns, right? Some of the columns can be missing, and then what happens is, in the data that I’m writing, these columns will be assigned null values. And that’s okay, but it’s not okay if those columns are not nullable, right? And a lot of you have probably heard that it’s a very good practice to always make all fields nullable, because that allows you to be more flexible with the data that you write to a table.

In addition, you cannot have columns that have a different data type than what’s defined in the table schema. Right, so for example, if my table says this column has type string, but the data that I’m trying to write has type integer, then schema enforcement will fail this write, it’ll throw an exception, and no writes will happen.

Another pretty important and tricky case here is, you cannot have column names that differ from the table schema only by case. So what does that mean? Let’s say you have a column ‘Foo’ with a capital F; then you cannot have another column where the F is a lowercase letter. And why is that? There’s a little bit of background here. So, this is about whether your jobs are case sensitive or case insensitive. So Spark can do either.

But Parquet, which is the default storage format of Delta Lake, is always case sensitive. And Delta, because it wants to be able to deal with both of these, is case-preserving. So it will simply pass through the case that you give it. But when it stores the schema, it won’t allow you to have two columns that have the same name except for case. And this is simply to prevent potential mistakes and to prevent unexpected things that might happen to your data if you’re not aware of the case sensitivity issues. All right, so that was schema enforcement.
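
To make these rules concrete, here is a small, hedged PySpark sketch of an append being rejected because it carries a column the table doesn’t have. It reuses the hypothetical /tmp/loans_delta table from the earlier sketch; the extra column name is invented for illustration.

    # Minimal sketch: schema enforcement rejecting an append with an unexpected column.
    from pyspark.sql.functions import lit
    from pyspark.sql.utils import AnalysisException

    bad_df = (
        spark.read.format("delta").load("/tmp/loans_delta")
             .withColumn("credit_score", lit(700))   # column not in the table's schema
    )

    try:
        bad_df.write.format("delta").mode("append").save("/tmp/loans_delta")
    except AnalysisException as e:
        # The transaction is cancelled atomically: no files are committed and the job fails.
        print("Write rejected by schema enforcement:", e)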

How Is Schema Enforcement Useful?

And why is schema enforcement useful? What are situations where I need this, right? And it’s basically any time when I run production systems that really depend on the fixed structure of the data that they’re reading, all right? So for example, machine learning algorithms. When I’m training my model, it expects an exact kind of data there, and if that data changes, then I don’t even know what that model means. BI dashboards.

Pretty much any data analytics and visualization tools and any production system that expects a highly-structured and strongly-typed schema.

So, in order to do that, because your data typically arrives in your data center or in the cloud with schema variation, a lot of people build pipelines that employ a “multi-hop” approach; a rough sketch of this idea follows just below. So, the first hop will just ingest raw data, the next one will filter out bad data, and the next one might canonicalize the schema, so that in the end, when you have your gold tables, all the records in those tables conform to the expected schema. Okay, so now that we’ve talked about schema enforcement, let’s talk about evolution, right?
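
The multi-hop idea mentioned above could look roughly like this sketch in PySpark. All paths, table layouts, and column names here are hypothetical and only illustrate the shape of such a pipeline.

    # Rough sketch of a multi-hop pipeline: raw -> filtered -> canonical schema.
    raw = spark.read.json("/mnt/ingest/raw_loans")                        # hop 1: ingest raw data
    raw.write.format("delta").mode("append").save("/delta/bronze/loans")

    bronze = spark.read.format("delta").load("/delta/bronze/loans")
    bronze.filter("loan_id IS NOT NULL") \
          .write.format("delta").mode("append").save("/delta/silver/loans")   # hop 2: drop bad records

    silver = spark.read.format("delta").load("/delta/silver/loans")
    silver.select("loan_id", "funded_amnt", "paid_amnt", "addr_state") \
          .write.format("delta").mode("append").save("/delta/gold/loans")     # hop 3: canonical schema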

What is Schema Evolution?

So schema enforcement was a way that allows us to fix the schema of our data. Schema evolution allows us to change the schema of our data in a very controlled way. All right? So, it allows you to change a table’s schema to accommodate data that has been changing over time. All right? And most commonly, this is used for operations like append and overwrite. And in order to do that in Spark with Delta Lake, you use an option, and that option is called mergeSchema, and you just add that option to your write statement. There’s also a way to do this through a Spark configuration: you set spark.databricks.delta.schema.autoMerge.enabled to true. However, if you do this, you have to be aware that schema enforcement will no longer apply, right?

So, now when you start writing data to your table that has additional columns, for example, the schema of your table will change. And so you need to know what you’re doing. Okay, so with this option, mergeSchema set to true, we can basically make any schema changes that are read-compatible. And what does read-compatible mean? It means that existing data in the table can still be read according to the new schema. Right? And so we do this when we append to or overwrite the table, and the following types of changes are allowed here. So we can add new columns, right? And for the old data that’s already in the table, those columns will simply be null. We can change a column from non-nullable to nullable, right? So that’s basically a relaxation of the existing schema, right? And so the old data will still fit. And we can do upcasts, right? Where we go from a smaller type to a bigger type. So any byte can be represented as a short, or a short can be represented as an integer. So the old data can still be read.
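
Here is a minimal sketch of what that opt-in looks like in PySpark, using the hypothetical table and column from the earlier sketches; the session-wide configuration name may vary slightly between Delta Lake versions.

    # Minimal sketch: read-compatible schema evolution on append with mergeSchema.
    from pyspark.sql.functions import lit

    new_df = (
        spark.read.format("delta").load("/tmp/loans_delta")
             .withColumn("credit_score", lit(700))    # a brand-new column
    )

    new_df.write.format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .save("/tmp/loans_delta")

    # Alternatively, enable it for the whole session (this bypasses schema enforcement
    # for every write, so use it deliberately):
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")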

There is a somewhat stronger form of schema evolution. And for this form of evolution, you would use the option “overwriteSchema.” And that allows us to do schema changes that are not really compatible with the existing data. So, typically we do this when we overwrite the data, right? Because if we only append, then the old data in the table becomes useless. But if we overwrite all the data, then we can change the schema. So, what types of schema changes can we make here? We can drop a column, for example, or we could change the data type of a column, right? So, something that was previously a string can now be an integer, and that wasn’t possible with mergeSchema. And we can also rename columns, even if we just change the case of column names. This is all allowed if we do overwriteSchema.
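
And a sketch of that stronger form, which rewrites both the data and the schema of the hypothetical table from above; the dropped and renamed columns are only examples.

    # Minimal sketch: non-read-compatible changes require overwriting data and schema.
    df = spark.read.format("delta").load("/tmp/loans_delta")

    reshaped = (
        df.drop("addr_state")                                   # drop a column
          .withColumnRenamed("funded_amnt", "funded_amount")    # rename a column
    )

    reshaped.write.format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .save("/tmp/loans_delta")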

So, just one more note, in Spark 3.0, there will also be DDL syntax that allows you to alter the schema of a table and that’s gonna be called the ‘ALTER TABLE’ statement.
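
For reference, the DDL route looks roughly like this once you are on Spark 3.0 with Delta’s SQL support enabled; the table name here is hypothetical.

    # Sketch of the DDL route on Spark 3.0+ with Delta Lake's SQL extensions.
    spark.sql("ALTER TABLE loans_delta ADD COLUMNS (credit_score INT)")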

So, this concludes this part of the introduction, and I’m gonna proceed with a demo.

Which screen is this? Okay, so everybody should be able to see my notebook now. So, I’m gonna use this notebook to demonstrate how schema enforcement and how schema evolution work. And I would like to say that you can actually try this yourself. All these demos, excuse me, all of this will be available for you. You can try and run these notebooks in your own clusters in Databricks. And if you don’t have a Databricks account, you can use Databricks Community Edition, right? And all of this will be available after this webinar. The data that we’re using here is actually public data and it’s from a website called ‘Lending Club’. And it basically describes loans that were funded during a certain time.

Okay, here’s a very quick intro of Delta Lake, and I am not gonna go through all of this. The two things that are important to us here in this talk are that it has ACID transactions, so we know that writes either go through or they fail, and so our data will always have integrity. And the really important part here is the schema enforcement and the schema evolution. We will also do a little bit of looking at the history of a table and doing a little bit of time travel, just to play around with the schema at different times.

Okay, so to begin with, we wanna show what this would look like if you did not have Delta Lake. So, Delta Lake uses Parquet files as its default storage format. And so let’s just play with some data, just with Parquet. And the first thing we’re gonna do here is we’re gonna download this data. Okay, so the data is now here, in this particular location, and that’s the data that we’re gonna be working with now. So, the first thing I’m doing is just a little bit of setup, importing a couple of things and setting some options, and creating a working directory for this experiment. Okay, and so that has happened. What I’m doing now is I’m gonna create a Parquet table, right? And the way I’m gonna create this is I’m gonna read this data that was downloaded, and I’m not gonna do much with it. I’m just gonna write it back using Parquet format. And then we’re gonna create a view over this data, and this view will allow us to run SQL queries over that data.

So, this is now running, okay, so this went pretty quickly. And if we look at this table now, let’s just look at the first 20 records here. We can see that this table has a schema, right? Every row here has a loan ID. It has a funded amount, they all appear to be a thousand. And then it has the paid amount, this is how much has been paid off. And then it also has the state where this loan was funded. And this looks pretty uniform and pretty good data.

Let’s see how many records are in this table. So we’re doing this with a simple SQL count query. And that was fast: 14,705. Okay, so this is the number of records that are in this data set, and now we wanna take this table and we wanna append more data to it, right? And in order to do that, we’re gonna run a streaming query. We’re gonna do Structured Streaming. And the stream we’re gonna run is gonna generate random data that also has loan IDs and loan amounts. All right? So for this, I’m just defining a couple of utility functions here. Here’s something that generates the states randomly.

The main method here is this one, right? So, what it does is it uses the Spark format ‘rate’. So what ‘rate’ does is that it simply generates rows. Every second it generates, I think here, it generates five rows. And so each row will have a timestamp and a value, and then we can use those things to create new columns from that, right? So here we take that value and add 10,000 to it, and then we have a new loan ID. And so on. And so, we’re basically generating these four columns here that were in the schema of the table that we saw up here. Okay, so yeah, this looks pretty much like the same data, same kind of data. And then we start a query that simply, as a streaming query, keeps writing to that table every 10 seconds. All right, and I’m also defining a utility to stop my streams when I’m done with them. All right, so let’s run this query, right? So this should now, every 10 seconds, add some data. Let’s look at this query. It’s running, it’s still initializing it seems, but pretty soon we should be seeing some change here. Okay, so now it’s processing data. We can see that here. And so let’s look at this table, let’s count the number of records in the table. Okay, so it looks like it has written some data and it should actually be adding more and more data. Yeah, we see that the number of records in this table is steadily going up. So it looks like this streaming query is working, and it does what it’s supposed to do. But wait, we had 14,705 rows in this table, and where did those go? I suddenly only have 145. This is weird.

Well, let’s look at this data again. Let’s see. So let’s just look at a few of the records. Okay. Oh my, yeah, so I see that somehow my data is different. This data has a different schema. It has two extra columns: it has a timestamp and it has a value. Okay, so where do these come from?

This is confusing, but let’s see. Well, let’s look at this code again. And, if we look at this closely, we see that it uses this format that’s called ‘rate’ and that creates these two columns. And when we say withColumn, then we add more columns to that DataFrame. We never drop those two columns, and that’s why we now have a schema that has two additional columns.
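
For context, the generator looks roughly like this sketch. The column names and rate are reconstructed from the walkthrough rather than taken from the notebook; the point is that the ‘rate’ source always emits timestamp and value, and nothing here drops them.

    # Rough reconstruction of the demo's generator; the 'rate' source emits
    # (timestamp, value) rows, and withColumn only adds columns on top of them.
    from pyspark.sql.functions import col, lit

    stream_df = (
        spark.readStream.format("rate").option("rowsPerSecond", 5).load()
             .withColumn("loan_id", col("value") + 10000)
             .withColumn("funded_amnt", lit(1000))
             .withColumn("paid_amnt", lit(0.0))
             .withColumn("addr_state", lit("CA"))
        # no .select(...) here, so timestamp and value silently remain in the schema
    )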

Well, with this, we now have two kinds of data in the table. We have some data that has four columns, some data that has six columns, and in Parquet file format, it’s really very unpredictable what’s gonna happen and how Spark is gonna read that data. And it turns out that this can lead to data loss, because Spark will simply ignore some of those files that are in the table. So, this didn’t work very well, all right. And now I wanna show you how Delta Lake can protect us from this kind of situation where we are unknowingly dropping data.

Okay, so we’re gonna do a very similar setup as we did for Parquet, but this time we’re just gonna do it with Delta, right? So we’re reading this file again, that we downloaded, but now we’re writing it in format ‘delta’ and we’re writing it to a different path. And again, we’re creating a temporary view so that we can query this with SQL.

So, let’s look at this data one more time. Okay, so it has 14,705 rows. That’s exactly what we expected. And if we look at some of the records, yeah, they look exactly the same as they did when we used Parquet. So, just as a quick deep dive into how Delta Lake stores the schema, let’s look at the files that are actually in the directory that represents this table.

So here we have a Parquet file, and this is the one that was written. But we also have an extra directory, that’s the Delta log, and the Delta log contains Delta’s transaction log. So let’s look at the files that are here. So this should contain ‘commits’. Because we’ve only run one little batch job against this, there’s only a single commit, and that commit is represented by this JSON file. All right, so let’s take a look at this JSON file, what does this look like? Okay, so this is pretty complex JSON. Let’s look at this in a different way, because Spark has a way to read this as JSON. Okay, now we can see the structure of this. And we can see that this is a pretty nested and pretty rich JSON structure. It has an ‘add’ field, which says what information was added in this commit. It has a general commit info, and it also has metadata and the protocol. So let’s look at these things in detail. The first one we’re looking at is the commit information.
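
If you want to peek at the log yourself, a minimal sketch looks like this; the table path is the hypothetical one from the earlier sketches, and the commit file name is the zero-padded version number.

    # Minimal sketch: read the first commit of the transaction log as JSON.
    commit0 = spark.read.json("/tmp/loans_delta/_delta_log/00000000000000000000.json")
    commit0.printSchema()   # nested actions: add, commitInfo, metaData, protocol

    # The table schema lives in the metaData action as a JSON string.
    commit0.select("metaData.schemaString").where("metaData IS NOT NULL").show(truncate=False)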

So this commit, apparently, was written on a cluster with this ID.

We can see what the isolation level was, we can see the ID of the notebook that we’re in. If you look at my header here, you can see that this is indeed the ID of my notebook. And we can see a timestamp, and we can see my user ID and my email address. So that’s pretty cool. That gives us a lot of information. Now let’s look at what information was added, what data was added in this commit. And what we see here is, there was a data change, there is a modification timestamp. We also have a path, and this path matches what we saw in the directory listing up here, right? That’s exactly the path that belongs to this commit.

And we see some stats, right? The statistics help Delta optimize queries. And we see that these statistics exist for all four columns, right? The loan ID, the funded amount, the paid amount and the state. Now, let’s look at the metadata. And the metadata is actually very interesting. If we look at this, we see that it has a creation time, and it has the format of the file, but most importantly, it has the schemaString. So this is the table’s schema serialized as a JSON string, and it describes exactly the fields that are in this table, all right? So there’s a loan ID, there’s a funded amount, there’s a paid amount, and there’s the state. Okay, so now we know Delta records the schema of a table in its metadata.

Let’s run a streaming count on this table, because what we’re gonna do next is we’re gonna start appending more data to this table. And so running this streaming count will continuously count the table and update these counts in real time. And the first thing we’re gonna do is, we’re just gonna write data in the same way that we wrote it to Parquet.
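
A minimal sketch of such a streaming count could look like this; the in-memory sink name and table path are hypothetical.

    # Minimal sketch: continuously count the rows of the Delta table.
    counts = (
        spark.readStream.format("delta").load("/tmp/loans_delta")
             .groupBy().count()
    )

    count_query = (
        counts.writeStream
              .format("memory")              # keep the running count queryable in-notebook
              .outputMode("complete")
              .queryName("loan_counts")
              .start()
    )

    # spark.sql("SELECT * FROM loan_counts").show()   # inspect the live count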

This is the same function that we called previously, it’s just using table format ‘delta’ now. Okay, and what we see here is that we get an analysis exception, all right? So Spark refuses to run this query, and why does it do that? Because the table schema doesn’t match, right? And it actually tells us the difference between the two schemas, right? It’s these two columns. Well, that’s good, right? So we just saw schema enforcement at work, right? It prevented us from appending incompatible data to the table. Well, this looks great.

Now, let’s fix the streaming query, all right? So now we know that we have to fix this query, because it’s incompatible. So, one thing we can do is we can just simply add a projection as the last statement in the query, which selects only the four columns that we’re interested in, yeah? So, let’s define this and run it again. So now, as this query is running, remember we started this query up there that is constantly counting how many records are in the table. So let’s look at the status of this thing, and, let’s see.
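
A rough sketch of that fix, building on the stream_df from the earlier generator sketch; the checkpoint path is made up.

    # Minimal sketch of the fix: project only the table's columns before writing to Delta.
    fixed_df = stream_df.select("loan_id", "funded_amnt", "paid_amnt", "addr_state")

    write_query = (
        fixed_df.writeStream
                .format("delta")
                .option("checkpointLocation", "/tmp/loans_delta_checkpoints/gen")
                .trigger(processingTime="10 seconds")
                .start("/tmp/loans_delta")
    )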

So this should now update the counts.

Oh, and there we go. Now we see there’s more records in the table, and the counts are actually going up, right? We didn’t start over at zero, like we did with Parquet. Oh, this is nice, I like it.

All right, so just for sanity, let’s also run the batch query, and yeah, we also see the larger count, right? The nice thing about Delta is you can have many concurrent jobs running against the same table. They will not conflict with each other, they don’t get in each other’s way, and they can be streaming and batch, you can mix them as you like. All right, so schema enforcement worked, and now we’re gonna stop the streams for this. And we’re gonna try to do some experimentation with schema evolution.

Okay, so before I do this, I just wanna remember how many commits I had in this table. So the last commit in this table, after I ran that last stream, was number seven. Okay. And we can see that, if we list, we see that in the Delta log there are actually exactly seven commits. Okay, so we’re gonna remember this number seven, and now again, we’re gonna run the streaming count to see the counts updating in real time. Now, let’s start this query again. So, actually, it failed, all right? Yeah, so we saw this, right? It failed, and let’s look again at the actual error. And so the error tells us the schemas are different, but it also gives us a hint here. It says, well, if you want to have schema migration, which is another term for schema evolution, use this option. So let’s try that. Let’s do that and modify our query to use this option, right? And so here, we’re just modifying this function to add one option, ‘mergeSchema’ equals true. And then we’re gonna try this again.

All right, so this query is initializing. It’s now running and let’s go up here to the streaming count and let’s see what’s happening here. So we’re counting. Oh, this query failed. And why did this fail now? Oh, it detected a schema change. So, it detected that there’s two additional columns in the table. This is interesting, right? So when the schema changes, we have to restart our streaming query because we can’t just dynamically change the schema while it’s running. And the error message actually tells us, “Hey, why don’t you try restarting the query? This will refresh the schema and then it should work again.” So, let’s try running this counter again.

And so when we start this as a new query, then it should work. So let’s look at this. And, yeah, so now it can count and we should see the counts going up. In the meantime, I’m also gonna do a batch query just to see that, yeah, that matches, right? The counts here go up and the counts here have also been higher. So we saw that schema evolution worked, right? The moment that I added ‘mergeSchema’ as an option, I was actually able to dynamically change the schema as the shape of my data changed. All right, so after we’ve seen this, I’m gonna stop my streams again. And by the way, if you run this on Community Edition, make sure you always stop the streams when you’re done with the notebook, because otherwise you might quickly run out of quota in your clusters.

All right, so now after we’ve done this schema evolution, let’s take a look at the history of this table and how it changed over time as the schema changed. So, we’ll look at all the files again. We see that there’s a bunch of new commits now, all right. So here, number seven, that was the last commit that I had before I started schema evolution, right. And we see there’s a two-minute gap here. So commit number eight, I would think, must be the first commit that has the schema change, right? And then from nine on, all these commits must have the new schema. So, we can also see this in the Delta table history. Delta actually has an API that lets you see the history. And here we can nicely display that. We can see all the commits and we can see their timestamps, we can see things like the query ID, the notebook ID. And the interesting thing that we can see here is, between commit number seven and commit number eight, the query ID is changing. And why is that? Because here I started writing with a new query. All right, and this is also when I started writing with a new schema.
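
The history API can be called roughly like this sketch, using the hypothetical table path from earlier.

    # Minimal sketch: inspect the table history via the DeltaTable API.
    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "/tmp/loans_delta")
    dt.history().select("version", "timestamp", "operation", "operationParameters") \
      .show(truncate=False)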

Okay, so now let’s take a closer look at the actual commit points that are in this transaction log, right? And, for this I’m just gonna, so I remembered earlier that the commit number before my schema changed, I think it was seven, right? And so, I’m just adding one for this to get the schema change itself and then everything after that, which is nine and greater, should have the new schema.

And now that I have this, I’m just gonna read these three commits, into the JSON format. And then we can look at these commits in detail. So first, let’s look at the commit before the schema change.

What we see here is it has a bunch of data changes, right? And if we look at an individual one, we can see it has here 32 records in this file and it has statistics for the four fields that were in the table before the schema change. All right? So that’s exactly what we expected. And now we look at the commit that made the schema change.

What does this look like? Okay, so this also has a data change, but wait, it has zero records. Like, why are there no records in this commit if it says there’s a data change? Well, the beef of this commit is actually in its metadata, right? And to see what really happened in this commit, we have to look at the metadata. And if we look at the metadata, we will see that this commit has a schema. And in this schema, we have the two new fields: we have the timestamp and we have the value. Right? So, what we see here is that when we do schema evolution, Delta Lake basically adds an empty commit point. All it does is change the schema, but it doesn’t really add data, all right. And now we can look at the next commit. And then we can look at the data that it added. And this now again has records in the files that it wrote.

And it also has statistics for the new fields. So, we can tell that, yes, this was writing with the new schema.

All right, so Delta did exactly what we wanted. It recorded the schema change at the right point in time in its transaction log. Now, from last week we know that we can do time travel using the transaction log, right? We can go back to previous versions of the table and see what its state was then, all right. And we remember a couple of commit versions, so let’s just look at three different points in time in this table. The first one will be the version when we initialized the table, that was the first time we wrote to it. The second one will be the version of the table just before the schema change. And the third one will be the latest version of the table. And if we count all the records at these three different times, we will see that, yeah, initially it was 14,705. At the time of the schema change we had more, and the current version has even more records in it. And that was exactly what we expected. The nice thing is we can actually also look at the data. And if we look at the data, even though the schema of the table has changed, right? So our schema now includes two additional fields. These two fields were not present in this commit, right? In c_before, that was number seven. So if we read this data now, we will still see the old data and we will see it with the old schema, right? So, this is nice, right? Because if we go back in history, we don’t just see the old data at that time, but we actually see it with the schema that it had at that time.
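
A minimal time-travel sketch, using the version numbers from this particular demo run (yours will differ) and the hypothetical table path from earlier.

    # Minimal sketch: read the table as of three different versions.
    v_init   = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/loans_delta")
    v_before = spark.read.format("delta").option("versionAsOf", 7).load("/tmp/loans_delta")
    v_now    = spark.read.format("delta").load("/tmp/loans_delta")

    print(v_init.count(), v_before.count(), v_now.count())
    v_before.printSchema()   # still shows the old four-column schema at that version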

Now we go to the time just before the schema changed.

And if we look at this data, it’s still... no, this is the time when the schema changed, right? This is exactly commit number eight. And what we see now is that we see the old data, right? This is exactly the same data that we saw before. But now, we see two additional columns. All the values here are null. And why is that? Because we haven’t written any data that actually has these columns, right? And so they default to the null value. And this is exactly what we expected. Now if we go to the latest version of this table, or even if we just go to the first version of the table right after the schema change, the commit after, then we will actually find records where the timestamp is not null. And we will see that here, you see here, and so now we have actual data with the new schema, where the new columns do not default to null.

Yeah, so this was all I wanted to show in this demo.

I’m gonna go back to my presentation, and conclude the presentation.

Delta Lake Connectors

So, one thing you might wonder is: if I now switch my storage format to Delta, I have all my existing applications, so will I still easily be able to use that data? And the good news is that there are plenty of tools that have connectors for Delta Lake. So if you are using Hive or Spark or Presto, even if you’re in the cloud, on Redshift, Athena, Snowflake, there’s a multitude of tools that have connectors for Delta and can actually operate on Delta data.

Delta Lake Partners and Providers

There’s also a lot of partners that will help you or allow you to use Delta in their tools, right? And so there’s, a lot more than these but what stands out here is we have a lot of analytics companies.

We have data ingestion companies

and we also have the big cloud vendors, right? Google Dataproc supports this, Azure Synapse supports this. And so, pretty much no matter where you are, you will find partners who will support you in using Delta. And if that hasn’t convinced you, there is a bunch of companies that are already using Delta.

Users of Delta Lake

And it’s a lot more than the ones shown here. This is just a little sample. So if you wanna join the club and also start using Delta, then ask yourself this question. How do I use it?

Get Started with Delta using Spark API

Well, it’s very easy. So if you have existing Spark jobs or Spark notebooks or PySpark notebooks, all you need to do is you need to add Delta to your packages or to your dependencies. So either you do that on the command line for PySpark or the Spark shell or you do that in Maven if you have a Maven job, and the moment you’ve done that, you can just switch anything that previously was parquet. You can now use format delta there. That is the only line of code that you need to change. And everything else will just work.
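
In code, the switch looks roughly like this; the package coordinates below are only an example and should match your Spark and Scala versions, and the path is hypothetical.

    # Launch with the Delta package, e.g.:
    #   pyspark --packages io.delta:delta-core_2.12:0.6.1
    # or add io.delta:delta-core as a Maven dependency for JVM jobs.

    # Then the only code change is the format string:
    df = spark.read.format("delta").load("/delta/events")         # was: .format("parquet")
    df.write.format("delta").mode("append").save("/delta/events")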

Subscribe Today!

If you wanna learn more, this is part of a whole series of talks and I encourage you to subscribe and visit us again next time. The URL is down here, and if you wanna become part of the community and build your own Delta Lake, go to https://delta.io. This is where the open source community lives.

And with that I wanna open it up to questions. – Perfect. Thanks Andreas. That was an excellent presentation. So we have about 10 minutes or so to answer questions. So I’m gonna start with answering some of the Q&A here to give Andreas a little time to breathe and rest, and meanwhile he can also look over the Q&A questions. Or if you’d like, I can ask the various questions to you, whichever way you’d like to do it. But I will start with a quick question, which is, one person had asked: is schema evolution available in open source Delta Lake, and is there any difference between the Delta Lake in Databricks versus the open source Delta Lake? So as a quick call out, while we did this presentation in Databricks, everything we’re showing is actually available in open source. There is actually no difference, or I shouldn’t say no difference, but the goal is that, as part of Delta Lake 1.0, there will be no difference between the open source and the managed Delta Lake, the one that’s in Databricks, because all of the APIs and those features are actually gonna be exactly the same, okay? That’s absolutely the goal. The quick call out is that there are some potential differences for Databricks itself in terms of the management of it. But in terms of the actual functionality and the API, no, there actually is no difference. And related to this question, there was a question concerning why we didn’t do this using the SQL syntax. That’s part of the reason why: because once Spark 3.0 becomes available, we can show you running all of this using the SQL syntax instead of the API syntax. It’s just that we’re dependent on some of the features that are specific to Spark 3.0 in order to be able to support that, okay? So I just want to call that out. So, Andreas, would you like me to ask you some questions, or are there any questions inside here in the Q&A panel that you particularly like? (Denny chuckles) – Yeah, I was just struggling with actually seeing the questions, because while I’m presenting, I don’t see them. (Andreas laughs) – Oh, that’s right, that’s right. Okay, no problem. So then, you know what, I will ask you some questions and you can just tell me if you’d like me to answer them, or if you wanna go ahead and answer them. How’s that? – Sure. – All right. So the quick call out is, if I understood this correctly: are enforcement and evolution in this case mutually exclusive?

– They are not mutually exclusive. It is on a job-to-job basis. So if your job in Spark uses the option mergeSchema or overwriteSchema, then that job will be allowed to change the schema. All right, other jobs where you don’t give that option might not be allowed to do that, right? And those jobs will fail if they try to do that. – Cool. Perfect, and then another question, which is technically sort of answered, but I figured it’d be good just to call this out, right? Does Delta Lake have a way to track the lineage? And specifically, to know how the schema was stored in relation to the data.

– Yeah, I mean the lineage is implicit in each of the Delta Lake commits in the transaction log, right. When we looked at this JSON structure, we could see that every commit has lineage in it. It can tell you: what was the notebook ID? What was the Spark job ID? What was the user ID? What was the email of the user who did this? Right, so when you look at the commit that performed the schema change, you can find all this information. – Exactly. And just to add to your point, don’t forget that when you’re working with Parquet, the Parquet itself contains the schema per se. But the problem, and this is the reason why we have schema evolution and schema enforcement, is that while Parquet can do it, the reality is that things can change over time. So you need something that actually has a transaction log that contains all of the potential changes. That way we have an enforcement capability, and that way we can evolve. And exactly to the point of lineage, we can see within the JSON itself, in the metadata column: oh, okay, version zero in this particular case, this is where we went ahead and had the initial version with four columns, and by version, was it seven or eight (Denny chuckles), that one is where the switch was made. And now it has six columns. So you can actually see it within the transaction log itself, basically, okay. A fun one for you Andreas, related to time travel: is it possible to roll back to a certain point in history? – It absolutely is, yes. You just need to go back to that commit. You can read at any point in time.

– Absolutely. Cool. Let’s see, here is another one because we actually have a lot of them, which is basically supporting, talk about supporting schema stuff. So I think you answered most of the questions, so I’m just trying to filter them out. (Denny and Andreas laugh) All right. I think we already answered this one, but again, I think it’s a good call out. In schema validation, is a mismatch of the nullable field allowed?

– So a mismatch of a nullable field is not allowed. If you try to write data that doesn’t contain that field, then Delta Lake will throw an exception, because it enforces that non-nullable fields are always there. If the field is nullable, then that’s relaxing this restriction, all right. And that means you can write data that doesn’t have these columns.

– Cool. Then related to that as well: because we’ve been talking about schemas this entire time, does Delta Lake support a schema-less mode or is it always running with a schema? And I think you sort of answered that already with the first commit, but I figured we’d call this out. – Yes, Delta always has a schema; Delta is a schema-on-write storage format.

– Perfect, okay. All right, now related to errors with Delta Lake: if, for example, you’re reading from a message queue or you’re writing to it and Delta has an error, does it automatically replay it or do you actually have to handle the missing message? Now, this is more of an unpacking-the-transaction-log question versus a schema question, but I figured it’s still a good call out, and also it allows us to shamelessly call out that you should go to the playlist (Denny chuckles) and review the previous week’s unpacking the transaction log session where we dive into it. But I figured at least I’d call that out right now, so. – Yeah, that’s right. So Delta will not automatically replay this, right? It’ll throw an exception and your job will fail. But the nice thing about Delta is that any of this data that you attempted to write, it’s either all there or it’s not there at all, right? So that means your write is atomic. And so your data will always be consistent. You won’t have phantom writes from failed transactions or anything like that. – Exactly. So just to add to Andreas’ point here.

Delta Lake has an optimistic concurrency model, concurrency control model, excuse me. So if it’s able to allow the write to happen, it will let it happen, right? So in other words, for example, if two threads are trying to write at the exact same time, there are checks to say, okay, well, you’re trying to write at the same time to the same table, but actually it’s okay because they’re going to two different partitions. Okay, then there are actually errors that it will automatically retry for you right away, and it will take care of it. That’s the optimistic concurrency control. But exactly to Andreas’ point, if there’s an error in which, basically, no, it really is an error, there’s no way for us to go ahead and process it automatically, it’ll go ahead and actually error out. And most importantly it will not leave any partially written files. So exactly to Andreas’ point about atomicity: it’s either written or it isn’t. So you can trust your data. This goes back to our theme of reliable Delta Lakes, okay. Cool, let’s see, I think we’ve got time for probably one more question. So actually a good one that, as I start talking about this, might go a little long, so we’ll have to make sure you and I both don’t go too long on this one. Schema enforcement and evolution usually come at a performance cost. What’s our take on this?

– Oh, why don’t you take that, Denny. – Oh, sure okay, nice one. So in the context of evolution and enforcement, right, from a performance perspective, it’s actually about what’s being written to that transaction log, right? So the fact that, number one, you actually have a transaction log, there is a little bit of overhead for doing that. Number two, because we’re actually going in here and writing to disk, the issue that you have, of course, is that, as you can see with the transaction log, or the fact that we can have different versions, every single version of the data with its different schemas is actually written. So in other words, I believe, Andreas, we had 17 versions by the time the demo was done, right? So you actually have, technically, seventeen versions of the data in there, okay. Now, in this case it was relatively simplistic, because it was mostly additions. We didn’t do any updates, we didn’t do any deletes. So in this particular scenario it probably wasn’t that much of a performance hit, because of the fact that we’re just simply adding two new columns. And that means all the previous data was just nullable. So yeah, it’s really not much of an impact. But how about if you were instead to do updates to that data, or deletes, right? There, what is implied very strongly is that there are that many more versions of the data and those versions are much larger. So the potential performance impact is that you have that many more files. Now, you can certainly run compaction to reduce the number of files, but the reality is you still have more data, okay. And so that’s ultimately what the performance impact is. So it’s less specific to schema enforcement and evolution per se, outside the transaction overhead, and more the fact that you have that many more files to work with. That’s where the real performance hit’s gonna be, okay? – And just to add to that, one important distinction here is whether you use mergeSchema or overwriteSchema. Because mergeSchema basically says that your old data is still read-compatible with the new schema. So you don’t need to rewrite any of the old data, right? You just keep it in the table, and when you read that data, every time, you can easily morph it into the new schema. So that just happens on the fly and there’s almost no performance penalty. Whereas if you do an overwriteSchema, then you’re basically forced to rewrite, to make a new copy of all your existing data, because otherwise you would not be able to read it back. And that’s a huge performance penalty. And this is why typically the recommendation is to do read-compatible schema evolution, because it’s just operationally easier to handle.

– Perfect. Okay, well, you know what? I wanna be cognizant of time, so I apologize for any questions that we could not answer, but I hope you definitely enjoyed today’s session, and we wanna thank Andreas. But let’s hand it back to Karen, so she can end the show. – I wanna thank everybody for attending. – Yeah, thanks so much Andreas, a great presentation. Thanks Denny as well for joining us. And thanks everyone out there for joining us. Just a quick call out: I posted the YouTube links to subscribe, and then also to join our online meetup group. So I think those are the best places for you to get notified on all of our upcoming tech talks and all the good stuff we have planned coming up in the next few months. So thanks everyone again for joining.