Who knew time travel could be possible!
While you can use the features of Delta Lake, what is actually happening under the covers? We will walk you through the concepts of ACID transactions, the Delta time machine, and the transaction protocol, and explain how Delta brings reliability to data lakes. Organizations can finally standardize on a clean, centralized, versioned big data repository in their own cloud storage for analytics.
Speakers: Vini Jaiswal and Burak Yavuz
– Hi, everybody. We are here to talk about time travel. Time travel is a very sci-fi topic, and there are a lot of stories and theories about it. So Burak and I today are gonna walk you through time travel and how it is made possible with data. I am Vini Jaiswal, customer success engineer at Databricks. I work with customers across different industries to help them with data and AI and machine learning use cases. My background in data science includes working as a lead data science engineer at Citigroup and Southwest Airlines, and I have a bachelor's from UT Dallas.
– Hello, everyone, I'm Burak. I'm a software engineer at Databricks on the streams team. Our team motto is "we make your streams come true," and we work on Structured Streaming and Delta Lake. I'm also a Spark committer and happy to be here. All right, so here's our agenda for today. First of all, we'll start with an intro to time travel, then we're going to talk about some time travel use cases, such as data archiving and rollbacks, then we'll explain how Delta works and how Delta performs time travel, and then Vini is going to take us on a ride on the time machine and give us a demo of how Delta works. So to start off, let's introduce what time travel is. The essence of time travel is that you have your data changing over time, and you want to see an older, historical view of your data. Here we have an example of average temperatures across the globe, and you can imagine that you could time travel back to certain years to see what the average was, to answer this question. The kinds of use cases that we're going to look at today are around data archiving, governance, rollbacks, and the ability to reproduce machine learning experiments, so let's go and talk about these real quick. What are data archiving use cases? Well, you may need to, for regulatory reasons, store all the changes to your data for many, many years, like seven-plus years, and for those regulatory reasons, you might need to be able to go back in time and prove why you took a certain action. The only way you can do that is if you can show not just the latest version of your data, but a historical version of your data and how you got to that answer. So that's the data archiving use case. Around governance, our data pipelines in the industry are getting more and more complicated.
People are joining data from many, many data sources, and the governance aspect here is that, one, you need to know what kind of changes are being made to your tables, from where and how. In case you have data quality issues, you might want to go back and understand which data source actually caused a certain problem at a given date. This can help you track the lineage of your operations, as well as understand, okay, this day I had this sort of data and it was processed in some incorrect way and added to my table. On top of that, with GDPR, for example, you might also have to prove that you actually respect data subject requests. An auditor may come over and say, hey, did you delete the data for this person, and you have to be able to show, here you go, this is when I deleted it, and as you can see, at that given time, we had no information about this user whatsoever. The rollbacks use case again relates to the governance use case: I need to be able to track which data set caused which issue and understand what kind of error might have happened. For example, you might deploy some accidental code that overwrites your entire table, and all your historic data is gone. In this case, we want to be able to roll back to a historical view of the table, or essentially restore the previous version or undo the operation that we just did, to go back to the time where my data is actually correct, so we're going to talk about that as well. And last but not least, machine learning experiments are pretty tricky, especially in a world where data moves pretty fast.
Essentially, you might want a fair comparison of your machine learning models and experiments, and the way to achieve this in a fast-moving world is to take a snapshot of your data and run a set of experiments across that snapshot so that you have a fair comparison. Time travel allows you to take that snapshot and focus your work on it. So, given these use cases, how can Delta help? To start off, we're going to talk real quick about how the transaction log works and how time travel works within Delta. We had a deep dive on this on Wednesday, yesterday at the same hour, so I urge you to check that out if you haven't already. If you have, this next bit is going to seem pretty familiar. So, real quick, how does Delta work? First of all, within Delta, we have a directory where we store our table. Within that directory, you'll find the _delta_log directory, and within that, a list of commits that are being made, in the form of JSON files. What these files do is commit the changes that are being made to your table at a given version. After that, you can also find partitioned directories, optionally, where you store your data, and you might also find your data files. Let's go over how Delta provides serializability as well as snapshot isolation; the trick to time travel is the fact that Delta can give you snapshot isolation during reads. So how does Delta compute a state? When you first query a Delta table, Delta won't have any cached state in the memory of your execution engine, so what it will do is list the transaction log directory, the _delta_log directory, submitting a request such as "list from version zero." In that case, we might find a set of JSON files.
What we'll do is use Spark to read all these JSON files and process the actions within them in order to compute a state, and we'll cache that version in memory. So what do I mean by the actions that we're computing? Within these files, you'll find a set of actions. You might find actions such as updating the metadata of the table; the metadata can consist of the schema of your table, the descriptions or comments that you specify on columns, the partitioning of the table, and so on. You'll also see actions that specify which files have been committed to the table. These add-file actions will contain the path information, the size information, and the partition value information, with optional statistics about the data stored within that file. Whenever you compact files or run a delete operation, you'll see remove-file actions that essentially say, okay, do not read this file anymore, it's invalid. We keep these remove-file actions in the state as tombstones in order to not delete the files too early, for example within a vacuum operation, and that's pretty critical to how time travel works. Then we have our set-transaction actions. Those come up when you use Structured Streaming: Structured Streaming can specify which version or which micro-batch it has already committed to Delta, and that helps it keep its exactly-once semantics. And then we also have the protocol version. Since Delta is essentially a transaction log stored on a remote object store, the library, the client that's actually communicating with the transaction log, needs to know whether it can understand the transaction log or not, and it learns this through the protocol action. So, going back: once we have actually computed version seven of our table, how do we update the state from there?
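Before answering that, the action replay just described can be sketched in pure Python. This is a conceptual model of the protocol's add/remove/metaData actions, not the actual Delta Lake implementation:

```python
# Illustrative sketch (not Delta Lake's real code) of replaying
# transaction-log actions to compute a table's state.

def compute_state(commits):
    """Replay ordered commit actions into a snapshot: live files,
    tombstones for removed files, and the latest metadata."""
    files = {}        # path -> add-file action
    tombstones = {}   # path -> remove-file action (kept so vacuum
                      # does not delete the file too early)
    metadata = None
    for commit in commits:
        for action in commit:
            kind = action["action"]
            if kind == "add":
                files[action["path"]] = action
                tombstones.pop(action["path"], None)
            elif kind == "remove":
                tombstones[action["path"]] = files.pop(action["path"], action)
            elif kind == "metaData":
                metadata = action
    return {"files": sorted(files),
            "tombstones": sorted(tombstones),
            "metadata": metadata}

# Version 0 adds two files; version 1 compacts them into one.
log = [
    [{"action": "metaData", "schema": "id INT"},
     {"action": "add", "path": "part-000.parquet"},
     {"action": "add", "path": "part-001.parquet"}],
    [{"action": "add", "path": "part-002.parquet"},
     {"action": "remove", "path": "part-000.parquet"},
     {"action": "remove", "path": "part-001.parquet"}],
]
state = compute_state(log)
```

After the compaction commit, only the compacted file is live, while the two small files survive as tombstones, which is exactly what makes time travel back to version 0 possible.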
Essentially, since we were at version seven, we will list from version zero, and again we'll find this list of files, and here you can see that we have a checkpoint file that showed up. Checkpoints essentially store the aggregated state of your table up to a given version. So here, this 10.checkpoint will contain the final state of the table up to version 10. What Spark can do after that is just read that checkpoint and use all the Delta files after it, versions 11 and 12, to compute and cache version 12 for further queries. The next time it needs to update, since it knows that the latest checkpoint that was used was version 10, it can actually list from version 10, see the new files, and compute and cache the state for version 14. So how does time travel work with this? Essentially, what time travel does is go back and compute a previous state of the table, and it can answer a query based on that version. You can time travel Delta tables by version or by timestamp, and to time travel by version you can use different syntaxes: for example, the SQL syntax VERSION AS OF 1071 in Databricks Runtime, or @v1071, which you can append to your table identifier; or, in Scala and Python, you can specify the DataFrameReader option versionAsOf, or you can use the @ syntax on top of your paths. This @ syntax is pretty helpful when used together with MLflow, where you want to use the same code but change which version of the table you're gonna run the experiments against; you can do that with the @ syntax pretty easily.
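The checkpoint shortcut just described can be modeled with a small helper. This sketches only the planning step (which checkpoint to read and which commits to replay); the real client then actually reads and merges those files:

```python
def plan_log_reads(log_versions, checkpoint_versions, target_version):
    """Return (checkpoint_to_read, commit_versions_to_replay) for computing
    the snapshot at target_version. A checkpoint aggregates all state up to
    its version, so only later JSON commits need to be replayed."""
    usable = [c for c in checkpoint_versions if c <= target_version]
    start = max(usable) if usable else None
    lower = -1 if start is None else start
    commits = [v for v in log_versions if lower < v <= target_version]
    return start, commits

# Commits 0..12 with a checkpoint at version 10: computing version 12
# needs only the 10.checkpoint plus commits 11 and 12.
ckpt, commits = plan_log_reads(list(range(13)), [10], 12)
```

With no checkpoint available yet, the same helper falls back to replaying every commit from version zero, which is the expensive path checkpoints exist to avoid.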
Given a version number, Delta's job is simple enough: okay, let me compute the snapshot at version 1071, and we're good. However, when you're time traveling by timestamp, we have similar syntax for supporting timestamps as well. Within the @ syntax, we just have to use a year-month-day-hour-minute-second-milliseconds format to specify which time you want to travel to. In this case, we can't just directly get the snapshot at 1071, because we're given a timestamp, so how do we figure out which version that timestamp corresponds to? For that, we rely on the modification timestamps stamped by the storage system: whenever you commit a transaction, the storage system assigns a modification timestamp to the commit file, and we use those timestamps to decide where to go. But when you're using distributed systems such as Amazon S3 or Azure Data Lake Storage, you can actually observe clock skew; these systems do not guarantee monotonically increasing timestamps. Here, for example, these are dates important to Turkish history, and you can see that the latest version's timestamp is actually behind the previous version's. We can't accept that, so how do we fix it? If timestamps are out of order, we just set the next version's timestamp to one millisecond after the previous version's. So here, this 1920 will actually become 1923, just one millisecond later. And when you specify your timestamp, the way we pick the version is essentially "Price is Right" rules: we'll pick the closest commit with a timestamp that doesn't exceed the user's timestamp.
So here we had given 1492 as our timestamp; the latest version with a timestamp earlier than the provided one is 1453, so we'll select version 1071, and then it's an easy computation: we'll probably find the checkpoint at 1070 and just replay 1071 on top of it to answer your time travel query. So, given that's how Delta works, let's go back to the use cases and see how Delta can solve these problems. The data archiving problem is pretty interesting and I wanna dig into it in a little detail. Since you need to store the changes to your data and you need them to be retrievable, one question I would ask regarding your architecture is: should you be storing your changes, or just the latest snapshot of your table? And if you have to store many, many years of data, how do you make that cost-efficient? Let's look into that real quick. If we go back to our example with average temperatures across the globe every year, is this really a time travel problem? That's what we should ask ourselves. Do I actually have to go back to the state my table was in at the end of 1972 to figure out what the average temperature was at that time? Is this really a time travel problem where I have to take the state back tens of years? Well, it might actually be better to save data by year and query with a predicate instead of using time travel. You could use time travel too, that's totally fine, but is it the best way? I would argue, for storage reasons, it might not be. So let's take a look at this in a little more detail. To introduce the latest-snapshot concept, I'm going to introduce slowly changing dimensions.
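The timestamp adjustment and commit selection described above can be sketched in Python. This is a conceptual model of the two rules, not Delta's actual code:

```python
def monotonic_timestamps(commit_ts):
    """Object stores don't guarantee monotonically increasing file
    timestamps, so any out-of-order commit timestamp is bumped to one
    millisecond after its predecessor."""
    adjusted = []
    for ts in commit_ts:
        if adjusted and ts <= adjusted[-1]:
            ts = adjusted[-1] + 1  # one millisecond later
        adjusted.append(ts)
    return adjusted

def version_for_timestamp(adjusted_ts, query_ts):
    """'Price is Right' rules: pick the latest commit whose timestamp
    does not exceed the user's timestamp."""
    candidates = [v for v, ts in enumerate(adjusted_ts) if ts <= query_ts]
    if not candidates:
        raise ValueError("timestamp precedes the earliest commit")
    return max(candidates)

# The third commit's clock-skewed timestamp (1920) is behind its
# predecessor (1923), so it gets adjusted forward; a query at 1930
# then resolves to that commit.
adjusted = monotonic_timestamps([1453, 1923, 1920])
version = version_for_timestamp(adjusted, 1930)
```

Treating the talk's historical dates as millisecond timestamps, 1920 becomes 1924 (one tick after 1923), and a query at 1500 resolves to the first commit, the latest one not exceeding it.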
This is a very popular technique. Your table might contain information about, for example, your users and their addresses, where they have been living for a given time. They might move and transfer their internet service to their new address, and you might want to keep track of these changes. You might want to go back and ask: where were they five years ago? Am I sending my mail to the right address? So here we have the example of a famous Swedish soccer player, Henrik Larsson. Most of the columns do not change for this user, but the city that he was living in did change, from Helsingborg to Barcelona, and you might want to update the information about when this record was last updated as well. If you wanted to go and ask, well, where was he living in 2015? Since you only store the latest state of your table, you really do need time travel; but is this the ideal way to store data for this use case? The problem here is that users typically face a trade-off between data recency, query performance, and storage costs. To always get the most recent data, to always see fresh data, you would want to perform many frequent updates to your table; you might want to stream your updates, for example. If you would like better query performance, then you would have to regularly compact your data, because if you're trying to keep updates as fresh as possible, you're probably doing small writes, small changes to your data set.
If you're doing small changes, then over time you're just going to have very, very small files in these object stores, maybe five or ten records each, and running aggregate queries on top of these files will not be pleasant, so you would have to regularly compact the data. What happens when you combine both of these operations is that, with Delta, you end up with many copies of your data, because in Delta, files are immutable. When we're compacting our data, we just add remove-file actions to the transaction log for the files we're removing, and then add one add-file action for the one file that has compacted all the information. So in order to time travel, I would actually need to go back to the state where I read the file that contains the ten records, which is not desirable, because many copies of that data lead to prohibitive storage costs. This is because of how snapshot isolation works with Delta: we need to keep the oldest, original files around to be able to time travel. So what can you do instead? With slowly changing dimensions, there is a type 2 pattern, essentially a framework where, instead of keeping only the latest snapshot of your data, you insert a row for each change that you want to apply, and as you make these changes, you also update the previous record to say that it's no longer the latest information. In our example with Henrik Larsson, we still keep the Helsingborg row that was last updated in 2012, but we update its is-latest value from yes to no, and then we insert the new row, whose is-latest value is yes. So if I actually want to query my latest data, it's very simple.
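A minimal sketch of the SCD type 2 pattern, using plain Python dicts in place of Delta tables (the column names is_latest and last_updated are illustrative, not from the talk's schema):

```python
def scd2_update(rows, key, new_row):
    """SCD type 2: instead of overwriting, flip the previous current
    row's is_latest flag to False and append the new row as current."""
    for row in rows:
        if row["key"] == key and row["is_latest"]:
            row["is_latest"] = False
    rows.append({**new_row, "key": key, "is_latest": True})
    return rows

def as_of(rows, key, year):
    """Answer 'where was this user in <year>?' with a plain filter
    instead of time travel: the latest change made on or before it."""
    history = [r for r in rows
               if r["key"] == key and r["last_updated"] <= year]
    return max(history, key=lambda r: r["last_updated"])

# Henrik's move from Helsingborg (2012) to Barcelona (2018).
table = [{"key": "henrik", "city": "Helsingborg",
          "last_updated": 2012, "is_latest": True}]
scd2_update(table, "henrik", {"city": "Barcelona", "last_updated": 2018})
```

Asking where he lived in 2015 now needs no historical snapshot at all; the answer is sitting in the current table.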
I can just create a view that filters on the is-latest flag, and that will give me all the latest information for a given user. On the other hand, if I actually needed to see where they were living in 2015, all I have to do is write a WHERE query, where the last-updated value is less than or equal to the filter that I want to provide. This makes it very simple to access all the changes being made and to understand how your data looked at a given point in time, instead of having to do time travel. Regarding our governance use case, we talked about how we want to understand what changes are being made to the table, when, and how. Delta provides a DESCRIBE HISTORY command, and within our commits we store information about each one: which version it is, when it was made, what kind of operation made the change, and the operation parameters that you provided. You can add additional user metadata to this as well. For example, you might specify, hey, this is a deletion performed for this particular GDPR request, putting in your request ID, and then when a regulator or auditor comes and asks how you are doing your GDPR deletions, you can filter this history to prove to them that you are, in fact, adhering to the regulations. So DESCRIBE HISTORY is a very powerful tool: one, to keep track of all the changes being made to your table, and two, to keep additional metadata about those changes. How about rollbacks? We talked about how mistakes can happen, and when they do, it can be pretty costly to undo them.
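As a sketch of that audit workflow, assuming commit records shaped like DESCRIBE HISTORY output with a userMetadata field (the field exists in Delta; the record layout and request ID here are simplified and hypothetical):

```python
def audit_deletions(history, request_id):
    """Filter commit-history records down to the DELETE operations
    tagged with a given GDPR request ID in userMetadata."""
    return [c for c in history
            if c["operation"] == "DELETE"
            and request_id in c.get("userMetadata", "")]

# A toy commit history with one tagged deletion.
history = [
    {"version": 40, "operation": "WRITE", "userMetadata": ""},
    {"version": 41, "operation": "DELETE",
     "userMetadata": "gdpr-request-7781"},
    {"version": 42, "operation": "MERGE", "userMetadata": ""},
]
matches = audit_deletions(history, "gdpr-request-7781")
```

Showing an auditor exactly which commit performed a deletion reduces to this kind of filter over the preserved history.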
Well, I'm very happy to tell you that with Databricks Runtime 7.4, rolling back Delta tables is as simple as a SQL query, and a pretty quick SQL query at that. We have introduced a RESTORE command, which allows you to restore your table to a given point in time, and this is only a metadata operation: you're not duplicating or copying any data, as long as the data files themselves haven't been vacuumed. As long as they haven't been deleted or changed, we can restore, and that's very powerful. If they have been vacuumed, the RESTORE command tracks whether those files have been deleted and will tell you, hey, I cannot actually restore to this version because these files have been deleted. There is a flag to still be able to restore past those missing files if something disastrous has happened, so you get full control over how you want to perform the restore. And then we talked about streaming pipelines: say new data gets introduced and you notice that you've been missing some of it for a couple of days. That's actually pretty simple too. You could restore your target table to the timestamp where the change was introduced, and then, within your streaming pipeline, Delta tables let you provide a starting timestamp to begin processing the changes from. So if your data started getting, say, a new column and it was missed in your pipeline, you can start exactly from when it was introduced, and as long as you start from a new checkpoint location with your stream, you can continue your stream from exactly where it left off, fix your mistakes, and backfill the results that you've been missing.
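A toy model of why RESTORE is metadata-only: it commits a brand-new version whose add and remove actions bring the live file set back to the target version, and refuses if required files were vacuumed. This is a conceptual sketch, not the actual implementation:

```python
def restore(log, target_version, vacuumed=frozenset()):
    """Append a commit that restores the live file set of target_version.
    No data files are copied; only log metadata is written."""
    def files_at(version):
        live = set()
        for commit in log[: version + 1]:
            live |= set(commit.get("add", []))
            live -= set(commit.get("remove", []))
        return live

    wanted = files_at(target_version)
    current = files_at(len(log) - 1)
    missing = wanted & vacuumed
    if missing:
        raise RuntimeError(f"cannot restore, files vacuumed: {sorted(missing)}")
    # The restore itself is just one more commit in the log.
    log.append({"add": sorted(wanted - current),
                "remove": sorted(current - wanted)})
    return log

# Version 1 accidentally overwrote the table; restoring to version 0
# only writes a new commit re-adding the original files.
log = [{"add": ["a.parquet", "b.parquet"]},
       {"add": ["oops.parquet"], "remove": ["a.parquet", "b.parquet"]}]
restore(log, 0)
```

Note that the restored state is a new version on top of the history, so the mistake itself remains auditable via the log.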
For reproducing machine learning experiments, you can use time travel to ensure that all experiments run on the same snapshot of the table, by providing the VERSION AS OF or @ syntax. When you're done and you think, I actually want to keep this snapshot of my data set, because next year, when I develop a model for my Black Friday sales, I want to reuse this same table and check my results, you can use CLONE, a Delta-specific operation that clones a specific snapshot of your table to a new location. That way you can archive it and reuse it when you really want to. As we mentioned, because of the trade-off between keeping fresh data around and not blowing up your storage costs, it's typically easier and cheaper to clone your table to a new location than to keep a whole year's worth of history just for machine learning reasons. MLflow also integrates very nicely with Delta tables: MLflow's autologging feature will keep track of which version of the Delta table was used when running a given experiment, and that really helps you trace which versions you should use to get apples-to-apples comparisons of your machine learning algorithms. And, as a bonus, time travel also allows you to perform time series analytics. It's a bit clunky, to be honest, but if you want to run a very simple query like, hey, how many new users do I have since yesterday or since last week, it's very easy to write a count-distinct-user-ID query where you embed your time travel query within a subquery and get your results. So now I'm going to pass it off to Vini, and she's going to take you on a ride.
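That "new users since yesterday" query reduces to a set difference between two snapshots. Here is a toy model in plain Python, with the hypothetical Spark SQL shape (table name and timestamp invented for illustration) shown in a comment:

```python
def new_users_since(current_rows, snapshot_rows):
    """Count distinct user IDs present now but absent from an earlier
    time-travel snapshot of the same table."""
    return len({r["user_id"] for r in current_rows}
               - {r["user_id"] for r in snapshot_rows})

# Roughly the idea of a query like (hypothetical names):
#   SELECT count(DISTINCT user_id) FROM events
#   WHERE user_id NOT IN (
#     SELECT user_id FROM events TIMESTAMP AS OF '2021-06-01')
yesterday = [{"user_id": 1}, {"user_id": 2}]
today = [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}, {"user_id": 3}]
```

With one genuinely new user (appearing twice today), the distinct count of new users is one.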
– Thanks, Burak, for the awesome presentation. We learned the theory and concepts about time travel. Being fascinated with the time travel concept myself, I'm excited to take you through the ride. It is uncertain whether time travel to the past is physically possible, but I will show you how it is done with data. To understand the whole time travel concept, our session will take you through what the time machine is; of course, you need a vehicle to time travel, right? I will take you through the actual timeline, and then we will look at different methods to do time travel. I am using a Databricks notebook here, but you can use Community Edition or a notebook of your choice, because Delta is open source, and I will have this notebook available for you in the GitHub repo to reference. So let's explore Delta's time machine. First, we will look at what it is made of and understand the nuts and bolts of Delta; this is what's required to make a future-proof Delta repo. We will go through the Delta log directory, which includes the logs, or transaction protocol, snapshots, et cetera, and then we will look at the history. As a data scientist or data engineer, the first thing I want to make sure of is that I have a good understanding of what the data is, its schema, et cetera. So, here I'm going to use %fs ls to explore the contents. customer_t2 is my table, which is randomly generated TPC-DS customer data stored in Delta. Let's use this command. I'm running Databricks Runtime 7.4 here; in a Databricks notebook, you need to have a cluster running to perform any operations, so I'm gonna use the compute power from there. As it runs, it is actually fetching the records behind the scenes. And now, let's explore the data, because as you know, Delta uses Parquet under the hood and adds a transaction layer on top. As we can see from the results, it contains the _delta_log directory, which holds all the transactions, and then we can see the snappy Parquet files.
There are a lot of part files, because of Delta's properties, right? Now, let's look at the naming convention. The reference implementation by default uses a globally unique identifier in the file name to ensure uniqueness: data files must be uniquely named and must not be overwritten. Awesome, so now let's explore the data. As I said, as a data engineer or a scientist, the first thing I would want to know is what kind of data I'm dealing with, so I really want to know the schema, the metadata, and just a sneak peek of the table details. Because we have the amazing leverage of Spark APIs, I can just run a SQL query here, give the Delta path, and explore the data schema; that's what I'm doing here. Let it run. While it's running: my table contains a sample customer table with 65 million rows, amounting to about three gigabytes of data. So, it finished running. My customer table contains customer data, such as customer birth information, customer name, customer gender, et cetera, and I also have transaction information. Now, let's look at the metadata of the table. All the metadata for a Delta table is stored alongside the data, due to its self-describing property. Delta Lake also offers rich features for exploring table metadata, so you can see what format your data is in, the ID, the location where it is stored, when it was created, when it was last modified, the partition columns, which are very important for benchmarking, the number of files, the size in bytes, et cetera. This information is really critical for understanding how big the data you are dealing with is and how it is partitioned. Next I have the DESCRIBE TABLE command here, and the reason I have that is that for any operations on your data, you want to ensure that your columns are in the right format.
So, as you can see, the results of this query return the column name and data type of each column, which can be helpful for analysis. So far, with me? Awesome, I hope I didn't lose you. The key piece that makes time travel possible in Delta's world is its transaction log. The transaction log is an ordered record of every transaction that has ever been performed on a Delta table since its inception. It provides a step-by-step instruction guide detailing exactly how to get from the table's original state to its current state. Therefore, we can recreate the state of the table at any point in time by starting from the original state and processing only the commits made up to that point. This powerful ability is known as time travel, or data versioning, and it can be a lifesaver in any number of situations, as we will discuss shortly. So, let's look at the contents of the log directory now. I'm using the %fs ls command here and giving the DBFS path. DBFS is just the Databricks File System, an abstraction layer on top of S3, and as you can see, the log directory is located at the root of my customer table. Exploring the records, we have path information, name information, and the size of each file. You may also notice optimization entries: if we perform any type of optimization underneath, Delta records it as well. Then we have checkpoint, CRC, and JSON files. Let's look at what each of these file types means. Each commit is written out as a JSON file containing an atomic set of actions that should be applied to the previous state of the table, n-1.json, in order to reconstruct the nth snapshot of the table. You might be lost there; what this means is that if you want to reconstruct the snapshot for transaction 12, all the actions listed need to be applied to snapshot 11. So let's look at what those actions are.
As you can see, this output lists what the JSON file contains: the commit info with timestamp, user ID, user name, the operation that was performed, and all these nice details. It can also show you operation metrics, which we will look at later, and then the list of actions, the file paths, et cetera. Next up are the CRC files. CRC is the file extension for a cyclic redundancy check file; you can use a checksum to check files or other data for errors. Let's run both of these; I'm using 11.crc and 12.crc so that I can show you a comparison. A file might not have been properly downloaded due to network errors, or hard drive problems could have caused corruption on disk, so you can use this checksum file: you can see the total number of files and total byte size, and compare them across files to see where a loss happened. That's where it's helpful. Now, checkpoints. Delta Lake writes a checkpoint as the aggregate state of the Delta table every 10 commits. These checkpoints serve as the starting point to compute the latest state of the table. Without checkpoints, Delta would have to read the large collection of JSON files we saw before, representing each commit to the transaction log, to compute the state of a table; checkpoints allow readers to shortcut that cost. In addition, the column-level statistics that Delta Lake uses to perform data skipping are stored in the checkpoint. Since it is possible that a writer will fail while writing out one or more parts of a multi-part checkpoint, readers must only use a checkpoint when it is complete, meaning all the fragments are present. For performance reasons, readers should search for the most recent checkpoint that is available and complete.
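Going back to the CRC files for a moment, here is a generic illustration of how a cyclic redundancy check exposes corruption. It uses CRC-32 from Python's zlib purely for demonstration; it is not the actual format of Delta's .crc files:

```python
import zlib

def crc32_of(data: bytes) -> int:
    """Compute a CRC-32 checksum over a byte payload. Comparing the
    stored checksum with a recomputed one detects corruption from bad
    downloads or disk errors."""
    return zlib.crc32(data) & 0xFFFFFFFF

original = b"customer data payload"
# A single transposed character changes the checksum, exposing
# the corruption even though the sizes match.
corrupted = b"customer data paylaod"
```

Recomputing the checksum on read and comparing it to the stored value is the whole trick; matching values mean the bytes very likely survived intact.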
One more thing about checkpoints: for any given version, a checkpoint must only be created after the associated Delta file has been successfully written. So, let's look at this one and this one. We got checkpoint 10 and checkpoint 20, because every 10 commits, Delta writes a checkpoint file. Notice that the file sizes are different; this is the result of the operations that were performed between checkpoints 10 and 20. Awesome, so now we've looked at the nuts and bolts. The last part is history. Let's look at the historical records by using the DESCRIBE HISTORY command. Since my table is in Delta, I'm going to use delta.`<file path>` and just use the display command. The display command is native to Databricks, so in case you are running outside Databricks notebooks, you might have to use .show() or something like that. So let's look at this one. Awesome. Now, if you've watched the show "Travelers," or think about historians: those people had some evidence from the past, or they studied past events, to either understand the current state of things or use a past event to alter things for the future. This means that we need some record of events, some history behind our data, to change something or even to time travel, and it's the transaction layer that does this job. The Delta Lake transaction log has a well-defined open protocol that can be used by any system to read the log. It is because of the query commit history preserved by Delta that we have the references we need to change the future of our data applications. As you can see, the result of this history query shows us the version, timestamp, user ID, username, operation parameters, job, notebook, isolation level, which is very important, and some operation metrics. In a nutshell, the query commit history covers all of the versions, snapshots, and checkpoints, so let's elaborate on each of these.
Each snapshot is the state of a table at a given version, and it is defined by, as we saw, the version, the metadata of the table, the isolation level, the set of files present in the table, and a set of timestamps for files that were recently deleted, because, as we will see, when you delete, files are not immediately permanently removed. All of that is read through the query commit history. And when we say versioning, that means each snapshot is stored as a version, and a version is the result of an action that you perform on a Delta table. You can access any historical version of the data, meaning you can see what happened before and after an operation using the history. So far, with me? Awesome. We looked at all the nuts and bolts: how to look at the history, how to look at the logs, et cetera, right? Now let's think about the existence of Delta files for a second. The logs, versions and files that are being generated must live somewhere, in some system or some file storage. The log store is the general interface to all the underlying storage systems that are required to read and write the transaction log. Because storage systems do not necessarily provide atomicity and durability guarantees out of the box, Delta Lake transactional operations go through the log store API instead of accessing the storage systems directly. Any file written through the store must be made visible atomically. In other words, as I mentioned before, it should be visible in its entirety or not visible at all. It should not generate partial files. Partial files result in low data quality, and your downstream applications get affected. So only one writer must be able to create a file at the final destination, and the store offers consistent listing. All this is possible because Delta log transactions are implemented using multi-version concurrency control, also called the MVCC algorithm.
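The "visible in its entirety or not at all" guarantee is commonly implemented by writing to a temporary file and then renaming it into place in one atomic step. Here is a minimal sketch of that pattern on a local filesystem (plain Python, not Delta's actual log store implementation):

```python
import os
import tempfile

def atomic_write(path: str, contents: bytes) -> None:
    """Write contents so readers see either the full file or no file at all."""
    directory = os.path.dirname(path) or "."
    # Write to a temp file in the same directory (rename is only atomic
    # within one filesystem), then publish it with a single rename.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk first
        os.replace(tmp, path)     # atomic publish: no partial file is ever visible
    except BaseException:
        os.unlink(tmp)            # failed mid-write: leave no trace behind
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "00000000000000000011.json")
    atomic_write(target, b'{"commitInfo": {"operation": "DELETE"}}')
    with open(target, "rb") as f:
        print(f.read())
```

On a real object store like S3, rename is not atomic, which is exactly why Delta routes writes through a log store abstraction instead of assuming the filesystem behaves this way.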
This allows serializable isolation for consistent views. We already walked through what atomicity is and how conflicts are resolved if two writers are writing concurrently. Awesome. With me so far? I just hope I didn't bore you too much. Now that we understand all the concepts and the time machine itself, let's ride the time machine, and let me take you on an adventurous ride with Delta. But wait, before that, let's think about why we are even time traveling. There might be something we are aiming for. And also, why does Delta even care about storing all this information, right? We could just do the transactions and be done with it, run our experiments and be done with it. Why do we even bother storing all this information? So let me help you answer some questions. You might be thinking about governance: how we can protect people's information. Rollbacks: maybe we need to go back in time to change something that can fix the future. Audit via data lineage: ah, there might be an audit, and you are wondering if you have enough evidence for the auditors. Benchmarking: you might be thinking about what you can do to increase the efficiency of your applications. Or it might just be debugging: let's just fix whatever is broken, right? So these are the use cases that Burak already talked about previously. Let's see it in action. So let's talk about the governance use case first. Let's say you get a delete request as part of GDPR, which is aimed at protecting people's information. What that means for data engineers is that you need to locate the records in your Delta table and delete them within the specified time period. This, again, is where Delta can save you and help you address this request in a timely manner, because failure to do so can result in a lot of penalties for your organization. So I'm using Spark SQL again as the reader to read my data.
This select query is going to find matching records in my main table, which is the customer table, matching it against the table where I have the requested information. So, for example, if a user has requested to delete some information, the keys live in my customer_delete_keys table, and I just match a unique key against my main table to find the records and make this process simpler. So, this query resulted in some records. Those are the records which I need to get rid of, and the way I do that is with the DELETE command. All right, the delete query finished running. Now, let's display the history again. I just want to make sure that another version got added, to confirm that our delete operation went through, and the way to do that is just using the same DESCRIBE HISTORY command. Notice that there is a version 18 now. Previously, there were only 17 versions available, and notice this timestamp, which matches our current query time as well. Now, I also want to make sure that the matching records are deleted. So I'm going to run the matching query again here. It should return zero, because if the records were deleted, it should not match any existing records anymore. It's zero. That means our delete was successful. But delete doesn't permanently remove the records from the Delta table. Delta retains them for 30 days before the complete deletion. And why is that? Let's go to our next use case to see. Time travel also makes it easy to do rollbacks, so that you can go back in time, compute the changes against a previous snapshot and fix the future, as I said before, right? For example, say a GDPR pipeline job had a bug that accidentally deleted user information. You can easily fix the pipeline by fixing the accidental incorrect updates to the table or by restoring the table, so let's use this command here. I'm supplying the version I want to travel back to. Since I just performed the delete, that was version 18, right?
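The key-matching logic behind this GDPR delete can be sketched in plain Python with a hypothetical record layout (made-up keys and names, just to show the before/after counts; the demo itself does the same thing in Spark SQL against the customer and customer_delete_keys tables):

```python
customers = [
    {"key": 101, "name": "Ada"},
    {"key": 102, "name": "Grace"},
    {"key": 103, "name": "Edsger"},
]
delete_keys = {102}  # keys from the hypothetical customer_delete_keys table

def matches(table, keys):
    """Records in the main table that match a requested delete key."""
    return [row for row in table if row["key"] in keys]

print(len(matches(customers, delete_keys)))   # records that must be deleted

# The DELETE keeps only the non-matching rows.
customers = [row for row in customers if row["key"] not in delete_keys]

print(len(matches(customers, delete_keys)))   # re-running the match: zero left
```

Running the matching query after the delete and seeing zero rows is exactly the verification step shown in the demo.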
We can see that from here, it was version 18, and I want to go back to version 17 just to roll back the changes. So I'm going to use that in my query: version 17. This is the path where the data is present. And then I'm going to write using overwrite mode and save it to the same path. So let's do that. What this is doing is, Delta is doing its magic here, spinning the wheels fast so that we can reach the point in time we want to travel back to and fix the data. Recently, we also released the restore capability. You can now restore a Delta table to an earlier state, using either a version number or a timestamp. I think our query is finished, and I want to make sure that the records are restored, so I'm going to run this matching query again. As you can see, now we have our records back. That means we traveled in time. Well done. All right, so the next use case I want to talk about is audit and data lineage, so let's run the DESCRIBE HISTORY command again. Now you see how helpful the DESCRIBE HISTORY command is, right, and how helpful it is to reference it. Awesome. So, for data lineage, the transaction log serves as a single source of truth, the central record that tracks all the changes that users have ever made to a table. As the definitive record of every change that was ever made to a table, the Delta Lake transaction log offers users a verifiable data lineage that is useful for governance, audit and compliance purposes. This is the journey of your table and your data, from its initial state to the current time. So you can see that we have this lineage information, and you can use it for showing the transformation of your data over time. It describes a certain data set's origin, movement, characteristics and quality, which is very helpful for understanding what happened to your data.
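The rollback itself, read the good version and overwrite the current state, can be caricatured with a toy in-memory version store (assumed names and data; in the demo this is a versioned read followed by an overwrite write back to the same path). Note the important detail: the rollback never rewrites history, it lands as a brand-new version:

```python
# Toy version store: each entry is a full snapshot of the table at that version.
snapshots = {
    17: {"alice", "bob", "carol"},  # version 17: before the bad delete
    18: {"alice", "carol"},         # version 18: "bob" was deleted by mistake
}

def read_version(store, version):
    """Time travel: read the table as it was at a given version."""
    return set(store[version])

def overwrite(store, data):
    """An overwrite never edits history; it just appends a new version."""
    new_version = max(store) + 1
    store[new_version] = set(data)
    return new_version

# Roll back: read the good snapshot and write it out as the newest version.
restored = overwrite(snapshots, read_version(snapshots, 17))

print(restored)                               # the rollback is itself a commit
print(snapshots[restored] == snapshots[17])   # the records are back
```

Because version 18 is still in the store, you can audit the mistake later even though the table's current state matches version 17 again.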
And then benchmarking, another use case. I'm not going to run the history command again, because the same command and the same results can be used for benchmarking as well. You can see that we have the operation, operation parameters, notebook, isolation level, and then we have operation metrics, which is, again, very helpful information. Why do I say it's helpful? Because all these metrics are available when you run the history command using Databricks Runtime 6.5 or above, and you can use this information for running benchmarks, like: how is my query performing, what is the current state versus what did the past look like? So you can see how much optimization you have done. The next use case is debugging. The Delta logs can also be used to trace the origin of an inadvertent change or a bug in the pipeline, so you can roll back to the exact action that caused it. This allows you to fix what's broken. Another use case, which I'm not performing here, is the ML use case. Let's say you want to do time travel for your ML models and experiments going into production. When you are doing machine learning, you may want to archive the certain version of a table on which you trained the model. Future models can be tested against that archived data set, and time travel is the ability that allows you to do that. So those were the use cases. I hope you now understand why the data is stored, why Delta Lake even cares about it, and why we even do time travel, right? It is because of all these reasons. If you have any other use case for time travel, please let me know. One last thing I wanted to talk about before we move on to the next section is vacuum. I already mentioned that when you run the delete operation, the data doesn't get removed permanently for up to 30 days. Vacuum is useful to delete the data from storage completely, for compliance reasons. By default, vacuum retains all the data needed for the last seven days.
So, let's run this command. Delta is giving us an error. And you know why? Because Delta protects you by giving this warning, to save you from making any permanent changes to your data, or in case a downstream reader or application is referencing your table, so they don't end up with, like, an error result or something, right? So if you are certain that there are no operations being performed on this table, then you can turn the check off by setting the Delta retention duration check property to false. So, I'm going to run that. Once that is applied, the same command now runs successfully. As you saw, Delta offers several safeguards to ensure your data is safe. A few things to note about vacuum: the ability to time travel back to a version older than the retention period is lost after you run vacuum. So if you ever wonder why you are not able to time travel after vacuuming, it's because you already got rid of that history. And vacuum only deletes data files, not the log files. Log files are deleted automatically and asynchronously after checkpoint operations. The default retention period for log files is 30 days, which is configurable, again, by setting the log retention duration property. Let's think about the GDPR use case again. If there is a less-than-30-days window to comply with the request, and the requester has asked you to completely forget their information in your Delta table, you can use vacuum and apply the retention policy. Another note to satisfy the delete requirement: you also need to delete it from your blob storage. Right now, we are only doing it in Delta, but when you first performed that ingestion, you may have your raw data stored somewhere in your blob storage. As a best practice, whenever you are writing your ETL pipelines, it is best to apply a retention policy from the beginning itself so that you don't have to worry about it later. Awesome.
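The retention check that vacuum applies can be sketched as: among the files tombstoned by earlier deletes, physically remove only those that are not referenced by the current version and were removed longer ago than the retention window (seven days here, matching the stated default; the data structures are assumed for illustration):

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # vacuum's default retention window

def vacuum(live_files, tombstones, now):
    """tombstones maps a file to the time it was removed from the table.
    Returns (files safe to physically delete, tombstones still retained)."""
    to_delete = {
        f for f, removed_at in tombstones.items()
        if f not in live_files and now - removed_at > RETENTION
    }
    retained = {f: t for f, t in tombstones.items() if f not in to_delete}
    return to_delete, retained

now = datetime(2020, 6, 25)
live = {"part-0.parquet"}
tombstones = {
    "part-1.parquet": now - timedelta(days=30),  # outside retention: reclaimed
    "part-2.parquet": now - timedelta(days=2),   # still within retention: kept
}

deleted, kept = vacuum(live, tombstones, now)
print(sorted(deleted), sorted(kept))
```

After the run, any table version that referenced part-1.parquet can no longer be time traveled to, which is exactly the trade-off described above.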
So that was adventurous, right, all the different things that you could do. Now, let me talk about the different ways you can do time travel. We have timestampAsOf, the @ parameter with a timestamp, versionAsOf, and the @ parameter with a version number. All these options are available when you want to do time travel, and this information can be retrieved from your history table. You see how important DESCRIBE HISTORY is? Of course, this is retrieved from my DESCRIBE HISTORY output. I'm using timestampAsOf to do the time travel and display my records, and this is going to print my table counts for this particular table and this particular timestamp. Next, I'm gonna use the second method, which is the @ parameter with the timestamp. What I'm doing here is just removing all the special characters from the timestamp and giving the raw value directly. And again, I'm doing a count on the records, which brings this value. It should match this one, because it's the same thing; I'm just using a different method to do time travel. The next thing I mentioned was version, so let's look at the version. This timestamp was added as of version 11, and this one as of version eight. That's what I have here. And all this information I'm getting from the DESCRIBE HISTORY output. For your reference, let's go back: this is where I'm getting the timestamp, and this is where I'm getting the version information. Great. You can do a similar thing with the version as well, using the @ parameter, where you just specify which version you want to travel back to. It should also give the same records, because we are basically going back to the same snapshot, just using different methods to do time travel. Awesome. One more thing before we summarize the session: data retention. We talked about the time machine, its properties, how time travel works, and what operations you can perform. One very important thing to remember is data retention.
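However it is spelled, timestamp-based time travel boils down to one lookup: find the latest commit whose timestamp is at or before the requested one. A sketch against a toy commit log (made-up versions and timestamps standing in for DESCRIBE HISTORY output):

```python
from datetime import datetime

# version -> commit timestamp, as a DESCRIBE HISTORY listing would show it
history = {
    8:  datetime(2020, 6, 20, 9, 0),
    11: datetime(2020, 6, 21, 14, 30),
    17: datetime(2020, 6, 24, 8, 15),
    18: datetime(2020, 6, 25, 10, 0),
}

def version_as_of(history, ts):
    """Latest version committed at or before the requested timestamp."""
    candidates = [v for v, committed in history.items() if committed <= ts]
    if not candidates:
        raise ValueError("requested timestamp is before the table's first commit")
    return max(candidates)

print(version_as_of(history, datetime(2020, 6, 22)))  # resolves to version 11
```

This is why querying by timestamp and querying by the version it resolves to return identical counts: both methods end up reading the same snapshot.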
You can only travel back so far, depending on how much data was retained and what time limits we have set for our data. As I mentioned, if you run vacuum, you'll lose the historical information and you won't be able to go back in time. Delta Lake also guarantees backward compatibility: a higher version of Databricks Runtime is always able to read data that was written by a lower version. Delta Lake will occasionally break forward compatibility, though. A lower version of Databricks Runtime may not be able to read and write data that was written by a higher version of Databricks Runtime. This is because we introduce a lot of optimizations, and how an operation works can change in a higher version. So if you try to read and write the table with a version of Databricks Runtime that is too low, you will get an error. Of course, you don't have to worry about this if you are using open source. Finally, back to the present time. I hope you enjoyed the ride. To summarize, we saw how time travel is possible with data and how it improves developer productivity. Data engineers can now simplify their pipelines and roll back bad writes. Data scientists can manage their experiments better. Data analysts can do easy reporting. Organizations can standardize on one big data engine while being time travel nerds. We are very thrilled to see what you will be able to accomplish with the knowledge you gained from our time travel session. Burak and I really appreciate you joining us on this adventure. We hope you enjoyed it, and please leave us feedback if you did. We would love to see your feedback. And please ask us any questions in the chat below. Thank you so much. We hope to see you in another session soon. Thank you.
Vini Jaiswal is a Senior Developer Advocate at Databricks, where she helps data practitioners to be successful in building on Databricks and open source technologies like Apache Spark, Delta, and MLflow. She has extensive experience working with Unicorns, Digital Natives and some of the Fortune 500 companies helping with the successful implementation of Data and AI use cases in production at scale for on-premise and cloud deployments. Vini also worked as the Data Science Engineering Lead under Citi's Enterprise Operations & Technology group and interned as a Data Analyst at Southwest Airlines. She holds an MS in Information Technology and Management from the University of Texas at Dallas.
Burak Yavuz is a Software Engineer and Apache Spark committer at Databricks. He has been developing Structured Streaming and Delta Lake to simplify the lives of Data Engineers. Burak received his MS in Management Science & Engineering at Stanford and his BS in Mechanical Engineering at Bogazici University, Istanbul.