Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc. This talk will share the research that we did for the comparison about the key features and design these table format holds, the maturity of features, such as APIs expose to end user, how to work with compute engines and finally a comprehensive benchmark about transaction, upsert and mass partitions will be shared as references to audiences.
- Hi everybody. This is Junjie. I hope you're doing great and you stay safe.
My topic is a thorough comparison of Delta Lake, Iceberg, and Hudi. I'm a software engineer, working at Tencent Data Lake Team. So, I've been focused on big data area for years. This is today's agenda. So firstly I will introduce the Delta Lake, Iceberg and Hudi a little bit. And then we'll deep dive to key features comparison one by one. And then we'll have talked a little bit about the project maturity and then we'll have a conclusion based on the comparison. So as we know on Data Lake conception having come out for around time.
What features are expected for data lakes?
So what features shall we expect for Data Lake?
I did start an investigation and summarize some of them listed here. So first I think a transaction or ACID ability after data lake is the most expected feature. So we start with the transaction feature but data lake could enable advanced features like time travel, concurrence read, and write.
Secondary, definitely I think is supports both Batch and Streaming. And since streaming workload, usually allowed, data to arrive later. So we also expect that Data Lake have features like data mutation or data correction, which would allow the right data to merge into the base dataset and the correct base dataset to follow for the business view of the report for end-user.
Also as the table made changes around with the business over time. So we also expect that data lake to have features like Schema Evolution and Schema Enforcements, which could update a Schema over time.
And then we could use the Schema enforcements to prevent low-quality data from the ingesting. Also, we hope that Data Lake is, independent of the engines and the underlying storage is practical as well. So that data will store in different storage model, like AWS S3 or HDFS.
More engines like Hive or Presto and Spark could access the data. So last thing that I've not listed, we also hope that Data Lake has a scannable method with our module, which couldn't start the previous operation and files for a table. So, the projects Data Lake, Iceberg and Hudi are providing these features, to what they like. So let's take a look at them.
So Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and the big data workloads. So from it's architecture, a picture of it if we could see that it has at least four of the capability we just mentioned. like support for both Streaming and Batch.
It has a Schema Enforcement to prevent low-quality data, and it also has a good abstraction on the storage layer, two allow more various storage layers.
And it also has the transaction feature, right?
Yeah, Iceberg, Iceberg is originally from Netflix. It has been donated to the Apache Foundation about two years. So like Delta it also has the mentioned features.
So Hudi is yet another Data Lake storage layer that focuses more on the streaming processor. As you can see in the architecture picture, it has a built-in streaming service, to handle the streaming things.
So here's a quick comparison. So as you can see in table, all of them have all.
All these projects have the same, very similar feature in like transaction multiple version, MVCC, time travel, etcetera. So, some of them may not have Haven't been implemented yet but I think that they are more or less on the roadmap.
However, the details behind these features is different from each to each.
I think understand the details could help us to build a Data Lake match our business better.
Transaction model: Delta Lake
So, let's take a look at the feature difference.
So firstly transaction.
How? So Delta Lake has a transaction model based on the Transaction Log box or DeltaLog.
So it logs the file operations in JSON file and then commit to the table use atomic operations. The isolation level of Delta Lake is write serialization. Which means, it allows a reader and a writer to access the table in parallel. Well if there are two writers try to write data to table in parallel then each of them will assume that there's no changes on this table.
Then it will unlink before commit, if we all check that and if there's any changes to the latest table. Then if there's any changes, it will retry to commit.
So, Delta Lake has optimization on the commits. It will checkpoint each thing commit into each thing commit Which means each thing disem into a pocket file.
So as to improve the read performance.
The atomicity is guaranteed by HDFS rename or S3 file writes or Azure rename without overwrite. So user with the Delta Lake transaction feature.
A user could do the time travel query according to the timestamp or version number.
Transaction model: Apache Iceberg
Well as per the transaction model is snapshot based. A snapshot is a complete list of the file up in table. The table state is maintained in Metadata files. All change to the table state create a new Metadata file, and the replace the old Metadata file with atomic swap. So like Delta Lake, it apply the optimistic concurrency control And a user could able to do the time travel queries according to the snapshot id and the timestamp.
Transaction model: Apache Hudi
So Hudi's transaction model is based on a timeline, A timeline contains all actions performed on the table at different instance of the time. The timeline could provide instantaneous views of table and support that get data in the order of the arrival. It also apply the optimistic concurrency control for a reader and a writer. So a user could also do a time travel according to the Hudi commit time.
Data Mutation: Delta Lake
So next one yeah. Data mutation.
So currently both Delta Lake and Hudi support data mutation while Iceberg haven't supported. The community is working in progress. So Delta Lake's data mutation is based on Copy on Writes model. Basically it needed four steps to tool after it. So first it will find the file according to the filter expression and then it will load files as dataframe and update column values according to the...
While the logical file transformation. And then it will save the dataframe to new files.
And the finally it will log the files toolkit and add it to the JSON file and commit it to a table right over the atomic ration. So Delta Lake provide a set up and a user friendly table level API.
Like update and delete and merge into for a user. So I would say like, Delta Lake data mutation feature is a production ready feature, while Hudi's...
Data Mutation: Apache Hudi
Since Hudi focus more on the streaming processing.
And because the latency is very sensitive to the streaming processing. So Hudi has two kinds of the apps that are data mutation model. Both of them a Copy on Write model and a Merge on Read model. When a user profound Copy on Write model, it basically...
The process is what is similar to how Delta Lake is built without the records, and then update the records according to the app to our provided updated records. And then it will write most recall to files and then commit to table. So since latency is very important to data ingesting for the streaming process. So Hudi provide indexing to reduce the latency for the Copy on Write on step one. It will provide a indexing mechanism that mapping a Hudi record key to the file group and ids. So that the file lookup will be very quickly. So currently they support three types of the index. In- memory, bloomfilter and HBase. So when the data ingesting, minor latency is when people care is the latency.
Most will follow data ingesting.
They can perform licking the pride, the marginal rate table, and the Hudi will stall at delta rocks in Delta records into our format.
Into our format in block file and then it will unearth a subsequential reader will fill out the treater records according to those log files.
It also will schedule the period compaction to compact our old files to pocket, to accelerate the read performance for the later on access.
So Hudi provide table level API upsert for the user to do data mutation.
Data Mutation: Apache Iceberg
Well, as for Iceberg, currently Iceberg provide, file level API command override.
A user could use this API to build their own data mutation feature, for the Copy on Write model.
The community is for small on the Merge on Read model. Basic. The design is ready and basically it will, start the row identity of the recall to drill into the precision based three file. And the equality based that is fire then the after one or subsequent reader can fill out records according to these files.
Data Streaming Support: Delta Lake
Yeah next one is streaming support.
Yeah, there's no doubt that, Delta Lake is deeply integrated with the Sparks structure streaming. So it could serve as a streaming source and a streaming sync for the Spark streaming structure streaming. A user could control the rates, through the maxBytesPerTrigger or maxFilesPerTrigger.
It also has a small limitation. Currently you cannot handle the not paying the model.
Data Streaming Support: Apache Hudi
So as we mentioned before, Hudi has a building streaming service.
So named on Dell has been that they take a responsible for it, take a responsibility for handling the streaming seems like it provides exactly once a medical form data ingesting like a cop car.
And it's also a spot JSON or customized customize the record types. It could mention the checkpoints rollback recovery, and also spot for bragging transmission for data ingesting.
So further incremental privates or incremental scam. Hudi provide a utility named HiveIcrementalPuller which allow user to do the incremental scan while the high acquire language, Since Hudi implemented a Spark data source interface. So a user can also, do the profound incremental scan while the Spark data API with option beginning some time.
Data Streaming Support: Apache Iceberg
Well, since Iceberg doesn't bind to any streaming engines, so it could support a different type of the streaming countries it already support spark spark, structured streaming, and the community is building streaming for Flink as well.
Yeah another important feature of Schema Evolution. So Delta Lake and the Hudi both of them use the Spark schema. It's a table schema.
and it arouse to...
The Schema Evolution will happen when the right grind, right data, when you sort the data or merge the data into Baystate, if the incoming data has a new schema, then it will merge overwrite according to the writing up options. Well Iceberg handle Schema Evolution in a different way. Since Iceberg has an independent schema abstraction layer, which is part of Full schema evolution.
Which means you can update to the, we can update the table schema increase, and it also spark tradition evolution, which is very important. Imagine that you have a dataset partition by brid at beginning and as the business grows over time, you want to change the partition to finer granularity such as hour or minute, then you can update the partition spec, shoulder partition API provided by Iceberg.
Project Maturity: Integrations
Yeah so time that's all the key feature comparison So I'd like to talk a little bit about project maturity.
So firstly the upstream and downstream integration.
Delta Lake implemented, Data Source v1 interface. So a user could read and write data, while the spark data frames API. And also the Delta community is still connected that enable could enable more engines to read, great data from tables like Hive and Presto. So iceberg the same as the Delta Lake implemented a Data Source v2 interface from Spark of the Spark. It also implements the MapReduce input format in Hive StorageHandle.
So, basically, if I could write data, so the Spark data....API or it's Iceberg native Java API, and then it could be read from while any engines that support equal to format or have started a handler.
The community is also working on support.... as well.
So I know that Hudi implemented, the Hive into a format so that it could read through the Hive hyping phase. It also implemented Data Source v1 of the Spark. So Hive could store write data through the Spark Data Source v1. So as well, besides the spark data frame API to write Write data, Hudi can also as we mentioned before Hudi has a built-in DeltaStreamer.
So it's used for data ingesting that cold write streaming data into the Hudi table.
Project Maturity: Query Performance
Some things on query performance. Yeah, since Delta Lake is well integrated with the Spark, so it could enjoy or share the benefit of performance optimization from Spark such as Vectorization, Data skipping via statistics from Parquet And, Delta Lake also built some useful command like Vacuum to clean up update the task in optimize command too. You used to compare the small files into a big file that would mitigate the small file problems. So Hudi Spark, so we could also share the performance optimization. And Hudi has also has a convection, functionality that could have converted the DeltaLogs.
Delta records into parquet to separate the rate performance for the marginal real table.
As for Iceberg, since Iceberg does not bind to any specific engine. So it has some native optimization, like predicate push staff for tools, for the v2 And it has a vectorized reader, a native Vectorised reader, and it support it.
Iceberg stored statistic into the Metadata fire. So that it could help datas as well.
So it was to mention that Iceberg. It has a advanced feature and a hidden partition on which you start the partition values into a Metadata of file instead of file listing.
We could fetch with the partition information just using a reader Metadata file.
So it will help to help to improve the job planning plot.
Project Maturity: Tooling
Yeah the tooling, that's the tooling yeah.
So I know that as we know that Data Lake and Hudi provide central command line tools like in Delta Lake vaccuum history generates convert to.
Set up the authority to operate directly on tables. And Hudi also provide auxiliary commands like inspecting, view, statistic and compaction.
And it could many directly on the tables.
And well it post the metadata as tables so that user could query the metadata just like a sickle table.
So I suppose has a building a catalog service, which is used to enable the DDL and TMO spot So Hudi also has as we mentioned has a lot of utilities, like a Delta Streamer, Hive Incremental Puller. And Hudi, Deltastream data ingesting and table off search. So, yeah, I think that's all for the...
Oh, maturity comparison yeah. So, based on these comparisons and the maturity comparison.
Official comparison and maturity comparison we could have a concussion and Delta Lake has the best investigation, with the best integration with Spark ecosystem. And it could be used out of box.
And Iceberg has a great design in abstraction that could enable more potentials and extensions and Hudi I think it provides most of the convenience for the streaming process.