by Scott Sandre, Ryan Zhu, Denny Lee and Vini Jaiswal
With tremendous contributions from the open-source community, the Delta Lake community recently announced the release of Delta Lake 1.1.0 on Apache Spark™ 3.2. Similar to Apache Spark, the Delta Lake community has released Maven artifacts for both Scala 2.12 and Scala 2.13, as well as a PyPI package (delta-spark).
This release includes notable improvements around the MERGE operation and nested field resolution, as well as support for generated columns in a MERGE operation, Python type annotations, arbitrary expressions in ‘replaceWhere’, and more. Keeping up to date with the innovation in Apache Spark is critical for Delta Lake: it means you can take advantage of increased performance in Delta Lake built on the features available in the Apache Spark 3.2.0 release.
This post will go over the major changes and notable features in the new 1.1.0 release. Check out the project’s GitHub repository for details.
Want to get started with Delta Lake right away instead? Learn more about what Delta Lake is and use this guide to build lakehouses with Delta Lake.
Other notable features in the Delta Lake 1.1.0 release are as follows:
- Performance improvements in the MERGE operation on partitioned tables
- Support for arbitrary expressions in the ‘replaceWhere’ DataFrame writer option
- Python type annotations for the Delta Lake Python API
In the next sections, let’s dive deeper into the most notable features of this release.
Let’s start with the MERGE performance improvements and take a look at an example:
In the data set used for this example, customers1 and customers2 have 200,000 rows and 11 columns with information about customers and sales. To showcase the difference that enabling the flag makes when running a MERGE operation on bare-minimum resources, we limited the Spark job to 1 GB of RAM and 1 core, running on a 2019 MacBook Pro laptop. These numbers can be reduced further by tweaking the RAM and cores used. A source table, customers_merge, with 45,000 rows was used to perform MERGE operations on the former tables. The full script and results for the example are available here.
To ensure that the feature is disabled, you can run the following command:
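A minimal sketch in PySpark, assuming the table names from the example above and a hypothetical customer_id join key (the config key, spark.databricks.delta.merge.repartitionBeforeWrite.enabled, is the flag discussed here):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Disable automatic repartitioning of MERGE output before writing.
spark.conf.set(
    "spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "false"
)

# Hypothetical join key; the example dataset's actual key may differ.
target = DeltaTable.forName(spark, "customers1")
source = spark.table("customers_merge")

(target.alias("t")
    .merge(source.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```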
Results:
Note: The full operation took 19.66 minutes while the feature flag was disabled. You can refer to this full result for the details of the query.
For partitioned tables, a MERGE can produce a much larger number of small files than the number of shuffle partitions. This is because every shuffle task can write multiple files into multiple partitions, which can become a performance bottleneck. To enable a faster MERGE operation on our partitioned table, let's enable repartitionBeforeWrite using the code snippet below.
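A minimal sketch of enabling the flag (the same config key as above, now set to true):

```python
# Enable automatic repartitioning of MERGE output data by the table's
# partition columns before writing.
spark.conf.set(
    "spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true"
)
```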
This allows the MERGE operation to automatically repartition the output data of partitioned tables before writing to files. In many cases, it helps to repartition the output data by the table’s partition columns before writing it, ensuring better performance out-of-the-box for both the MERGE operation and subsequent read operations. Let’s run the MERGE operation on our table customer_t0 now, as sketched below.
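A sketch of that run, reusing the session and the hypothetical join key from the earlier snippet:

```python
# Same MERGE as before, now with repartitionBeforeWrite enabled and the
# partitioned table customer_t0 as the target.
target = DeltaTable.forName(spark, "customer_t0")

(target.alias("t")
    .merge(spark.table("customers_merge").alias("s"),
           "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```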
Note: After enabling the feature “repartitionBeforeWrite”, the MERGE query took 7.68 minutes, roughly a 2.5x speedup over the run with the flag disabled. You can refer to this full result for the details of the query.
Tip: Organizations working on GDPR and CCPA use cases will appreciate this feature, as it provides a cost-effective way to do fast point updates and deletes without rearchitecting the entire data lake.
To atomically replace all the data in a table, you can use overwrite mode:
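For instance, a minimal sketch using the DataFrame writer (the table names here, customer_t1 as the source and customer_t10 as the target, are illustrative):

```python
# Atomically replace the entire contents of the target Delta table.
(spark.table("customer_t1")
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("customer_t10"))
```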
With Delta Lake 1.1.0 and above, you can also selectively overwrite only the data that matches an arbitrary expression using DataFrames. The following command atomically replaces records with the birth year ‘1924’ in the target table, which is partitioned by c_birth_year, with the data in customer_t1:
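A sketch of that command, again with an illustrative target table name:

```python
# Selectively overwrite only the rows matching the replaceWhere expression;
# filtering the source first ensures the written data satisfies the predicate.
(spark.table("customer_t1")
    .where("c_birth_year = 1924")
    .write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "c_birth_year = 1924")
    .saveAsTable("customer_t10"))
```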
On Delta Lake 1.1.0 and above, this query runs successfully. However, on Delta Lake releases before 1.1.0, the same query would result in an error.
You can try out the old behavior yourself by disabling the replaceWhere flag.
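One way to do that, assuming the spark.databricks.delta.replaceWhere.dataColumns.enabled config (the flag that, to our understanding, gates arbitrary replaceWhere expressions in 1.1.0):

```python
# Assumption: disabling this config restores the pre-1.1.0 behavior, where
# replaceWhere accepted only partition-column predicates.
spark.conf.set(
    "spark.databricks.delta.replaceWhere.dataColumns.enabled", "false"
)
```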
Python type annotations improve auto-completion performance in editors that support type hints. Optionally, you can enable static checking through mypy or built-in tools (for example, PyCharm). Here is a video from the original author of the PR, Maciej Szymkiewicz, describing the changes in the behavior of Python within Delta Lake 1.1.
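For example, a small sketch of the kind of mistake static checking can now catch (DeltaTable.history takes an integer limit; the commented-out call is what mypy would flag):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
dt = DeltaTable.forName(spark, "customers1")

dt.history(10)      # OK: annotated as accepting an int limit
# dt.history("10")  # mypy: incompatible type "str"; expected "int"
```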
We hope you got to see some cool Delta Lake features through this blog post. We’re excited to find out where you’re using these features; if you have any feedback or examples of your work, please share them with the community.
The lakehouse has become the new norm for organizations wanting to build data platforms and architecture, thanks in large part to Delta Lake, which has allowed more than 5,000 organizations to build successful production lakehouse platforms for their data and artificial intelligence applications. With data growing exponentially, it’s important to process it quickly and reliably. With the improvements in version 1.1, Delta Lake lets developers make their lakehouses run much faster and keep up the pace of innovation.
Interested in the open-source Delta Lake?
Visit the Delta Lake online hub to learn more. You can join the Delta Lake community via Slack and Google Group, track all the upcoming releases and planned features in GitHub milestones, and try out Managed Delta Lake on Databricks with a free account.
Credits
We want to thank the following contributors for updates, doc changes, and contributions in Delta Lake 1.1.0: Abhishek Somani, Adam Binford, Alex Jing, Alexandre Lopes, Allison Portis, Bogdan Raducanu, Bart Samwel, Burak Yavuz, David Lewis, Eunjin Song, ericfchang, Feng Zhu, Flavio Cruz, Florian Valeye, Fred Liu, gurunath, Guy Khazma, Jacek Laskowski, Jackie Zhang, Jarred Parrett, JassAbidi, Jose Torres, Junlin Zeng, Junyong Lee, KamCheung Ting, Karen Feng, Lars Kroll, Li Zhang, Linhong Liu, Liwen Sun, Maciej, Max Gekk, Meng Tong, Prakhar Jain, Pranav Anand, Rahul Mahadev, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Shuting Zhang, Tathagata Das, Terry Kim, Tom Lynch, Vijayan Prabhakaran, Vítor Mussa, Wenchen Fan, Yaohua Zhao, Yijia Cui, YuXuan Tay, Yuchen Huo, Yuhong Chen, Yuming Wang, Yuyuan Tang, and Zach Schuermann.