Today, we are excited to announce the public preview of Low Shuffle Merge in Delta Lake, available on AWS, Azure, and Google Cloud.
This new and improved MERGE algorithm is substantially faster and provides significant cost savings for our customers, especially for common use cases such as updating a small number of rows in a given file. Together with Photon, the next-generation query engine, Low Shuffle Merge speeds up MERGE operations even further and lowers compute costs. Additionally, Low Shuffle Merge now maintains existing data clustering, which provides better performance out of the box and reduces how often you need to run Z-order optimization on the data.
Low Shuffle Merge provides better performance by processing unmodified rows in a separate, more streamlined processing mode, instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly. Low Shuffle Merge also removes the need for users to re-run the OPTIMIZE ZORDER BY command after performing a MERGE operation. For data that has already been sorted (using OPTIMIZE ZORDER BY), Low Shuffle Merge maintains that sorting for all records that are not modified by the MERGE command. These improvements save significant time and compute costs.
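For reference, the Z-ordering step mentioned above is a one-time command like the sketch below. The table name events and the column event_type are hypothetical and used only for illustration.

```sql
-- Hypothetical table and column names, for illustration only.
-- Compacts the table's files and co-locates rows with similar event_type values.
OPTIMIZE events ZORDER BY (event_type);
```

With Low Shuffle Merge enabled, rows left untouched by a later MERGE keep this layout, so the command should not need to be repeated after each MERGE.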
Enabling Low Shuffle Merge is free and easy to do. Upgrade your cluster to Databricks Runtime 9.0 and set the following Spark configuration:
SET spark.databricks.delta.merge.enableLowShuffle = true;
You can upgrade to the latest Databricks Runtime release via the Clusters page in the Databricks UI (learn more). You can then enable Low Shuffle Merge by setting the above configuration before running MERGE INTO commands in a notebook, or at the cluster level so that it applies automatically to all MERGE commands. When the feature becomes Generally Available later this year, it will be turned on by default after you upgrade to the latest Databricks Runtime release.
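As a minimal sketch of the notebook workflow, the configuration is set first and the MERGE INTO command follows in the same session. The tables customers and customer_updates and the key column customer_id are hypothetical; substitute your own Delta tables and join condition.

```sql
-- Enable Low Shuffle Merge for the current session (configuration from this post).
SET spark.databricks.delta.merge.enableLowShuffle = true;

-- Hypothetical upsert: customers is the target Delta table,
-- customer_updates holds the changed and new rows.
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

No change to the MERGE statement itself is needed; the setting only affects how the operation is executed.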
We strongly recommend using Photon with Low Shuffle Merge to get even faster performance and more cost savings. Learn more about Photon in this blog.