Announcing GA of Predictive I/O for Updates: Faster DML Queries, Right Out of the Box

Over 15 trillion unnecessary row writes prevented, and we’re just getting started.

Published: November 2, 2023

Platform & Products & Announcements4 min read

by Sirui Sun, Ala Luszczak, Sabir Akhadov, Andreas Chatzistergiou, Carmen Kwan, Ole Sasse, Tom van Bussel, Lars Kroll, Pengfei Xu, Polo-François Poli, Bart Samwel, Himanshu Raja, Christos Stavrakakis, Venki Korukanti, Tathagata Das and Nicolas Poggi

We’re excited to announce the General Availability of Predictive I/O for updates.

This capability harnesses Photon and Lakehouse AI in order to significantly speed up Data Manipulation Language (DML) operations like MERGE, UPDATE and DELETE, with no performance changes to read queries.

Predictive I/O for updates accomplishes this through an AI model that intelligently applies Deletion Vectors, a Delta Lake capability, which allows for the tracking of deleted rows through hyper-optimized bitmap files. The net result is significantly faster queries with significantly less overhead for data engineering teams.

Shortly following GA, we will enable Predictive I/O for updates by default on new tables. To immediately opt-in, refer to our documentation or the steps at the very end of this post.

Traditional approaches: pick your poison

Traditionally, there have been two approaches to processing DML queries, each of which had different strengths and weaknesses.

The first, most common approach is “copy-on-write”. Query engines would identify the files that contain the rows needing modification, and then rewrite all unmodified rows to a new file - filtering out the deleted rows and adding updated ones.

Under this approach, writes can be very expensive. It is common that with DML queries, only a few rows are modified. With copy-on-write, this results in a rewrite of nearly the entire file - even though very little has changed!

Image shows three files that have been modified. Copy-on-write requires each file to be rewritten based on the changes, even if the data updates are small.

An alternative approach is “merge-on-read”. Instead of rewriting the entire file, log files are written that track rows that were deleted. The reader then pieces together the table by reading both the data files and the additional log files.

With merge-on-reads, writes are much faster - no more rewriting unchanged files. However, reads become increasingly expensive over time as larger log files are generated, all of which the reader must piece together. The end user also has to determine when to “purge” those log files - rewriting the log files and data files into a new data file - in order to return to maintain reasonable read performance.

Image shows three files that have been modified. In merge-on-read, the 3 changed files plus their log file result in expensive reads.

The end result: the traditional approaches both have their own strengths and weaknesses, forcing you to “pick your poison”. For each table, you need to ask: is copy-on-write or merge-on-read better for this use case? And if you went with the latter, how often should you purge your log files?

Introducing Predictive I/O for updates - AI provides the best of all worlds

Predictive I/O for updates delivers the best of all worlds - lightning-fast DML operations and great read performance, all without the need for users to decide when to “purge” the log files. This is achieved using an AI model to automatically determine when to apply and purge Delta Lake Deletion Vectors.

The tables written by Predictive I/O for updates remain in the open Delta Lake format, readable by the Delta Lake ecosystem of connectors, including Trino and Spark running OSS Delta 2.3.0 and above.

In benchmarks of typical Data Warehousing MERGE workloads, Predictive I/O for updates delivers 10x performance improvements over the Low-Shuffle Merge technique used by Photon before.

Image shows that Classic MERGE takes the longest amount of time. Low-shuffle MERGE is cheaper, and Predictive I/O for Updates reduces the MERGE duration by up to 10x.

Since the Public Preview of Predictive I/O for updates was announced in April, we’ve worked with hundreds of customers who have successfully used the capability to get massive performance gains on their DML queries.

During that time, Predictive I/O for updates has written many billions of Deletion Vectors, and we’ve used that data to further refine the AI models which are used to determine when best to apply Deletion Vectors. We’ve found that in total, Deletion Vectors have prevented the needless rewrites of over 15 trillion rows, which would’ve otherwise been written under copy-on-write approaches. It is no wonder that customers reported significant speedups on their DML workloads:

Coming soon: Predictive I/O for updates enabled right out of the box for newly created tables

The promising results from Public Preview have given us the confidence not only to take the capability to General Availability but also to begin enabling Predictive I/O by default for newly created tables. These changes will also apply to enabling Deletion Vectors even for clusters without Photon enabled - which should still see a performance boost (though smaller in magnitude than Predictive I/O for updates).

This improvement will help you get performance improvements right out of the box without needing to remember to set the associated table properties.

You can enable Predictive I/O for updates across your workloads through a new workspace setting. This setting is available now on Azure and AWS, and will be available in GCP coming soon. To access it:

As a workspace admin, go to Admin Settings for the workspace
Select the Workspace settings tab
Go to the setting titled Auto-Enable Deletion Vectors

This setting takes effect for all Databricks SQL Warehouses and clusters with Databricks runtime 14.0+.

Alternatively, this same setting can be used to opt out of enablement by default, by simply setting the setting to Disabled.

Enable Predictive I/O today to bring the power of AI to supercharge your DML queries! And look out for more AI-powered Databricks capabilities soon to come.

What's next?

December 9, 2024/8 min read

Streamline AI Agent Evaluation with New Synthetic Data Capabilities

December 12, 2024/4 min read

Traditional approaches: pick your poison

Introducing Predictive I/O for updates - AI provides the best of all worlds

Coming soon: Predictive I/O for updates enabled right out of the box for newly created tables

Never miss a Databricks post

Sign up

What's next?

Streamline AI Agent Evaluation with New Synthetic Data Capabilities

Making AI More Accessible: Up to 80% Cost Savings with Meta Llama 3.3 on Databricks