
Announcing Automatic Liquid Clustering

Optimized data layout for up to 10x faster queries


Summary

  • Automatic Liquid Clustering, powered by Predictive Optimization, automates clustering key selection to continuously improve query performance and lower costs.
  • Robust selection processes and continuous monitoring keep tables optimized.
  • TCO is minimized by automatically evaluating whether the performance gains outweigh the costs.

We’re excited to announce the Public Preview of Automatic Liquid Clustering, powered by Predictive Optimization. This feature automatically applies and updates Liquid Clustering columns on Unity Catalog managed tables, improving query performance and reducing costs.

Automatic Liquid Clustering simplifies data management by eliminating the need for manual tuning. Previously, data teams had to manually design the specific data layout for each of their tables. Now, Predictive Optimization harnesses the power of Unity Catalog to monitor and analyze your data and query patterns.

To enable Automatic Liquid Clustering, set CLUSTER BY AUTO on your Unity Catalog managed unpartitioned or Liquid-clustered tables.
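For example, here is a minimal sketch of both ways to turn it on (the catalog, schema, table, and column names are placeholders for illustration):

  -- Create a new Unity Catalog managed table with automatic clustering key selection
  CREATE TABLE main.sales.events (
    event_date DATE,
    customer_id STRING,
    amount DOUBLE
  )
  CLUSTER BY AUTO;

  -- Or enable it on an existing unpartitioned or Liquid-clustered table
  ALTER TABLE main.sales.events CLUSTER BY AUTO;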

Once enabled, Predictive Optimization analyzes how your tables are queried and intelligently selects the most effective clustering keys based on your workload. It then clusters the table automatically, ensuring data is organized for optimal query performance. Any engine reading from the Delta table benefits from these enhancements, leading to significantly faster queries. Additionally, as query patterns change, Predictive Optimization dynamically adjusts the clustering scheme, completely eliminating the need for manual tuning or data layout decisions when setting up your Delta tables.

During the Private Preview, dozens of customers tested Automatic Liquid Clustering and saw strong results. Many appreciated its simplicity and performance gains, with some already using it for their gold tables and planning to expand it across all Delta tables.

Preview customers like Healthrise have reported significant query performance improvements with Automatic Liquid Clustering:

“We have deployed Automatic Liquid Clustering to all our gold tables. Since then, our queries ran up to 10x faster. All our workloads have become much more efficient without any manual work needed in designing the data layout or running maintenance.”
— Li Zou, Principal Data Engineer, and Brian Allee, Director, Data Services | Technology & Analytics, Healthrise

Choosing the best data layout is a hard problem

Applying the best data layout to your tables significantly improves query performance and cost efficiency. Traditionally, with partitioning, customers have found it difficult to design the right partitioning strategy to avoid data skews and concurrency conflicts. To further enhance performance, customers might use ZORDER atop partitioning, but ZORDERing is both expensive and even more complicated to manage.
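For context, the traditional approach looks roughly like the sketch below (table and column names are placeholders): the partition column must be chosen up front, and ZORDER maintenance has to be scheduled and paid for separately.

  -- Static partitioning chosen at table creation time
  CREATE TABLE main.sales.orders_partitioned (
    order_date DATE,
    customer_id STRING,
    amount DOUBLE
  )
  PARTITIONED BY (order_date);

  -- Periodic, compute-intensive maintenance to co-locate data within each partition
  OPTIMIZE main.sales.orders_partitioned ZORDER BY (customer_id);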

Liquid Clustering significantly simplifies data layout-related decisions and provides the flexibility to redefine clustering keys without data rewrites. Customers choose clustering keys based purely on query access patterns, without having to worry about cardinality, key order, file size, potential data skew, concurrency, or future access pattern changes. We've worked with thousands of customers who benefited from better query performance with Liquid Clustering, and we now have 3,000+ active monthly customers writing 200+ PB of data to Liquid-clustered tables per month.
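With manual Liquid Clustering, the same table could look like the sketch below (again with placeholder names); note that the keys can be redefined later without rewriting data.

  -- Clustering keys chosen by hand, based purely on query access patterns
  CREATE TABLE main.sales.orders_clustered (
    order_date DATE,
    customer_id STRING,
    amount DOUBLE
  )
  CLUSTER BY (order_date, customer_id);

  -- Keys can be changed later without a full table rewrite
  ALTER TABLE main.sales.orders_clustered CLUSTER BY (customer_id);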

However, even with the advances in Liquid Clustering, you still have to choose the columns to cluster by based on how you query your table. Data teams need to figure out:

  • Which tables will benefit from Liquid Clustering?
  • What are the best clustering columns for this table?
  • What if my query patterns change as business needs evolve?

Moreover, within an organization, data engineers often have to work with multiple downstream consumers to understand how tables are being queried, while also keeping up with changing access patterns and evolving schemas. This challenge becomes exponentially more complex as your data volume scales with more analytics needs.

How Automatic Liquid Clustering evolves your Data Layout

With Automatic Liquid Clustering, Databricks takes care of all data layout-related decisions for you – from table creation, to clustering your data and evolving your data layout – enabling you to focus on extracting insights from your data.

Let’s see Automatic Liquid Clustering in action with an example table.

Consider a table example_tbl, which is frequently queried by date and customer ID. It contains data from Feb 5-6 and customer IDs A to F. Without any data layout configuration, the data is stored in insertion order, resulting in the following layout:

Suppose the customer runs SELECT * FROM example_tbl WHERE date = '2025-02-05' AND customer_id = 'B'. The query engine leverages Delta data skipping statistics (min/max values, null counts, and total records per file) to identify the relevant files to scan. Pruning unnecessary file reads is crucial, as it reduces the number of files scanned during query execution, directly improving query performance and lowering compute costs. The fewer files a query needs to read, the faster and more efficient it becomes.
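For concreteness, here is a minimal sketch of the hypothetical example_tbl and the query above (the column types and the extra amount column are assumptions):

  -- Hypothetical schema for example_tbl
  CREATE TABLE example_tbl (
    date DATE,
    customer_id STRING,
    amount DOUBLE
  );

  -- The example query: data skipping compares the predicates against each file's
  -- min/max statistics for date and customer_id to decide which files can be pruned
  SELECT *
  FROM example_tbl
  WHERE date = '2025-02-05'
    AND customer_id = 'B';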

In this case, the engine identifies 5 files for Feb 5, as half of the files have a min/max value for the date column matching that date. However, since data skipping statistics only provide min/max values, all 5 of these files have a customer_id range that could contain customer B. As a result, the query must scan all 5 files to extract entries for customer B, leading to a 50% file pruning rate (reading 5 out of 10 files).

As you see, the core issue is that customer B’s data is not colocated in a single file. This means that extracting all entries for customer B also requires reading a significant amount of entries for other customers.

Is there a way to improve file pruning and query performance here? Automatic Liquid Clustering can enhance both. Here’s how:

Behind the Scenes of Automatic Liquid Clustering: How It Works

Once enabled, Automatic Liquid Clustering continuously performs the following three steps:

  1. Collecting telemetry to determine if the table will benefit from introducing or evolving Liquid Clustering Keys.
  2. Modeling the workload to understand and identify eligible columns.
  3. Applying the column selection and evolving the clustering schemes based on cost-benefit analysis.

Predictive Optimization

Step 1: Telemetry Analysis

Predictive Optimization collects and analyzes query scan statistics, such as query predicates and JOIN filters, to determine if a table would benefit from Liquid Clustering.

In our example, Predictive Optimization detects that the columns ‘date’ and ‘customer_id’ are frequently queried.
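For instance, a workload like the sketch below would register scans with predicates on ‘date’ and JOIN filters on ‘customer_id’ (the dimension table and its columns are illustrative assumptions):

  -- Predicate on date registers a scan on the date column
  SELECT customer_id, SUM(amount) AS total_amount
  FROM example_tbl
  WHERE date BETWEEN '2025-02-05' AND '2025-02-06'
  GROUP BY customer_id;

  -- JOIN filter on customer_id registers a scan on that column as well
  SELECT e.date, e.amount, c.segment
  FROM example_tbl e
  JOIN customers c
    ON e.customer_id = c.customer_id;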

Step 2: Workload Modeling

Predictive Optimization evaluates the query workload and identifies the best clustering keys to maximize data skipping.

It learns from past query patterns and estimates the potential performance gains of different clustering schemes. By simulating past queries, it predicts how effectively each option would reduce the amount of data scanned.

In our example, using registered scans on ‘date’ and ‘customer_id’ and assuming consistent queries, Predictive Optimization calculates that:

  • Clustering by ‘date’ reads 5 files, a 50% pruning rate.
  • Clustering by ‘customer_id’ reads ~2 files (an estimate), an 80% pruning rate.
  • Clustering by both ‘date’ and ‘customer_id’ (see the data layout below) reads just 1 file, a 90% pruning rate.
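As a quick check of those estimates against the 10-file table from above: pruning rate = 1 − (files read ÷ total files), so reading 5 of 10 files gives 50%, ~2 of 10 gives ~80%, and 1 of 10 gives 90%.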

Step 3: Cost-benefit Optimization

The Databricks Platform ensures that any changes to clustering keys provide a clear performance benefit, as clustering can introduce additional overhead. Once new clustering key candidates are identified, Predictive Optimization evaluates whether the performance gains outweigh the costs. If the benefits are significant, it updates the clustering keys on Unity Catalog managed tables.

In our example, clustering by ‘date’ and ‘customer_id’ results in a 90% data pruning rate. Since these columns are frequently queried, the reduced compute costs and improved query performance justify the clustering overhead.
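If you want to confirm which clustering keys Predictive Optimization has chosen for a table, one way to check is the table detail output, which includes a clusteringColumns field for Delta tables (a quick sketch using the example table):

  -- Inspect the clustering keys currently applied to the table
  DESCRIBE DETAIL example_tbl;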

Preview customers have highlighted Predictive Optimization's cost-effectiveness, particularly its low overhead compared to manually designing data layouts. Companies like CFC Underwriting have reported lower total cost of ownership and significant efficiency gains.

“We really love Databricks' Automatic Liquid Clustering because it gives us peace of mind that we have the most optimized data layout out-of-the-box. It also saved us a lot of time by removing the need for having an engineer to maintain the data layout. Thanks to this capability, we have noticed that our compute costs have gone down even as we've scaled up our data volume.”
— Nikos Balanis, Head of Data Platform, CFC

The capability in a nutshell: Predictive Optimization chooses liquid clustering keys on your behalf, such that the predicted cost savings from data skipping outweigh the predicted cost of clustering.

Get Started Today

If you haven’t enabled Predictive Optimization yet, you can do so by selecting Enabled next to Predictive Optimization in the account console under Settings > Feature enablement.
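If you prefer finer-grained control, Predictive Optimization can also be enabled at the catalog or schema level with SQL, sketched below with placeholder names:

  -- Enable Predictive Optimization for everything in a catalog
  ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;

  -- Or for a single schema
  ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;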

New to Databricks? Since November 11, 2024, Predictive Optimization has been enabled by default on all new Databricks accounts, running optimizations for all your Unity Catalog managed tables.

Get started today by setting CLUSTER BY AUTO on your Unity Catalog managed tables. Databricks Runtime 15.4+ is required to CREATE new tables with CLUSTER BY AUTO or to ALTER existing Liquid-clustered or unpartitioned tables. In the near future, Automatic Liquid Clustering will be enabled by default for newly created Unity Catalog managed tables. Stay tuned for more details.
