Predictive Optimization Automatically Delivers Faster Queries and Lower TCO

Up to 20x improvements in query speed and 2x storage cost reduction with Unity Catalog Managed Tables

Predictive Optimization (PO) improves the performance of Unity Catalog managed tables by automatically optimizing data layouts, which speeds up queries and reduces storage costs. Since its General Availability, more than 2,400 customers have used PO to get optimized data layouts automatically, with no manual tuning. To date, PO has compacted about 14 PB of data and vacuumed more than 130 PB, showing that it can manage and optimize very large data volumes efficiently.

Explore how Predictive Optimization within the lakehouse architecture can effectively reduce your storage costs by 2x and enhance query performance by as much as 20x.

Predictive Optimization: the first data intelligence maintenance solution for the Lakehouse

Predictive Optimization in Databricks automates table management by leveraging Unity Catalog and the Data Intelligence Platform. This innovative feature currently runs the following optimizations for Unity Catalog managed tables:

  • Compaction – This enhances query performance by optimizing file sizes, ensuring that data retrieval is efficient.
  • Liquid Clustering – This technique incrementally clusters incoming data, enabling optimal data layout and efficient data skipping.
  • VACUUM – This operation helps reduce costs by deleting unneeded files from storage.
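For context, each of these operations corresponds to a Databricks SQL command that teams would otherwise run or schedule by hand. The sketch below only builds those statements as plain strings in Python. The helper name `manual_maintenance_statements` and the table and column names are hypothetical, and actually running the statements would require a Databricks session. It is meant to show the kind of work PO takes over, not how PO is implemented.

```python
from typing import List


def manual_maintenance_statements(table: str, cluster_cols: List[str]) -> List[str]:
    """Return the SQL a team might schedule by hand for one table.

    Hypothetical helper for illustration only; Predictive Optimization
    decides on its own whether and when to run equivalent operations.
    """
    cols = ", ".join(cluster_cols)
    return [
        # Liquid Clustering: declare the clustering keys on the table.
        f"ALTER TABLE {table} CLUSTER BY ({cols})",
        # Compaction: rewrite small files into better-sized ones.
        f"OPTIMIZE {table}",
        # VACUUM: delete unreferenced files older than the retention threshold.
        f"VACUUM {table}",
    ]


# Hypothetical table and column names.
for stmt in manual_maintenance_statements("main.sales.orders", ["order_date"]):
    print(stmt)
```

With PO enabled, none of these statements need to be scheduled manually for Unity Catalog managed tables.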

Previously, these optimization functions were limited to closed file formats in traditional data warehouses.  As the first managed solution to offer table maintenance for open table formats, Predictive Optimization eliminates the need for manual, repetitive table optimization tasks. Tailored specifically for the lakehouse architecture, PO allows data teams to prioritize deriving actionable insights from their data over the overhead of table optimization.

Our AI-driven performance enhancements analyze query patterns alongside data layout, table properties, and performance factors to determine the most impactful optimizations. Predictive Optimization carefully assesses each operation, only running those that deliver cost-effective benefits.  

Predictive Optimization Performance on Customer Workloads

Let’s look at a typical customer workload. After the customer ingests data into their tables, PO learns from the query patterns on that data and applies optimizations to those tables.

Read on to see the impact that Predictive Optimization has on these workloads. 

Faster Queries: 20X query latency reduction

Graph showing 20x improvement in query performance when Predictive Optimization is enabled

 

On the customer’s tables, selective queries ran 20x faster and large table scans improved by an average of 68%.

This performance boost comes from Predictive Optimization keeping the data in the most optimized file sizes while incrementally clustering new data. The customer’s tables are stored with Delta Lake Liquid Clustering, which provides an optimized data layout for better data skipping. Liquid Clustering is an innovative data management technique that is flexible and simplifies data layout-related decisions – you no longer have to fine-tune your data layout to achieve optimal query performance. 
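To see why a clustered layout helps selective queries, here is a toy simulation of data skipping using per-file min/max statistics. It is a simplified sketch, not Liquid Clustering itself: sorting stands in for clustering, and the data, file size, and function names are all made up for illustration.

```python
from typing import List, Tuple

FileStats = Tuple[int, int]  # (min, max) of the filter column in one file


def file_stats(values: List[int], rows_per_file: int) -> List[FileStats]:
    """Split values into files in the given order and record per-file min/max."""
    files = []
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files


def files_to_read(stats: List[FileStats], target: int) -> int:
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


# Toy data: 100 distinct customer IDs arriving in a scattered order.
arrival_order = [(i * 37) % 100 for i in range(100)]

# Unclustered: files are written in arrival order, so ranges overlap heavily.
unclustered = file_stats(arrival_order, rows_per_file=10)
# Clustered (approximated here by sorting): each file covers a narrow range.
clustered = file_stats(sorted(arrival_order), rows_per_file=10)

print("files read, unclustered:", files_to_read(unclustered, target=42))
print("files read, clustered:  ", files_to_read(clustered, target=42))
```

In the clustered layout, a lookup for a single ID touches only the one file whose range contains it, while the scattered layout forces the engine to open several files whose wide ranges happen to include the value.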

Lower Costs: 2X Storage Cost Reduction

Graph shows 2x improvement in storage costs when Predictive Optimization is enabled.

 

Predictive Optimization automatically reduced storage costs on the customer's tables by 2x, with no manual table maintenance required. For example, PO detects files that are no longer needed and garbage collects them, which lowers storage costs and improves storage efficiency.

Maximizing Value While Minimizing Total Cost of Ownership (TCO)

Graphic shows the lifecycle of Databricks Predictive optimization. Telemetry based on table data and query patterns is used in model evaluation to determine optimal performance, and those optimizations are carried out.

 

Enable Predictive Optimization today to lower your TCO. All this intelligence and optimization costs less than 5% of the ingestion cost.

Looking Ahead

We are continuously innovating with new capabilities to make Predictive Optimization better for your Unity Catalog managed tables. 

Predictive Optimization will add intelligent statistics collection and maintenance. With PO, statistics will be collected during supported write operations and kept up to date by automated ANALYZE tasks. For Delta statistics, PO will choose the 32 most useful columns to collect statistics for, rather than simply the first 32. Statistics are vital for generating optimal query plans and enabling file skipping.
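The difference between "first 32" and "best 32" can be illustrated with a small sketch. The ranking rule below (prefer the columns that queries filter on most often) is an assumption made purely for illustration. The post does not describe how PO actually scores columns, and the function names, schema, and counts are hypothetical. The example uses a limit of 2 instead of 32 to keep it readable.

```python
from typing import Dict, List


def first_n_columns(schema: List[str], n: int) -> List[str]:
    """Positional default: take the first n columns in schema order."""
    return schema[:n]


def most_filtered_columns(
    schema: List[str], filter_counts: Dict[str, int], n: int
) -> List[str]:
    """Hypothetical heuristic: prefer columns that queries filter on most often.

    Ties keep schema order because sorted() is stable.
    """
    ranked = sorted(schema, key=lambda col: -filter_counts.get(col, 0))
    return ranked[:n]


# Hypothetical schema and observed filter usage per column.
schema = ["id", "payload", "notes", "event_date", "region"]
filter_counts = {"event_date": 120, "region": 45, "id": 3}

print(first_n_columns(schema, 2))
print(most_filtered_columns(schema, filter_counts, 2))
```

Here the positional rule spends its statistics budget on `payload`, which no query filters on, while the usage-based rule picks `event_date` and `region`, the columns where file skipping can actually pay off.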

PO with intelligent statistics collection is in a gated Public Preview. To sign up, please fill out this form.

Get started today

If you already have an active Databricks account, get started today by selecting Enabled next to Predictive Optimization in the account console under Settings > Feature enablement.

Screenshot shows the line item in Settings > Feature enablement where you can enable Predictive Optimization

With a single click, Predictive Optimization's intelligence engine will begin making your data faster and more cost-effective. See the documentation for more information.
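Beyond the account-level switch, the Databricks documentation also describes SQL statements for enabling, disabling, or inheriting the setting on an individual catalog or schema. The sketch below builds such a statement as a string. The helper name `predictive_optimization_statement` and the catalog and schema names are hypothetical, and the exact syntax should be checked against the current documentation before use.

```python
def predictive_optimization_statement(
    level: str, name: str, action: str = "ENABLE"
) -> str:
    """Build an ALTER ... PREDICTIVE OPTIMIZATION statement (illustrative helper)."""
    level = level.upper()
    action = action.upper()
    if level not in {"CATALOG", "SCHEMA"}:
        raise ValueError("level must be CATALOG or SCHEMA")
    if action not in {"ENABLE", "DISABLE", "INHERIT"}:
        raise ValueError("action must be ENABLE, DISABLE, or INHERIT")
    return f"ALTER {level} {name} {action} PREDICTIVE OPTIMIZATION"


# Hypothetical catalog and schema names.
print(predictive_optimization_statement("catalog", "main"))
print(predictive_optimization_statement("schema", "main.sales", "inherit"))
```

In a Databricks notebook, the resulting string could be passed to `spark.sql(...)`. `INHERIT` makes the object follow the setting of its parent.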

New to Databricks? Since November 11th, 2024, Databricks has enabled Predictive Optimization by default on all new Databricks accounts, running optimizations for all your Unity Catalog managed tables. 

What does this all mean? Enable Predictive Optimization, and your queries will run faster and your total cost of ownership will drop, without you lifting a finger.

 
