We are excited to introduce the gated Public Preview of Predictive Optimization for statistics. Announced at the Data + AI Summit, Predictive Optimization is now generally available as an AI-driven approach to streamlining optimization processes. This feature currently supports essential data layout and cleanup tasks, and early feedback from users highlights its effectiveness in vastly simplifying routine data maintenance.
With the addition of automatic statistics management, Predictive Optimization delivers customer value and simplifies operation through the following advancements:
- Intelligent selection of data-skipping statistics, eliminating the need for column order management
- Automatic collection of query optimization statistics, removing the necessity to run ANALYZE after data loading
- Once collected, statistics inform query execution strategies, and on average drive better performance and lower costs
Impact of statistics
Utilizing up-to-date statistics significantly enhances performance and total cost of ownership (TCO). Comparative analysis of query execution with and without statistics revealed an average performance increase of 22% across observed workloads. Databricks applies these statistics to refine data scanning processes and select the most efficient query execution plan. This approach exemplifies the capabilities of the Data Intelligence Platform in delivering tangible value to users.
It is not surprising to see statistics impact query performance. Statistics are used to determine query plan optimizations and are complemented by Adaptive Query Execution (AQE) at runtime. For customers participating in the Gated Public Preview, we have observed a range of performance improvements attributed to the increase in the percentage of queries with optimized join strategies and the prevalence of bloom filters. Statistics give you the best opportunity to see performance improvements.
Current challenges
The data lakehouse utilizes two distinct types of statistics: data-skipping statistics (also known as Delta stats) and query optimizer statistics. Delta stats operate at the file level, facilitating data-skipping during scan operations, and are automatically generated for the first 32 columns by default. In contrast, query optimizer statistics are table-level metrics that aid in query planning and are only gathered after running the ANALYZE command.
The current approach to statistics collection raises several challenges for data engineering teams striving for optimal performance while minimizing costs:
- How to enhance data-skipping capabilities for wide and nested schemas?
- What strategies can be used for evolving query patterns in workloads?
- What is the optimal frequency for scheduling updates to query optimizer statistics via the ANALYZE command?
While data-skipping statistics are collected automatically, as data continues to grow and usage diversifies, determining when to run the ANALYZE command becomes complex. Customers have to deal with this operational burden by actively managing their query optimizer statistics maintenance. Furthermore, many customers neglect to run the ANALYZE command regularly, likely resulting in sub-optimal query execution plans.
Predictive Optimization for Statistics
When Predictive Optimization is enabled, statistics are managed in two distinct phases. Initially, statistics are gathered for all new data processed through Photon-enabled compute (enabled by default with Databricks SQL and Serverless products). This is a more efficient and cost-effective approach to statistics collection since the data is accessed only once, unlike the conventional method of executing ANALYZE post-ingestion. Subsequently, as statistics degrade due to UPDATE and DELETE operations, Predictive Optimization triggers ANALYZE in the background, ensuring that the statistics remain current and reliable.
Smart Delta stats collection
Recent advancements in Predictive Optimization for statistics have significantly enhanced the process of collecting data-skipping statistics. Currently, there are two primary methods for gathering Delta stats: the default approach, which traditionally depends on the first 32 columns, and the option to manually specify columns.
With this gated public preview, Databricks no longer adheres to the previous 32-column constraint. Instead, it employs data clustering and usage patterns to intelligently identify the most pertinent columns for Delta stats computation.
It's important to note that if a customer has manually specified columns for Delta stats collection, those preferences will take precedence over the new default criteria established in the latest update.
Query optimizer statistics out-of-the-box
With Photon, query optimizer statistics are now automatically gathered during write operations. This means that for both newly created tables and those with existing statistics, the ANALYZE command is no longer required after data ingestion. The latest statistics become available immediately upon the completion of data loading.
Intelligent back-fill
Many existing tables lack query optimizer statistics. Predictive Optimization identifies tables with outdated or no statistics and determines when (and if) to update. This process ensures that statistics are only refreshed for tables where they provide tangible value, thus balancing performance enhancement with cost efficiency.
How Predictive Optimization for statistics works
Predictive Optimization enhances the performance and efficiency of lakehouse architecture. The process is simple. Statistics are collected during writes, so you don’t have to run ANALYZE after loading data. Delta statistics are collected based on usage factors. Predictive Optimization schedules optimizations based on their usage, data layout, and statistics staleness. All of these are easy to monitor and understand with system tables.
Sign up for the Gated Public Preview
Use this form to sign up for the Gated Public Preview of Predictive Optimization for statistics.
For the latest on supported regions for Predictive Optimization by cloud, refer to these docs: AWS | Azure | GCP.