Databricks SQL provides data warehousing capabilities and first-class support for SQL on the Databricks Lakehouse Platform, allowing analysts to discover and share new insights faster at a fraction of the cost of legacy cloud data warehouses.
This blog is part of a series on Databricks SQL that covers critical capabilities across performance, ease of use, and governance. In a previous blog post, we covered recent user experience enhancements. In this article, we’ll cover improvements that help our users understand queries and query performance.
Databricks SQL is great at automatically speeding up queries - in fact, we recently set a world record for it! Even with today's advancements, there are still times when you need to look under the hood at query execution (e.g., when a query is unexpectedly slow). That's why we're excited to introduce Query Profile, a new feature that provides execution details for a query, along with granular metrics showing where time and compute resources are being spent. The UI should be familiar to administrators who have worked with other databases.
Query Profile includes several key capabilities.
A common methodology for speeding up queries is to first identify the longest-running query operators. Because we are dealing with a distributed system in which operators execute in parallel, we care more about the total time spent across an operator's tasks than about its exact "wall clock time."
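To make the distinction concrete, here is a minimal sketch (with made-up operator names and task timings, not real Query Profile output) of why total task time, not wall-clock time, is the right metric for ranking operators in a parallel engine:

```python
# Hypothetical per-task durations in seconds, keyed by operator name.
# These numbers are illustrative only.
task_times = {
    "Scan lineitem": [12.0, 11.5, 12.3, 11.8],  # 4 parallel tasks
    "Scan orders":   [6.1, 5.9, 6.4],
    "Hash Join":     [3.2, 3.0, 3.1, 2.9],
}

# Rank operators by total time spent across all of their tasks. With full
# parallelism, wall-clock time is roughly the slowest task (max), which can
# make an expensive operator look cheap; the sum reveals the true cost.
ranked = sorted(task_times.items(), key=lambda kv: sum(kv[1]), reverse=True)
for op, times in ranked:
    print(f"{op}: total {sum(times):.1f}s across {len(times)} tasks "
          f"(wall clock ~ {max(times):.1f}s if fully parallel)")
```

Here the lineitem scan dominates total task time even though each of its individual tasks finishes in about 12 wall-clock seconds.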
From the Query Profile of a TPC-H query shown above, it's easy to identify the most expensive query operator: the scan of the lineitem table. The second most expensive operator is the scan of another table, orders.
Each query operator comes with a slew of statistics. In the case of a scan operator, metrics include the number of files and amount of data read, the time spent waiting for cloud storage, and the time spent reading files. As a result, it is easy to answer questions such as which table should be optimized, or whether a join could be improved.
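As a sketch of how such statistics translate into tuning decisions, the snippet below applies two simple heuristics to hypothetical scan metrics (the field names and thresholds are assumptions for illustration, not Query Profile's actual schema):

```python
# Hypothetical scan-operator metrics; values and field names are illustrative.
scan_metrics = {
    "files_read": 4000,
    "bytes_read": 2 * 1024**3,      # 2 GiB total
    "cloud_storage_wait_s": 48.0,   # time spent waiting for cloud storage
    "file_read_s": 12.0,            # time spent actually reading file contents
}

# Heuristic 1: many small files mean many round trips to cloud storage.
# Compacting the table (e.g. via OPTIMIZE on Databricks) is a likely win.
avg_file_mib = scan_metrics["bytes_read"] / scan_metrics["files_read"] / 1024**2
if avg_file_mib < 16:  # illustrative threshold
    print(f"avg file size {avg_file_mib:.2f} MiB -> consider compacting the table")

# Heuristic 2: if waiting on storage dwarfs actual read time, the scan is
# latency-bound rather than decode-bound.
if scan_metrics["cloud_storage_wait_s"] > scan_metrics["file_read_s"]:
    print("scan is dominated by cloud storage wait, not file reading")
```

The same style of reasoning applies to join operators: comparing the row counts flowing into and out of a join can suggest whether a different join order or a filter pushed down earlier would help.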
