Radical Speed for SQL Queries on Databricks: Photon Under the Hood

May 26, 2021 04:25 PM (PT)

Download Slides

Join this session to hear the Photon product and engineering team talk about the latest developments with the project.

As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.

Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.

Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.

In this session watch:
Greg Rahn, Staff Product Manager, Databricks
Alex Behm, Technical Lead, Databricks

 

Transcript

Greg Rahn: Hello, and welcome to today’s talk, Radical Speed for SQL Queries on Databricks: Photon Under the Hood. My name is Greg Rahn, and I’m a Staff Product Manager at Databricks. And later, I’ll be joined by Alex Behm, who’s the Tech Lead for the Photon project at Databricks.
In today’s session we’ll introduce you to Photon, discuss the recent developments of Photon, show you what’s coming next in the project, and then we’ll summarize it all.
So let’s start with that quick introduction to Photon. But first, let me share with you some interesting workload trends that were observed across a large number of customers at Databricks.
First, businesses are moving fast, but generally the data is moving faster. This generally manifests itself in teams spending less time properly modeling data, which frequently leads to worse performance. Two specific examples include the lack of NOT NULL constraints being defined on columns, as well as lazy data type selection. It’s very easy to just choose string for everything, but that really eliminates the benefit of more specific and optimized types.
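As a concrete illustration of that modeling point, here is a small, hypothetical example using Spark SQL DDL on Delta (the table and column names are made up): declaring real types and NOT NULL constraints gives the engine and optimizer far more to work with than an all-string schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit types plus NOT NULL constraints on the columns that must always be present.
spark.sql("""
  CREATE TABLE trips_typed (
    vendor_id     INT       NOT NULL,
    pickup_ts     TIMESTAMP NOT NULL,
    trip_distance DOUBLE,
    total_amount  DECIMAL(10,2)
  ) USING DELTA
""")

# The all-string anti-pattern for comparison: every value must be cast at query time.
spark.sql("""
  CREATE TABLE trips_stringly (
    vendor_id STRING, pickup_ts STRING, trip_distance STRING, total_amount STRING
  ) USING DELTA
""")
```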
Another trend that was observed was the emergence of a data life cycle pattern, which I would summarize as follows. First, landing raw data in what we will call the Bronze zone. Then, cleaning and augmenting that data and moving it into the Silver zone, or Conform zone. And then lastly, publishing data such as reporting tables and aggregates to the Gold zone, or Model Data zone.
This brought forth a question to us, “Could we have both agility and performance on the data lake?” At Databricks, we believed that was possible, but it would take something new.
All the while these trends were happening with Databricks workloads, we also had a very common request: SQL as a first-class citizen on Databricks. That’s right. One of the biggest asks from Databricks customers was to strengthen the SQL support.
While Databricks often started out in an organization for use cases such as ETL, data preparation, and machine learning, it was clear that the convenience of having all the data in one place was extremely attractive to SQL personas. However, those SQL personas not only wanted to have SQL as a first-class citizen, meaning the language aspects of SQL; they were also used to having their queries return quickly. It was out of this requirement that the ideas behind Photon were born. Rich and fast SQL, natively, right on the data lake.
And just to further clarify what SQL as a first-class citizen means to us at Databricks, it means both the grammar and the behavior of SQL. This is why we have a big effort behind bringing Spark SQL up to ANSI SQL standards compliance. It also means performance, and this specifically is where Photon fits in. And lastly, a native workspace that’s familiar to SQL analyst personas, built right into the platform. So with that, let’s take a look at Photon.
Photon is a new, 100% Apache Spark compatible query engine designed for speed and flexibility. It’s built from the ground up to deliver the fastest performance on modern cloud hardware for all data use cases, be it data engineering, data science, data analytics, or machine learning.
Now let’s double click on the high-level details of Photon. So here’s what it takes to build the next generation query engine. Photon was architected for the fastest performance on real-world data applications, not just popular benchmarks or synthetic workloads. It’s written in C++ to deliver the highest levels of performance. It has its own custom-built memory manager to avoid JVM bottlenecks. And it implements the latest data processing techniques, including vectorization, to provide parallelism at the data, instruction, and memory levels, pushing the limits of performance that we can extract from modern cloud hardware.
Photon is completely compatible with Spark SQL, and the Spark DataFrame APIs, which means it works with your existing code. So there’s absolutely no vendor lock-in. It’s also completely transparent to end users, so no need to involve or invoke anything new, it just works right out of the box.
We are looking to use Photon to help optimize all data use cases and workloads. Today, it supports SQL and DataFrame workloads, and we’re looking to add Streaming and Data Science workloads and use cases as well.
Now you might be wondering, “Why go off and build a new query engine?” Let me show you why.
Here is a normalized performance chart of every Databricks Runtime version, going back to 2.1 in 2016. You can clearly see that performance has continued to increase over time, but in relatively small increments. Note that DBR version 8.0, from 2021, is roughly three times faster than version 2.1 from 2016. But it took five years to get that 3X performance.
Now, enter Photon. Right out of the gate, Photon is two times faster than DBR version 8.0, and thus six times faster than version 2.1. And we are just getting started. That is why I am so excited about Photon. Now, hopefully I got you excited about the performance of Photon.
So let’s dig in to how Photon fits in with the Databricks Lakehouse Platform. The purpose of this slide is to clarify where Photon fits in, and in particular, which parts are not affected by Photon. So let’s start with looking at the life cycle of a query.
First, a client submits a query to the Spark Driver. That query is parsed, analyzed, planned, and optimized as usual using Catalyst.
Then, we do a pass over the physical plan to determine which parts can run in Photon. We may make minor modifications to the plan for Photon, for example, changing a sort-merge join to a hash join. But overall, the structure of the plan, including the join order, will remain the same.
The query plan is then broken up into atomic units of distributed execution, called Tasks. And each of those Tasks is run in threads on worker nodes, which then operate on a specific partition of the data. It’s at this level that the Photon engine does its work. You can think of it as replacing Spark’s Whole-Stage CodeGen with a native engine implementation. The Photon library is loaded into the JVM, and Spark and Photon communicate via JNI. So at a high level, that’s how Photon seamlessly integrates into the Spark Runtime at Databricks.
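To make that plan pass a little more concrete, here is a conceptual Python sketch, not Databricks’ implementation: the PlanNode and PhotonExec classes and the SUPPORTED set are hypothetical stand-ins. The idea is simply to walk the plan bottom-up and wrap maximal subtrees of supported operators in a node that hands them to the native engine, while anything unsupported is left to run on Spark as usual.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    name: str
    children: List["PlanNode"] = field(default_factory=list)

class PhotonExec(PlanNode):
    """Hypothetical wrapper node: its child subtree executes natively, called via JNI."""
    def __init__(self, child: PlanNode):
        super().__init__(name="PhotonExec", children=[child])

# Hypothetical set of operators the native engine understands.
SUPPORTED = {"FileScan", "Filter", "Project", "HashAggregate", "ShuffledHashJoin"}

def photonize(plan: PlanNode) -> PlanNode:
    # Rewrite children first so supported subtrees can be fused into one wrapper.
    plan.children = [photonize(c) for c in plan.children]
    if plan.name in SUPPORTED and all(isinstance(c, PhotonExec) for c in plan.children):
        # Unwrap the children so a single PhotonExec covers the whole supported subtree.
        plan.children = [c.children[0] for c in plan.children]
        return PhotonExec(plan)
    return plan

# Example: a scan-filter-aggregate pipeline is wrapped as one native subtree; an
# unsupported operator in the middle would break the chain and fall back to Spark.
plan = PlanNode("HashAggregate", [PlanNode("Filter", [PlanNode("FileScan")])])
print(photonize(plan).name)   # PhotonExec
```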
So now let’s touch on the key characteristics of Photon. Here are the key points that users should be aware of when running Photon. First, it’s completely transparent to users and fully integrated directly into Spark. Your code will just work. However, it works faster with Photon. In the case of operations that are not yet supported in Photon, it simply falls back to the Spark Runtime. Photon is integrated with Spark’s memory manager for coordinated spilling in mixed plans. Both Spark and Photon are configured to use off-heap memory, and coordinate with each other under memory pressure.
Photon uses native code. Why native code? High-performance Java applications generally use off-heap memory and Unsafe operations. And if that’s the case, why not use C++, which avoids JVM garbage collection and JIT issues altogether? Photon makes small changes to the plan, like I mentioned, converting a sort-merge join to a hash join.
And today, Photon only supports the hash join, because in general, it’s better for performance. Unlike Whole-Stage CodeGen, where the plan operators can get fused or combined together, Photon preserves operator boundaries, and can therefore provide very rich metrics on a per-operator basis, making it much easier to understand resource consumption and bottlenecks. So that covers the key aspects of Photon.
And now I’m going to turn it over to Alex, who’s the Tech Lead for Photon, and who’ll discuss more Photon goodness with you. So Alex, over to you.

Alex Behm: Thanks Greg. Now that we’ve shared what Photon is, the motivation behind it, and the value it brings to the Databricks platform, let’s dive into the recent developments in Photon. Our development has focused on three areas: production readiness, query coverage, and performance.
We’ve been working hard to make Photon more resilient, by spilling intermediate state under memory pressure, as well as using different testing strategies to iron out bugs, and of course, observing and improving how Photon behaves on real workloads.
As you know, Photon has been developed from scratch and therefore does not yet have native support for all SQL operations. The most common operators, data types, and built-in functions are implemented, and we’re continuing to expand coverage so more queries can enjoy Photon’s speed.
Performance is always top of mind and fun to work on, but there is such a thing as over-optimization. So, we try not to get too carried away, and instead prioritize common usage patterns.
Before we go deeper, a word of warning: in the following slides, I’ll share several performance numbers focused on specific operations in controlled settings. These microbenchmarks do not necessarily reflect real-world, end-to-end performance. For example, for measuring the performance of built-in functions, we try to eliminate Scan and I/O time during benchmarking.
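A minimal sketch of that kind of setup, written against the public PySpark API, might look like the following. The table and column names are hypothetical; the point is that the input is cached and materialized first so scan and I/O time is largely excluded, and the result is written to the "noop" sink so mostly expression evaluation is measured.

```python
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("events").select("event_ts")   # hypothetical table and column
df.cache().count()                              # materialize the input in memory first

start = time.time()
# Evaluate the built-in function over the cached input and discard the rows via the noop sink.
df.select(F.date_trunc("month", "event_ts")).write.format("noop").mode("overwrite").save()
print(f"date_trunc over cached input took {time.time() - start:.2f}s")
```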
We’ve recently achieved an important milestone for Photon’s resilience. We’ve completed spilling support for all important operators, including Shuffle, Hash Aggregation, and Hash Join. That means Photon can handle very large inputs by spilling intermediate state to external storage when under memory pressure. We’ve benchmarked speedups in the range of 2-5X over the Databricks Runtime without Photon.
To give you a feeling for how spilling works, and how it’s different than what you might know from Databricks today, I’d like to walk you through the spilling algorithm for hash join.
A Hash join can compute the result of joins with suitable equality conditions. It works by building a hash index on one join input, and then probing that index with the other join input.
We’ll discuss how the build and probe phases are modified to support spilling. In the build phase, we fix a number of partitions upfront and assign each build row to one of those partitions; the bucket structure of the hash index points to entries in those partitions.
The idea is that under memory pressure, we can free memory one partition at a time, to degrade more gracefully than spilling everything immediately. During the build phase, build rows that belong to spilled partitions are immediately spilled to disk as well. We keep going until the build input has been consumed. We’ve now built a partially in-memory, and partially spilled, hash index.
We now move on to the probe phase, where we consume the rows of the other join input. Probe rows that match a memory-resident partition are emitted as results. Probe rows that match a spilled partition are also spilled. So for each spilled partition, we have a set of build rows and a set of probe rows.
After we’ve consumed the probe input, we can throw away the in-memory partitions, because we’ve already joined and emitted all probe rows that can match. So next, we pick one spilled partition, and repeat the join process using the spilled build and probe rows as input to the join. That means we read the spilled build rows, partition them, and spill partitions as necessary, just as before. We apply this scheme recursively until all spilled partitions have been processed.
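To make the algorithm concrete, here is a deliberately simplified Python sketch of a partitioned, spillable hash join. It is illustrative only, not Photon’s C++ implementation: the partition count, the notion of a memory budget, and “spilling” to in-process lists all stand in for what a real engine does with memory reservations and disk files, and a real engine would also re-partition with a different hash function at each recursion level.

```python
from collections import defaultdict

NUM_PARTITIONS = 8

def part(key):
    return hash(key) % NUM_PARTITIONS

def hash_join_with_spill(build_rows, probe_rows, max_in_memory_partitions=4):
    """Equi-join (key, payload) build rows with (key, payload) probe rows on key."""
    results = []

    # ---- Build phase: partition the build side, spilling whole partitions under pressure.
    in_mem = defaultdict(list)          # partition id -> build rows kept in memory
    spilled_build = defaultdict(list)   # partition id -> build rows "spilled" (stand-in for disk)
    spilled_parts = set()

    for key, payload in build_rows:
        p = part(key)
        if p in spilled_parts:
            spilled_build[p].append((key, payload))   # rows of spilled partitions go straight out
            continue
        in_mem[p].append((key, payload))
        if len(in_mem) > max_in_memory_partitions:    # simulated memory pressure
            victim = max(in_mem, key=lambda q: len(in_mem[q]))  # free the largest partition
            spilled_build[victim].extend(in_mem.pop(victim))
            spilled_parts.add(victim)

    # Hash index over the partitions that stayed in memory: partition -> key -> build payloads.
    index = {p: defaultdict(list) for p in in_mem}
    for p, rows in in_mem.items():
        for key, payload in rows:
            index[p][key].append(payload)

    # ---- Probe phase: join against in-memory partitions, spill probe rows for the rest.
    spilled_probe = defaultdict(list)
    for key, payload in probe_rows:
        p = part(key)
        if p in spilled_parts:
            spilled_probe[p].append((key, payload))
        else:
            for build_payload in index.get(p, {}).get(key, []):
                results.append((key, build_payload, payload))

    # ---- Recursion: each spilled partition is joined from its spilled build and probe rows.
    for p in spilled_parts:
        results.extend(hash_join_with_spill(spilled_build[p], spilled_probe[p],
                                            max_in_memory_partitions))
    return results
```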
To put this into perspective, let’s compare the behavior of Photon, and Databricks without Photon. For large joins, Spark uses a sort-merge join, and Photon prefers hash join.
With sort-merge join, we need to buffer both join inputs and sort them. The sorts run in parallel and can therefore increase memory pressure. To avoid spilling entirely, both inputs need to fit in memory. And if we do have to spill, the sort algorithm requires spilling its entire input.
Contrast that with what we just discussed for hash join. First, we only need to buffer one input, hopefully the smaller one, into a hash table. And if we do have to spill, we can do so one partition at a time, to degrade gracefully in scenarios where most, but not all, of the build data fits in memory. Lastly, we use tricks such as reversing the build and probe sides when recursively processing spilled partitions, to speed up the join.
In the same vein of hardening, I’d like to give you a glimpse into our testing toolkit that has helped us tremendously. We have several goals with testing. Obviously we want to find bugs, but we also want to iron out the subtle behavioral differences between Spark and Photon, to ensure full compatibility. We also test extremes to discover system limitations that we can guard against.
Apart from the usual unit and integration tests, we found fuzz testing to be very effective. We’ve built tooling to generate random data and throw random queries at the system, as well as code instrumentation to randomly trigger error paths, to ensure graceful recovery from exceptional situations.
We can also simulate memory pressure by randomly asking operators to spill. We also use the Clang/LLVM toolchain, which has some neat tools like the address and undefined behavior sanitizers, and those can even be combined with the fuzzing methods.
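A toy sketch of the fuzzing idea, built only on public PySpark APIs, might look like the following. Everything here (the schema, the predicate generator, the two execution paths being compared) is hypothetical and far simpler than a real fuzzer; it just illustrates generating random data and random predicates, then checking that two ways of running the same query agree.

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def random_rows(n=1000):
    return [(random.randint(-10**6, 10**6), random.choice([None, "a", "bb", "ccc"]))
            for _ in range(n)]

def random_predicate():
    col = random.choice(["id", "length(s)"])
    op = random.choice(["<", "<=", "=", ">", ">="])
    return f"{col} {op} {random.randint(-100, 100)}"

df = spark.createDataFrame(random_rows(), ["id", "s"])
df.createOrReplaceTempView("t")

for _ in range(100):
    pred = random_predicate()
    sql_count = spark.sql(f"SELECT count(*) FROM t WHERE {pred}").first()[0]
    df_count = df.filter(pred).count()   # same predicate through the DataFrame API
    assert sql_count == df_count, f"mismatch for {pred}: {sql_count} vs {df_count}"
```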
Next I’ll summarize what we’ve been working on to increase feature coverage. Today, Photon supports the most common data types, operators, and built-in functions. For example, we can run filter, project, shuffle, join, aggregate, and union on scalar data like integers, strings, decimals, et cetera.
In terms of built-in functions, you can expect the basics like comparisons, arithmetic, casts, and conditionals, as well as common date and string functions, to be supported. We’re actively working on adding support for the array and map data types. And we plan to look into sorting and window functions, as well as UDFs, in the not too distant future. Of course, there’s a long tail of built-in functions that we will continue to chip away at.
In the past two quarters, we have doubled down on our efforts to support more date and timestamp functions, because, unsurprisingly, they occur very frequently in real workloads. I’m happy to report that as of now, we’ve achieved 95% coverage, and hopefully we will get to 100% soon.
If you aren’t yet as excited as I am, here are some microbenchmarks. This chart shows microbenchmarks for several of the common date functions, comparing the Databricks Runtime with and without Photon. The bars indicate how many times Photon was faster than Databricks without Photon. We see the speedups can be quite substantial. For example, there’s one outlier with a whopping 37 times speedup, and a good number of functions are at least 5-10 times faster. Among them are the usual suspects, like functions for extracting parts of a date, such as month and day, and other common functions like date_trunc. These speedups are achieved by careful engineering and optimizing common cases like the UTC timezone, which is the default time zone.
Another major area of development has been adding support for nested types, namely Struct, Array, and Map, which we’ve found to be quite popular among our customer base. The state today is that Struct is fully supported, including reads and writes. And Array and Map are in active development. Basics like projecting arrays and maps, as well as common functions, are working in the lab. And we’ll continue with a long list of built-in functions on those types.
Let’s check out the numbers on this feature as well. As before, this chart shows the Photon speedup achieved for different built-in functions, this time focused on nested types. Among the most common functions is GetStructField, where Photon is over 40 times faster than non-Photon. I’m also happy to see the convenient StringSplit function among those with the biggest improvements. Some of these operations are more natural to implement in a vectorized engine like Photon, but there’s also careful engineering and optimizing of common usage patterns behind such numbers.
So far, we’ve mostly focused on read operations, but another area where Photon shines is write operations. Specifically, Photon can accelerate writing to Delta and Parquet tables. It makes sense: Delta uses a columnar data format and Photon is a columnar query engine, so they are a good fit. And the typical 2-4X speedups we’ve seen certainly agree. The relative speedup even tends to increase for wider tables.
Following that thought of writing data, we’ve also added Photon support for delete, update, and merge commands. Not only do they benefit from the writing capability, but the core operations, like joins and aggregations, preceding the writes are also accelerated. Typical speedups we’ve seen are in the range of 2-3X, end-to-end.
Last but not least, another area of active development in both the Spark community, as well as Photon, is moving towards standard conforming behavior of SQL functions, like failing queries on overflow. Our goal is to provide familiar semantics for SQL users.
So that concludes the different coverage areas of Photon as they exist today. And here’s a summary of what we’re working on today, and in the near future.
As you might expect with an emerging project, we’ll be adding more coverage with each and every release. So be sure to check back often.
With that, I’m going to turn it back over to Greg.

Greg Rahn: Thanks Alex. It’s amazing to see how much coverage the Photon team has already built. Given all that great coverage, I can imagine, the next thing that is on everyone’s mind is, “How can I try out Photon today?” So let’s dive right into that.
In June, Photon will be available on Databricks, in two areas of the product. First, in Databricks SQL. This is the easiest way to try out Photon using SQL. You can just provision a Databricks SQL endpoint, and use the SQL native Redash interface, or connect Tableau, or Power BI, or even your favorite BYO Tool via ODBC or JDBC. Databricks SQL is available on both AWS and Azure.
And as mentioned earlier, certain window operations are not yet supported in Photon, so just be mindful of that.
The other way to preview Photon is in the workspace, for SQL Data Engineering or ETL workloads. This can be done by provisioning a cluster with Photon enabled. There’s just a little slider, or checkbox, in the UI to toggle it. And I’ll show you a screen capture of that on the next slide. And then you can just run your SQL or DataFrame code, right in a Notebook or from a JAR file. Photon in the workspace will be available on AWS to begin with, and then Azure and GCP will be added in the future.
Two key areas to be aware of that are not yet supported in Photon for data engineering workloads are custom UDFs and Streaming operations.
Now, let me show you the Photon toggle for clusters in the workspace.
In the Create Cluster UI, you can see that there’s just a little toggle button here, under the Databricks Runtime version. When you toggle that ON, the Runtime versions that support Photon will be listed, and then you can just select the appropriate one.
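For teams that script cluster creation instead of using the UI, a minimal sketch against the Databricks Clusters REST API could look like the following. The workspace URL, token, node type, and especially the Photon runtime version string are placeholders; check your workspace’s list of available runtime versions for the exact value to use.

```python
import requests

WORKSPACE = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                              # placeholder

payload = {
    "cluster_name": "photon-demo",
    "spark_version": "8.3.x-photon-scala2.12",   # hypothetical Photon runtime version string
    "node_type_id": "i3.xlarge",                 # example AWS node type
    "num_workers": 2,
}

resp = requests.post(
    f"{WORKSPACE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```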
Now you might be wondering, if you give Photon a try via one of these ways, “How do I tell if my query or job is actually using Photon?” So let me show you how.
Let’s start with a basic query that simply calculates the total and average trip distance by vendor. On the left, you can see the text representation of the query plan in Spark. Take good note of the operator names. On the right, you can see the plan for the same query from Photon. Now a keen eye will notice two things. There are many more operators in the Photon plan, and many of them actually start with the word Photon. This comes back to the point I mentioned earlier about the difference between Spark combining or fusing operations in Whole-Stage CodeGen, and Photon keeping clean and discrete boundaries between each operation. And again, the reason for this is the detailed, per-operator metrics for Photon.
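A quick, minimal way to look at this yourself from a notebook is the standard PySpark explain API. The table and column names below are hypothetical; the point is simply that operators executed natively show up in the printed plan with a Photon prefix, while anything that fell back to Spark keeps its usual operator name.

```python
# Assumes a notebook where `spark` is already defined.
df = (spark.table("trips")                      # hypothetical table
          .groupBy("vendor_id")
          .agg({"trip_distance": "avg"}))

df.explain(mode="formatted")   # scan the printed plan for operators starting with "Photon"
```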
For those of you who are more visually inclined, like myself, you could take a look at the execution DAG in the Spark UI. Like the text plan, you’ll note that Photon operations start with the word Photon. But here, they are also color-coded to be different from Spark operations. Yellow denotes Photon operations, and blue denotes Spark. When you click and expand them, you can see the rich metrics that we’ve been talking about. And this, again, makes it easier to understand where performance issues or challenges might be, because you have access to those detailed metrics.
For those of you who are more of the SQL persona, and more likely to use the Databricks SQL UI, you can find a slightly different representation of a plan in the query history section. So, while in the Databricks SQL UI, on the left-hand navbar, click History. Go and find your query in the history list. In this example, it’s at the top. When you click on it, this Query Details box on the right-hand side will show up. Then click on Execution Details, and at the bottom, you’ll see the Task time in Photon.
Just to zoom in, here are the task details for this query. Take note that 98% of the Task time for this query was spent in Photon, which means that, basically, Photon had amazing coverage for this query and the majority of the execution was done within Photon. Here, the higher the number in Photon, the better.
Now that I’ve shown you how you can try out Photon, and how to read the query plans, my guess is the next thing on your mind is, “Just how fast is Photon?” So let’s take a look at that.
I wanted to share one customer’s journey with you. This customer wanted to use Databricks to support a large number of their SQL analysts. So we worked with them to do concurrency testing against a Databricks SQL endpoint. Almost a year ago, they started their journey, and did some testing with the concurrency workload. And at that time, the average query response time was just under eight seconds. The Photon and Databricks engineering teams worked hard to improve this number. And by the end of 2020, the customer reran their concurrency workload tests, and saw a 21% reduction in average query response time. They were extremely happy to see this progress. But in reality, they had slightly higher expectations, as did we at Databricks.
So just this month, we gave them a new build to try out, the latest, greatest Photon bits, for their workload. They saw an additional 29% reduction in average query response time with those bits. This meant that from the original timing back in June of 2020, there was a 44% reduction.
Needless to say, they are huge fans of Photon, as now they’re able to roll out a service to all their SQL and BI users, directly on Databricks, without having to move any data or rely on any other system.
So it wouldn’t be an interesting performance talk if I failed to show you the obligatory TPC-DS performance chart. So here it is. In this chart, I’ve plotted the query speedups that Photon brings to the 99 DS queries at the 3TB scale factor. As you can see, there is a range from mild speedups to just over 8X on the high end. But most of them fall within the 2-4X range. This is typically what we see across most workloads. The average query speedup for this DS workload is 2.5 times.
But perhaps more interesting than that is the power test speedup: the time it takes to run all of these queries end to end is 3.7 times faster. That means that by using Photon to run this workload, the cloud infrastructure cost would be just 27% of what it would be without Photon, since 1 divided by 3.7 is roughly 0.27.
This really speaks to the speed and efficiency that Photon brings to the table. And again, we’re just getting started here. As they say, you should always end on a high note. So with that, let me summarize today’s Photon talk.
In this talk, we introduced you to Photon and the motivation for Databricks building it. We discussed the recent developments, as well as what’s coming next in Photon. We highlighted some performance results, from both a customer perspective and our lab results. And we also showed you how you can take some of the details of Photon and dig deeper, by looking at query execution plans and understanding the low-level Photon operations.
So lastly, here are some related talks where you can learn more about Photon on Databricks.
And of course, come back in June to databricks.com/try, where you’ll be able to test run Photon for yourself, and experience the amazing performance it can bring for SQL workloads.
And with that, I would like to thank you for attending this talk. And now we’ll take some questions from the audience. Thanks again.

Greg Rahn

For over 20 years, Greg has worked with relational database systems across a variety of roles - including software engineering, database administration, database performance engineering, and product m...

Alex Behm


Alex has been building databases for over a decade in academia and industry, and maintains a passion for speed and quality. He is the tech lead for Photon, a new vectorized engine written from scra...