A user-defined function (UDF) is a means for a user to extend the native capabilities of Apache Spark™ SQL. SQL on Databricks has supported external user-defined functions written in Scala, Java, Python and R since Spark 1.3.0. While external UDFs are very powerful, they also come with a few caveats: their bodies are opaque to the query optimizer, which limits performance, and they require SQL users to step outside of SQL to author them.
To cope with the above limitations, we are thrilled to introduce a new form of UDF: SQL UDFs. Available in DBR 9.1 LTS, a SQL UDF is defined entirely with the expressive power of SQL and is also completely transparent to the SQL compiler.
SQL UDFs are simple yet powerful extensions to SQL on Databricks. As functions, they provide a layer of abstraction that simplifies query construction, making SQL queries more readable and modular. Unlike UDFs written in a non-SQL language, SQL UDFs are more lightweight for SQL users to create. SQL function bodies are transparent to the query optimizer, making them more performant than external UDFs. SQL UDFs can be created as either temporary or permanent functions, reused across multiple queries, sessions and users, and access-controlled via Access Control Language (ACL). In this blog, we will walk you through some key use cases of SQL UDFs with examples.
Let’s start with the simplest function imaginable: a constant. We all know we’re not supposed to use literals in our code because it harms readability and, who knows, maybe the constant doesn’t remain constant after all. So we want to be able to change it in one place only:
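A minimal sketch of such a definition might look like the following (the function name blue and its return value are illustrative):

```sql
-- Define a zero-argument SQL UDF that returns a constant color code.
CREATE FUNCTION blue()
  RETURNS STRING
  COMMENT 'Blue color code'
  LANGUAGE SQL
  RETURN '0000FF';
```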
If you are familiar with external UDFs, you can see there are some differences that stand out:
Beyond these differences, there are many other things that are the same as for external UDFs:
Let’s use the function:
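For example, assuming the blue() sketch above:

```sql
-- Invoke the SQL UDF like any built-in function.
SELECT blue() AS color_code;
```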
Unsurprisingly this works. But what is happening under the hood?
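One way to find out is to ask Spark for the query plan (output omitted here):

```sql
-- EXPLAIN shows how the optimizer treats the function invocation.
EXPLAIN SELECT blue() AS color_code;
```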
This is neat! The SQL compiler replaced the function invocation with the constant itself.
That means at least this SQL UDF comes at zero cost to performance.
Now, let’s have a look at another common usage pattern.
Imagine you don’t like the naming of some built-in functions. Maybe you are migrating lots of queries from another product, which has different function names and behaviors. Or perhaps you just can’t stand copy-pasting some lengthy expressions over and over again in your SQL queries. So, you want to fix that.
With a SQL UDF, we can simply create a new function with the name we like:
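A sketch of such a function, assuming we want a to_hex helper that clamps its input to the 0-255 range (the name and exact expression are illustrative):

```sql
-- Wrap a lengthy expression behind a friendlier name.
CREATE FUNCTION to_hex(x INT COMMENT 'Any number between 0 and 255')
  RETURNS STRING
  COMMENT 'Converts a decimal value to a two-digit hexadecimal string'
  RETURN lpad(hex(least(greatest(0, x), 255)), 2, '0');
```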
Let’s have a look at what new syntax was used here:
Not only does it work …
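For instance, applied to a tiny generated table:

```sql
-- Illustrative usage of the to_hex sketch above.
SELECT to_hex(id) AS hex_value FROM range(2);
```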
... but it works well:
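Again, the query plan tells the story (output omitted):

```sql
-- Inspect the plan of the same query.
EXPLAIN SELECT to_hex(id) FROM range(2);
```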
We can see that the physical plan shows a straight application of the functions lpad, hex, least and greatest. This is the same plan you get invoking the series of functions directly.
You can also compose SQL functions out of SQL functions:
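For example, a hypothetical hex_color function could reuse the to_hex sketch above to assemble a full RGB hex string:

```sql
-- Compose SQL UDFs: one SQL function calling another.
CREATE FUNCTION hex_color(r INT, g INT, b INT)
  RETURNS STRING
  COMMENT 'Combines three color components into a single RGB hex code'
  RETURN concat(to_hex(r), to_hex(g), to_hex(b));
```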
Another common usage of SQL UDFs is to codify lookups. A simple lookup may be to decode RGB color codes into English color names:
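A minimal sketch, with just two colors for now (the from_rgb name is used throughout the rest of this post; the mappings are illustrative):

```sql
-- A hard-coded lookup expressed as a CASE expression.
CREATE FUNCTION from_rgb(rgb STRING COMMENT 'An RGB hex color code')
  RETURNS STRING
  COMMENT 'Translates an RGB hex color code into an English color name'
  RETURN CASE rgb
           WHEN 'FF00FF' THEN 'magenta'
           WHEN '00FF00' THEN 'green'
         END;
```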
OK, but there are a lot more than two colors in this world. And we want this translation both ways, so these should really be in a lookup table:
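One way this could look; the table name, sample rows, and the exact subquery are illustrative:

```sql
-- Move the mapping into a table...
CREATE TABLE colors(rgb STRING NOT NULL, name STRING NOT NULL);

INSERT INTO colors VALUES
  ('FF00FF', 'magenta'),
  ('00FF00', 'green'),
  ('0000FF', 'blue');

-- ...and redefine the function to look the value up.
-- Note the parameter reference qualified by the function name (from_rgb.rgb).
CREATE OR REPLACE FUNCTION from_rgb(rgb STRING COMMENT 'An RGB hex color code')
  RETURNS STRING
  COMMENT 'Translates an RGB hex color code into an English color name'
  RETURN SELECT first(name) FROM colors WHERE colors.rgb = from_rgb.rgb;
```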
There are multiple new concepts applied here:
What does the physical plan look like now? It is easy to see that using an external UDF that itself performs a query, which would result in a nested loop join, is an awful way to burn precious resources.
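To answer that, we can explain a query that applies the table-backed from_rgb sketch above (output omitted):

```sql
-- The optimizer can see through the SQL UDF body and plan the lookup as a join.
EXPLAIN SELECT from_rgb(code) FROM VALUES ('FF00FF'), ('0000FF') AS codes(code);
```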
In this case, Catalyst has chosen a broadcast hash join instead of a nested loop join. It can do this because it understands the content of the SQL UDF.
Thus far, all examples discussed have used scalar-valued functions, that is, ones that return a single value. That result may be of any type, even complex combinations of structs, arrays, and maps. There is also another type of UDF to discuss: the table-valued UDF.
Imagine if views took arguments! You could encapsulate complex predicates even if they rely on user-provided values. A SQL Table UDF is just that: a view by any other name, except with parameters.
Let’s assume that the color mapping above is not unique. At the very least, we can assert the color names differ across languages.
Therefore the `from_rgb` function needs to be modified to return either an array of names or a relation.
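A sketch of a table-valued variant; the column name in the RETURNS clause is illustrative:

```sql
-- Replace the scalar function with one that returns a relation.
CREATE OR REPLACE FUNCTION from_rgb(rgb STRING COMMENT 'An RGB hex color code')
  RETURNS TABLE(name STRING COMMENT 'The color name')
  COMMENT 'Translates an RGB hex color code into color names'
  RETURN SELECT name FROM colors WHERE colors.rgb = from_rgb.rgb;
```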
As you can see, the only difference compared to a scalar function is a more complex RETURNS clause. Unlike views, SQL UDFs mandate a declaration of the returned relation’s signature.
User-defined table functions are new to DBR. Let’s have a look at how to invoke them.
In its simplest form, a table function is invoked in the same way and in the same places a view is referenced. The only difference is the mandatory parentheses, which enclose the function’s arguments. The example below passes literal arguments, but the arguments can be any expression, even scalar subqueries.
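For example, using the table-valued from_rgb sketch from above:

```sql
-- Invoke the table function in the FROM clause, just like a view (value is illustrative).
SELECT * FROM from_rgb('FF00FF');
```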
Most powerful, however, is the use of a SQL table UDF in a join, typically a correlated cross join:
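A sketch of such a join, looking up the name(s) for every code in an inline table (the inline table and values are illustrative):

```sql
-- Correlate the table function with a preceding relation via LATERAL.
SELECT codes.rgb, from_rgb.name
  FROM VALUES ('FF00FF'), ('00FF00') AS codes(rgb),
       LATERAL from_rgb(codes.rgb);
```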
Here the arguments refer (correlate) to a preceding (lateral) relation in the FROM clause. The new LATERAL keyword gives Catalyst permission to resolve these columns. Also note that you can refer to the result of the table function by naming the columns as defined in the result signature and optionally qualified by the function name.
Naturally, SQL UDFs are fully supported by the existing GRANT, REVOKE, SHOW, DESCRIBE and DROP statements.
The statement worth pointing out in more detail is DESCRIBE.
The basic DESCRIBE returns what you might expect, but the extended DESCRIBE adds significantly more detail:
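For the from_rgb sketch above, that would look like this (output omitted):

```sql
-- The basic form lists the function and its parameters.
DESCRIBE FUNCTION from_rgb;

-- The extended form adds more detail, such as the function body.
DESCRIBE FUNCTION EXTENDED from_rgb;
```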
What we have described represents the initial functionality for SQL UDFs. Future extensions we are pondering include support for:
SQL UDFs are a big step forward in SQL usability and can be used in many different ways, as outlined in this blog. We encourage you to think of even more creative ways to leverage SQL UDFs, be it in Databricks SQL or using Photon for Data Engineering jobs. Try the notebook here and see the documentation for more information.