Apache Spark™ 3.5 and Databricks Runtime 14.0 have brought an exciting feature to the table: Python user-defined table functions (UDTFs). In this blog post, we’ll dive into what UDTFs are, why they are powerful, and how you can use them.
A Python user-defined table function (UDTF) is a new kind of function that returns a table as output instead of a single scalar result value. Once registered, a UDTF can appear in the FROM clause of a SQL query.
Each Python UDTF accepts zero or more arguments, where each argument can be a constant scalar value such as an integer or string. The body of the function can inspect the values of these arguments in order to make decisions about what data to return.
In short, if you want a function that generates multiple rows and columns, and want to leverage the rich Python ecosystem, Python UDTFs are for you.
While Python UDFs in Spark each accept zero or more scalar values as input and return a single value as output, UDTFs offer more flexibility: they can return multiple rows and columns, extending the capabilities of UDFs.
SQL UDTFs are efficient and versatile, but Python offers a richer set of libraries and tools. For transformations or computations needing advanced techniques (like statistical functions or machine learning inferences), Python stands out.
Let’s look at a basic Python UDTF:
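The following is a minimal sketch; the class name SquareNumbers and the column names num and squared are illustrative.

```python
from pyspark.sql.functions import udtf

@udtf(returnType="num: int, squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        # Emit one row per number in the inclusive range [start, end].
        for num in range(start, end + 1):
            yield (num, num * num)
```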
In the above code, we've created a simple UDTF that takes two integers as inputs and produces two columns as output: the original number and its square.
The first step to implement a UDTF is to define a class, in this case SquareNumbers.
Next, you need to implement the eval method of the UDTF. This is the method that performs the computation and returns rows, and its signature defines the input arguments of the function.
Note the use of the yield statement; a Python UDTF requires each returned row to be either a tuple or a Row object so that the results can be processed properly.
Finally, to mark the class as a UDTF, you can use the @udtf decorator and define the return type of the UDTF. Note that the return type must be either a StructType or a DDL string representing a struct type.
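For instance, the DDL string "num: int, squared: int" used above could equally be written as an explicit StructType built from the standard pyspark.sql.types classes:

```python
from pyspark.sql.types import IntegerType, StructField, StructType

# Equivalent to the DDL string "num: int, squared: int"
schema = StructType([
    StructField("num", IntegerType()),
    StructField("squared", IntegerType()),
])
```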
You can invoke a UDTF directly using the class name.
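For instance, with the SquareNumbers class sketched above:

```python
from pyspark.sql.functions import lit

# Calling the UDTF class with column arguments returns a DataFrame.
SquareNumbers(lit(1), lit(3)).show()
# +---+-------+
# |num|squared|
# +---+-------+
# |  1|      1|
# |  2|      4|
# |  3|      9|
# +---+-------+
```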
To use the function in SQL, first register the Python UDTF with the active SparkSession:
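```python
# "square_numbers" is an arbitrary SQL-facing name for the function.
spark.udtf.register("square_numbers", SquareNumbers)
```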
Then you can use it in SQL as a table-valued function in the FROM clause of a query:
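For example, through the Python spark.sql API:

```python
# The registered UDTF behaves like a table in the FROM clause.
spark.sql("SELECT * FROM square_numbers(1, 3)").show()
```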
Apache Arrow is an in-memory columnar data format that allows for efficient data transfers between the Java and Python processes. It can significantly boost performance when the UDTF outputs many rows. Arrow optimization can be enabled by passing useArrow=True to the udtf decorator.
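For example (SquareNumbersArrow is again an illustrative name):

```python
from pyspark.sql.functions import udtf

# useArrow=True enables Arrow-based transfer between the JVM and Python.
@udtf(returnType="num: int, squared: int", useArrow=True)
class SquareNumbersArrow:
    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)
```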
The example above might feel basic. Let’s dive deeper with a fun example, integrating Python UDTFs with LangChain.
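One way to do this is a UDTF that asks an LLM for keywords about a topic and emits one row per keyword. The sketch below assumes the langchain package’s OpenAI, PromptTemplate, and LLMChain interfaces as they existed in the early 0.0.x releases, plus a valid OpenAI API key; the class name, prompt, and model are all illustrative:

```python
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from pyspark.sql.functions import udtf

@udtf(returnType="keyword: string")
class KeywordsGenerator:
    """Generate comma-separated keywords about a topic using an LLM."""

    def __init__(self):
        # Illustrative model choice; supply your own API key.
        llm = OpenAI(model_name="text-davinci-003", openai_api_key="...")
        prompt = PromptTemplate(
            input_variables=["topic"],
            template=(
                "Generate a few comma separated keywords about {topic}. "
                "Output only the keywords."
            ),
        )
        self.chain = LLMChain(llm=llm, prompt=prompt)

    def eval(self, topic: str):
        # Query the LLM once, then emit one row per returned keyword.
        response = self.chain.run(topic)
        for keyword in response.split(","):
            yield (keyword.strip(),)
```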
Now, you can invoke the UDTF:
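For instance, with the KeywordsGenerator class sketched above:

```python
from pyspark.sql.functions import lit

KeywordsGenerator(lit("apache spark")).show(truncate=False)
```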
Whether you're looking to perform complex data transformations, enrich your datasets, or simply explore new ways to analyze your data, Python UDTFs are a valuable addition to your toolkit. Try this notebook and see the documentation for more information.
This functionality is only the beginning of the Python UDTF platform; many more features are currently in development in Apache Spark and will become available in future releases.