Comprehensive View on Intervals in Apache Spark 3.2

May 28, 2021 10:30 AM (PT)


An overview of intervals in Apache Spark before version 3.2 and the changes coming in future releases, including conformance to the ANSI SQL standard. We will discuss existing issues of the interval APIs and how they are solved by the new types: the year-month interval and the day-time interval. I will show how to use the interval API of Spark SQL and PySpark and how to avoid potential problems. I will demonstrate construction of intervals from external types and saving/loading via Spark's built-in datasources.

In this session watch:
Maxim Gekk, Open Source Team, Databricks


Transcript

Maxim Gekk: Hello everyone. Thank you for joining my talk. Today, I'm going to discuss a new feature of Apache Spark 3.2: the new interval types added in this Spark version.
My name is Maxim Gekk. I'm a software engineer at Databricks and a committer of Apache Spark, and here you can see a link to my GitHub account. I have been working on these new intervals for the last few months, and together with other community members we developed the new interval types that should replace the older calendar intervals. Today we will discuss this new feature in more detail.
This is the agenda for my talk. First of all, I'm going to give an overview of the new intervals and discuss why we decided to add the new types. After that, I will go into the details of the new types: how you can construct them and how you can apply operations such as binary arithmetic over intervals, dates, and timestamps.
Here you can see the link to the umbrella ticket. In this ticket we gathered all the sub-tasks related to the new intervals, and Spark 3.2 should come with the new interval types. Starting from the new release, we would not recommend using the older calendar interval type, and it will probably be deprecated in the next version. The calendar interval type was added a few years ago; on the slide you can see a link to the original PR that replaced plain microsecond columns with the new type, where one of the Spark committers mentioned that he added a new interval type.
Later this type was renamed to calendar interval type so that it would not clash with the interval type from the SQL standard. And what was the reason to add a new type? Each month of the year has a different number of days. To give users the ability to change only the month field, we cannot simply add, for example, 30 or 31 days. If you want to add a couple of months, you should be able to tell Spark: I am going to add exactly two months, not 60 or 62 days. That's why the new type was added to the Catalyst type system five years ago.
But this brings some problems. First of all, the type is not comparable. In the example on the slide, you can see that I try to compare the subtraction of two dates with an interval of one month, and Spark complains that it cannot compare them: if we have an interval of 30 days and an interval of one month, we cannot say these intervals are the same, because months can have a different number of days. That's why, before version 3.2, Spark doesn't support comparison of intervals.
And there is another problem: since you cannot compare intervals, you cannot order intervals either. For example, here I create a table with two columns of date type and try to subtract them and order by the subtraction. Spark wasn't able to order here either, because the current calendar interval type doesn't support ordering and sorting; basically, we cannot say for sure whether one interval is bigger than another or not. These are the two fundamental problems, and that's why we decided to add the new types. Another problem is that calendar interval values currently cannot be persisted to external storage; you cannot, for example, store them to Parquet or even to a text format like JSON or CSV. Also, a calendar interval currently has three fields: two int fields for the number of months and the number of days, and one more field for microseconds.
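For illustration, here is a minimal sketch of the kind of queries that Spark 3.1 rejects at analysis time because of these two limitations; the events table and its start_date/end_date columns are hypothetical names used only for this example:

    // Comparing a date subtraction (a calendar interval before 3.2)
    // with an interval literal fails: the type is not comparable.
    spark.sql("SELECT (DATE'2021-03-01' - DATE'2021-01-01') > INTERVAL 1 MONTH")

    // Ordering by an interval column fails for the same reason:
    // the calendar interval type is not orderable.
    spark.sql("SELECT end_date - start_date AS diff FROM events ORDER BY diff")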
In total, each calendar interval value consumes 12 bytes, which is pretty large compared to other primitive types. And another problem is that the current interval type is, unfortunately, not compliant with the SQL standard.
If we look at the SQL standard, we can see that it defines two interval classes. The first class is called the year-month interval; it represents a combination of two fields, year and month. The other class is for days plus a time part: hours, minutes, seconds, and a fraction of a second. Recently we added these new Catalyst types to the Spark master branch, and here you can see a screenshot from my PR with the two interval types. They will be released soon in version 3.2, so they will be available to users.
You can test them. So, if we look at the SQL standard, it defines an interval as a combination of fields. A year-month interval is a pair of two fields, year and month. The month field has a restriction: it can contain only numbers from 0 to 11, which is obvious. The range of the year field depends on how many bytes you allocate internally for the interval; Spark currently stores a year-month interval as an integer, so the year field has corresponding restrictions. A day-time interval is a combination of day, hour, minute, and second, where the second can have a fraction, and the standard does not actually pre-define the precision. Because Spark supports timestamps with microsecond precision, we decided to support day-time intervals with microsecond precision as well.
So, the new Catalyst types. As I said before, the precision of year-month intervals is months, the precision of day-time intervals is microseconds, and both types are comparable and orderable. The first one, the year-month interval, is stored internally as an integer, so it consumes just four bytes per value. The day-time interval type is stored as a long, so one value has a size of eight bytes. On the slide you can see the minimal and maximum values; they are a consequence of our decision about how to store the intervals.
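As a rough sketch of where those bounds come from (this is plain arithmetic implied by the storage choice, not a Spark API):

    // Year-month intervals are an Int number of months,
    // day-time intervals are a Long number of microseconds.
    val maxYears = Int.MaxValue / 12                          // ~178,956,970 years
    val maxDays  = Long.MaxValue / (24L * 60 * 60 * 1000000L) // ~106,751,991 days
    println(s"year-month interval max: about $maxYears years")
    println(s"day-time interval max: about $maxDays days")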
Basically, there are three options for how you can create intervals. The first option is from interval strings, and here we strictly follow the SQL standard: you start with the INTERVAL keyword, the string in quotes defines the value of the interval, and at the end there is a suffix with the interval qualifier, either YEAR TO MONTH or DAY TO SECOND. The standard supports other units as well, but Spark currently supports only these two, so probably in future versions we will extend the supported units and also allow, for example, HOUR TO SECOND, MINUTE TO SECOND, just YEAR, or just MONTH. Another option is constructing interval columns from external types.
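For example, here is a minimal sketch of the two literal forms in a Spark 3.2 session (the values are arbitrary):

    // ANSI interval literals: the INTERVAL keyword, a quoted value,
    // and the unit qualifier at the end.
    spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH").show(false)            // 1 year 2 months
    spark.sql("SELECT INTERVAL '3 04:05:06.789' DAY TO SECOND").show(false) // 3 days 4h 5m 6.789s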
A year-month interval can be constructed from a collection of java.time.Period instances. When we receive such values externally, inside Spark we just take two fields from this Java class, only year and month, and skip the day field. Regarding day-time intervals, Spark accepts java.time.Duration values. The duration internally supports a second fraction with up to nanosecond precision, but Spark internally supports only microsecond precision, so we just truncate the second fraction to microseconds. The third option is constructing interval types from integral columns: currently we have the make_interval function for the calendar interval type, which accepts seven fields as far as I remember, but for the new intervals we will have overloaded versions, one that accepts only two arguments, the first for years and the second for months, and also one for day-time intervals.
This feature is not supported yet, but we are going to implement it soon, I hope. Here you can see the table of operations over intervals, dates, and timestamps that are required by the SQL standard. If we are going to say that we conform to the SQL standard, we have to support all these operations.
Here you can see all the possible combinations that will be supported in Spark 3.2. We follow the standard in almost all cases, probably except one, where we add and subtract day-time intervals from dates. On the slide you can see the behavior of Spark 3.1: if you add a day-time interval to a date, it just returns a date and chops off the time part. In Spark 3.2 we decided not to chop the time part, because from our point of view that is a kind of loss of the user's information, which we think is not appropriate. If you don't like this behavior, you can ask Spark to return the older calendar interval type instead of the ANSI types by setting the SQL config shown on the slide. That affects the subtraction of dates, and it also influences the parsing of ANSI interval literals. In the example you can see interval literals in ANSI format: previously we converted such literals to the calendar interval type, but starting from version 3.2 we convert them to the day-time interval type.
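A small sketch of the behavior change, assuming Spark 3.2 defaults (the exact output formatting may differ):

    // Adding a day-time interval to a date keeps the time part,
    // so the result is a timestamp rather than a truncated date.
    spark.sql("SELECT DATE'2021-01-01' + INTERVAL '1 02:03:04' DAY TO SECOND AS ts").show(false)
    // expected ts: 2021-01-02 02:03:04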
I should highlight some corner cases. For example, some countries, like the United States, have two periods in the year: winter time and summer time. Recently the United States switched from winter time to summer time and moved the clocks forward by one hour. If you look at the example, I added an interval of two hours to a timestamp, and as a result the timestamp moved forward by almost three hours instead of the two hours you would expect. Basically, we add a physical duration instead of a logical one. Here we follow the behavior of the previous intervals, and I cannot say that we don't follow the SQL standard, because the SQL standard has no definition of time zones, only of zone offsets, so there are no such requirements in the standard.
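Here is a sketch of that corner case, assuming the America/Los_Angeles time zone and the 2021 spring-forward night (March 14, when clocks jumped from 02:00 to 03:00):

    // Day-time intervals add a physical duration, so adding 2 hours
    // just before the switch lands 3 wall-clock hours later.
    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    spark.sql(
      "SELECT timestamp'2021-03-14 01:00:00' + INTERVAL '0 02:00:00' DAY TO SECOND AS ts"
    ).show(false)
    // expected ts: 2021-03-14 04:00:00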
Next one. As I mentioned before, we can construct intervals from external types. If you want to construct, for example, a year-month interval, you can use a collection of java.time.Period, as you can see in the example on the slide. In the right part, I constructed a day-time interval from java.time.Duration. Because java.time.Duration has a resolution of up to nanoseconds, but Spark internally supports only microseconds, we have to truncate the duration from nanoseconds to microseconds; you can see this on the slide.
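A minimal sketch of what the slide shows, assuming a spark-shell session where spark.implicits._ is in scope:

    import java.time.{Duration, Period}
    import spark.implicits._

    // Year-month interval column built from java.time.Period values;
    // only the years and months of the Period are used.
    val ym = Seq(Period.ofYears(1).plusMonths(2)).toDF("ym")
    ym.printSchema()   // ym should show up as "interval year to month"

    // Day-time interval column built from java.time.Duration values;
    // the nanosecond fraction is truncated to microseconds.
    val dt = Seq(Duration.ofDays(3).plusHours(4).plusNanos(123456789L)).toDF("dt")
    dt.printSchema()   // dt should show up as "interval day to second"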
As a consequence of supporting these external Java types, you can use them in user-defined functions and user-defined aggregate functions. You can apply such functions to year-month and day-time interval columns and transform them using Java or Scala code.
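For instance, a small sketch of a Scala UDF over a day-time interval column; doubleDuration is a hypothetical name, and dt is the DataFrame from the previous sketch:

    import java.time.Duration
    import org.apache.spark.sql.functions.{col, udf}

    // The UDF receives the day-time interval value as java.time.Duration
    // and returns a Duration, i.e. another day-time interval.
    val doubleDuration = udf((d: Duration) => d.multipliedBy(2))
    dt.select(doubleDuration(col("dt")).as("doubled")).show(false)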
On these slides, you can see how you can create tables with the new types. The SQL standard says you can define intervals with different units and supports many options, but starting from version 3.2, Spark will support only two kinds of intervals: on the left side you can see the year-month interval type, and on the right side the interval from day to second. This new feature is still under development.
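A sketch of the two forms from the slide (table names are arbitrary; whether a given built-in data source accepts these column types depends on the milestone-two work described below):

    // Year-month interval column.
    spark.sql("CREATE TABLE ym_tbl (id INT, i INTERVAL YEAR TO MONTH)")
    // Day-time interval column.
    spark.sql("CREATE TABLE dt_tbl (id INT, i INTERVAL DAY TO SECOND)")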
Currently we have almost finished milestone one, where we try to reach feature parity with the existing calendar interval type. We implemented serialization and deserialization of the external types and collections, and casting of strings to intervals. We implemented all the binary arithmetic operations over intervals, timestamps, and dates, so the new intervals already support almost all features of calendar intervals. In milestone two, we are going to support creation of external tables and storing such tables to the data sources: to Parquet, JSON, and all the built-in data sources. SQL operations like INSERT and SELECT of the data should all be supported. Also, we are going to support partition discovery with the new intervals, which is currently not possible, and all of this is work in progress.
We actually need some help from the community, so I would ask anybody watching this talk to look at this umbrella JIRA, take some sub-tasks, and help us develop the new feature. In milestone three, we have already implemented most of the features: we support access to tables with these new intervals from the Thrift server via JDBC. Currently we return the new intervals as strings, because we think this will be easier for our users.
Finally, in milestone four, we need deeper integration with other front-ends: we need to support Python UDFs and return intervals to Python, and we also need to support R. These tasks also need help from the Spark community. Yeah. So that's it from my side. Thank you for joining my talk. Thank you very much.

Maxim Gekk

I am a software engineer in the open source team at Databricks Inc. Since 2017, I have contributed to Apache Spark, in particular to JSON/CSV datasources, date-time APIs, datasource v2 commands, etc. ...