The talk is about date-time processing in Spark 3.0, its API and implementations made since Spark 2.4. In particular, I am going to cover the following topics:
1. Definition and internal representation of dates/timestamps in Spark SQL. Comparison of the Spark 3.0 date-time API with previous versions and other DBMSs.
2. Date/timestamp functions of Spark SQL. Nuances of behavior and details of implementation. Use cases and corner cases of the date-time API.
3. Migration from the hybrid calendar (Julian and Gregorian calendars) to the Proleptic Gregorian calendar in Spark 3.0.
4. Parsing of date/timestamp strings, saving and loading date/time data via Spark’s datasources.
5. Support of the Java 8 time API in Spark 3.0.
Speaker: Maxim Gekk
– Hello everyone, thank you for joining my talk. Today I’m going to speak about the date and timestamp types in Apache Spark 3.0. My name is Maxim Gekk, I’m a software engineer at Databricks. Before my engineering position I worked in support, working on customer problems and issues. Here you can see my GitHub account. I have contributed to Apache Spark for almost two and a half years already. Our agenda today covers mostly two types, the Date type and the Timestamp type, and the changes we made in Apache Spark version 3.0. We will talk about calendars, how to construct date columns, what the session time zone is and how it impacts date and timestamp values. We’ll also speak about the valid ranges for these types, and about the internal representation of date and timestamp columns, that is, how Spark and Catalyst actually see these values internally. And we will see how to collect and parallelize dates and timestamps from an application. I also put up a link to my blog post, which I wrote together with my colleagues Wenchen and Hyukjin, and you can find more details there. So, dates and calendars. Why do we put them together? Catalyst has more than 10 different types, and Date is one of them. Calendars relate to dates because the calendar impacts how dates are presented to the user, for example when we extract fields like the year, month or day. So I would say a date can be represented as a triple of year, month and day, plus some constraints. Constraints apply to the date fields, as well as to arithmetic and other operations on dates. For example, if we look at the constraint for the year: according to the SQL standard, a year should be in the range from 1 to 10,000 exclusive, and Spark follows the SQL standard here, so users should specify a year in this range. Also, obviously, months should be in the range from 1 to 12.
The day of the month depends on the month, right? So it could be from 1 up to 28, 29, 30 or 31, depending on the year and month. Those are the obvious constraints, but the non-obvious constraints come from the calendar used inside the system. For example, before version 3.0, up to version 2.4, Spark internally used the hybrid calendar, which is actually the Julian calendar before the year 1582 plus the Gregorian calendar after it, so some kind of combination of two calendars. Starting from version 3.0, Spark switched to the Proleptic Gregorian calendar. As we were saying before about constraints, this impacts the constraints, because some dates exist in one calendar but not in the other: some dates in the Julian calendar don’t exist in the Proleptic Gregorian calendar, and some dates in the Gregorian calendar don’t exist in the Julian calendar. We will speak about this in the next two slides. Also, because we switched from one calendar to another, starting from version 3.0 Spark moved from the Java 7 time API to the Java 8 time API. This impacts internal operations when we need to apply some arithmetic, like adding a few days to the current date, for example. It also impacts the parallelization of external collections and collecting dates and timestamps from Spark into user applications. And if we look at PySpark or SparkR, they already supported the Proleptic Gregorian calendar, but Java or Scala applications need to be adjusted for the new calendar. Below this slide you can see a link to a Spark issue, which is a kind of umbrella for all the tickets related to switching to the Proleptic Gregorian calendar. If you open this link, you can see which issues were solved during the switch to the new calendar. Because the Julian calendar was used in previous Spark versions, it’s important to highlight some key points about the Julian calendar. It was initially proposed by Julius Caesar many, many years ago.
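To see what the Proleptic Gregorian calendar means in practice, here is a small sketch using Python’s standard datetime module, which also uses the Proleptic Gregorian calendar, so it behaves the same way as Spark 3.0 for the dates discussed above:

```python
from datetime import date

# Python's datetime uses the Proleptic Gregorian calendar, the same
# calendar Spark 3.0 switched to. The date 1582-10-10 does not exist
# in the hybrid Julian/Gregorian calendar (the 1582 cutover skipped
# October 5th through 14th), but it is perfectly valid here.
d = date(1582, 10, 10)
print(d.isoformat())  # 1582-10-10

# Day arithmetic is continuous: there is no 10-day gap around the
# 1582 cutover in the Proleptic Gregorian calendar.
gap = date(1582, 10, 15) - date(1582, 10, 4)
print(gap.days)  # 11
```

In the hybrid calendar, October 15th would come one day after October 4th; in the proleptic calendar the eleven days in between all exist.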
It was the first calendar after the Roman calendar, which was pretty complicated and based on the lunar calendar, on moon phases. The Julian calendar is a fully solar calendar, and it’s more precise. It has two types of years, normal years and leap years. A normal year has 365 days, but a leap year, which happens every four years, has 366 days. So an average year is around 365.25 days long. That’s pretty close to the actual solar year, but there is some error, and that’s why this calendar was replaced in the year 1582 by another one. But the Julian calendar is still used in some areas, in Africa for example, and the Russian Orthodox Church still uses the Julian calendar. The Gregorian calendar was introduced by Pope Gregory in October 1582. It still has two types of years, but a year divisible by 100 and not by 400 is not a leap year. Because of this exception, the Gregorian calendar is slightly more precise than the Julian one. It was adopted by many, many countries around the world: for example, in 1582 Spain, Portugal, France and other European countries adopted this calendar. But Saudi Arabia, for example, adopted it only in 2016, so just a few years ago. Now, let’s compare these two calendar systems, the hybrid calendar and the Proleptic Gregorian one. Up to Spark 3.0, all internal modules and components were based on the hybrid calendar. Starting from version 3.0, everything was rewritten, and now all internal functions use the Proleptic Gregorian calendar. To achieve this, Spark switched to the Java 8 time API, which follows JSR-310 and grew out of the ThreeTen and Joda projects. So, previously I talked about dates which are present in one calendar but don’t exist in another. For example, in the hybrid calendar the dates between October 4th and October 15th, 1582 don’t exist.
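The two leap-year rules described above are easy to write down directly. A small sketch (the function names are my own, not Spark API):

```python
def is_leap_julian(year: int) -> bool:
    # Julian rule: every fourth year is a leap year.
    return year % 4 == 0

def is_leap_gregorian(year: int) -> bool:
    # Gregorian rule: century years are not leap years
    # unless they are also divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# The year 1000 illustrates the divergence between the calendars:
print(is_leap_julian(1000))     # True  -> 1000-02-29 exists in the hybrid calendar
print(is_leap_gregorian(1000))  # False -> 1000-02-29 does not exist in Proleptic Gregorian
```

This is exactly the kind of date (1000-02-29) that exists under one calendar system but not the other, which matters when migrating stored data.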
But some old dates, based on leap years, don’t exist in the Proleptic Gregorian calendar. For example, the year 1000 is a leap year in the hybrid calendar, but it is not a leap year in the Proleptic Gregorian one. Why am I talking about these dates? Because, for example, when Spark parses input dates, if it cannot parse a date it will raise an error or return null. Whether the input can be recognized as a date depends on the current calendar, and this is important. It also impacts loading data from external data sources. In the left column you can see a link to the JIRA ticket where such errors were resolved: for example, we introduced a special operation, rebasing, where we rebase local dates from one calendar to the other, and we do it while loading or saving dates and timestamps with all the built-in data sources like Parquet, ORC and Avro. And what was the reason to switch to the new calendar? The Proleptic Gregorian calendar is actually required by the SQL standard, as I said before, and following this calendar allows us to conform to the ISO standards. Also, this calendar is used by broadly used databases like PostgreSQL, MySQL and SQLite. Let’s talk about constructing dates. In Spark 3.0 we added a new function called make_date. For now this function is available only in Spark SQL, but in the future we’ll add it to the other APIs as well. This function takes three columns of int type; if a column doesn’t have int type, Spark will try to cast the column to the required type. This is one way to construct dates. Another one is parallelization from external types: you can create a collection of Scala or Python dates or timestamps and submit it to Spark, basically creating a dataset from this collection. Before version 3.0, in Spark 2.4 and earlier versions, Spark could recognize only the java.sql.Date and java.sql.Timestamp types.
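To give a feeling for the semantics, here is a rough Python model of what a make_date-style constructor does: assemble a date from three integer fields and return null (None) when the fields don’t form a valid Proleptic Gregorian date. This is an illustration of the behavior described in the talk, not Spark’s implementation:

```python
from datetime import date
from typing import Optional

def make_date(year: int, month: int, day: int) -> Optional[date]:
    """Model of a make_date-style constructor: build a Proleptic Gregorian
    date from (year, month, day) ints, returning None for invalid field
    values, in the spirit of Spark SQL's null-on-invalid-input behavior."""
    try:
        return date(year, month, day)
    except ValueError:
        return None

print(make_date(2020, 6, 26))  # 2020-06-26
print(make_date(2020, 2, 30))  # None: February 30th is not a valid date
```

In real Spark SQL this would be `SELECT make_date(2020, 6, 26)`, with non-int arguments cast to int first, as mentioned above.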
But starting from version 3.0, Spark also recognizes the java.time.LocalDate and java.time.Instant types, so you can create a collection of these types and parallelize it, basically creating a dataset from this collection. Another feature added in version 3.0 is constructing dates from typed literals: you specify a string with the special keyword DATE before it, and Spark will try to convert this string to a date. And another way is using the special built-in function to_date. This function internally uses DateTimeFormatter, a special class provided by the Java 8 time API, and the same functionality is used internally by the JSON and CSV data sources. While we implemented this feature, we faced a few errors, a few incompatibilities between the Java 7 and Java 8 APIs, so in Spark 3.0 we introduced a special flag, spark.sql.legacy.timeParserPolicy. If you observe issues, cases where Spark sometimes cannot parse your dates, you can try to set this SQL config to LEGACY, and Spark will internally switch to the previous parser and try to parse the incoming dates using the old formatter and parser. One more approach to constructing dates is the current_date function. current_date is a special function which is recognized by the analyzer: at the beginning, at the start of your query, the analyzer takes the current date, saves it to a special value, and replaces all your current_date function calls with the date taken at the beginning of the query. And if we look at parsing: to be more convenient, Spark, starting from version 3.0, has special values it searches for in dates and timestamps. If it sees the string epoch in the input, it automatically replaces epoch with January 1st of the year 1970. This is applicable in all cases of converting strings to dates.
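As a sketch of the to_date-style conversion described above, here is a Python stand-in: parse a string against a pattern and return null (None) on failure. Note that Spark itself uses java.time.format.DateTimeFormatter with its own pattern syntax (e.g. 'yyyy-MM-dd'); the strptime patterns here are just the Python analogue:

```python
from datetime import datetime, date
from typing import Optional

def to_date(s: str, fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Sketch of a to_date-style conversion: parse a string with the
    given pattern, returning None (null) when parsing fails."""
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None

print(to_date("2020-06-26"))               # 2020-06-26
print(to_date("06/26/2020", "%m/%d/%Y"))   # 2020-06-26
print(to_date("not a date"))               # None
```

The null-on-failure behavior is what makes malformed dates show up as nulls when loading CSV or JSON data, as mentioned earlier.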
I mean, when you convert typed literals, apply casts, or parse dates from CSV or JSON: everywhere, if Spark sees these special values, it converts them to the corresponding dates. Another one is now, and for dates now is actually equal to current_date, so basically it’s the beginning of the query. The same goes for today, tomorrow and yesterday, which are as obvious as they sound: today is the current day, tomorrow is the day after the current day, and yesterday is the day before the current day. These special values were added starting from version 3.0. Unfortunately, they are not available in 2.4, so if you like them, you need to switch to the new Spark version. Now, if we look at the whole range of dates, we can actually divide it into three ranges. From year 1 to 1582, there is a difference in behavior between Spark 2.4 and 3.0: 2.4 uses the Julian calendar, but version 3.0 switched to the Proleptic Gregorian calendar. The difference, as I told you before, is that some dates don’t exist in the Proleptic Gregorian calendar. In this particular case, if Spark, for example, loads data from an external data source like Parquet and sees a date that is not valid in the target calendar, Spark will take the next valid date. For this particular case, the 29th of February of the year 1000: if Spark sees this date, it will take the next one, which is March 1st. So, this is how the rebasing works. If you don’t like this behavior, you can try our special flags; we have special flags for reading and for writing. The next range is the range from October 4th to October 14th, 1582. Dates from this range don’t exist in the hybrid calendar, so Spark 2.4 doesn’t have them, and in Spark 2.4 any date from this range will be shifted to the next valid one. That is 2.4’s behavior.
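The special values above can be sketched as a simple lookup. The function name and the fixed query-start date are my own for illustration; the point is that the current date is captured once, mirroring how the analyzer pins current_date at the start of a query:

```python
from datetime import date, timedelta
from typing import Optional

def resolve_special(value: str, today: date) -> Optional[date]:
    """Sketch of resolving Spark 3.0's special date strings.
    `today` is captured once per query, like the analyzer does."""
    specials = {
        "epoch": date(1970, 1, 1),
        "now": today,
        "today": today,
        "tomorrow": today + timedelta(days=1),
        "yesterday": today - timedelta(days=1),
    }
    return specials.get(value.strip().lower())

query_start = date(2020, 6, 26)  # hypothetical query start date
print(resolve_special("epoch", query_start))     # 1970-01-01
print(resolve_special("Tomorrow", query_start))  # 2020-06-27
```

Because every occurrence resolves against the same captured date, all calls within one query agree with each other, even if the query runs across midnight.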
But Spark 3.0 is able to load these dates without any problems. The next range goes up to the year 10,000, and here both Spark 3.0 and 2.4 conform to the ANSI SQL standard and work in the same way; there is no difference. Let’s look at how Spark saves dates internally. For example, you have a date as a string, or you load dates from an external data source like Avro. Spark takes that input and converts the string (or an integer, or a long) to its internal representation: a number counting days from the epoch, and the epoch for Spark is January 1st of the year 1970. Internally, a date is just this day counter. So when you request, for example, the year or the month of a date, Spark has to take this number and convert it to the year using the calendar. This is why the calendar is important here, as we said before. For dates before the epoch, before the year 1970, the day counts are negative, and for dates after the epoch they are positive. Now, timestamps and time zones. Compared to dates, a timestamp also depends on time zones. If we represent a timestamp as a tuple of year, month, day, hour, minute and second, then in addition to constraints like the calendar, there is one more constraint: the session time zone. Looking at the constraints for year, month and day, they are the same as for dates. For hours, minutes and seconds: hours can go from 0 up to 23, minutes from 0 to 59, and seconds can be represented with a fractional part; Spark supports fractions up to microsecond precision. Speaking about the calendar constraint, it’s the same as for dates: starting from version 3.0, Spark switched to the Proleptic Gregorian calendar, compared to 2.4 which is based on the hybrid one. And the session time zone is taken from the session SQL config, spark.sql.session.timeZone.
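The internal day-counter view of dates can be sketched in a few lines; this models the representation described above (a signed count of days since 1970-01-01, which Spark stores as a 4-byte int), with helper names of my own:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def to_epoch_days(d: date) -> int:
    # A date as a signed count of days since 1970-01-01,
    # mirroring how Spark stores DateType values internally.
    return (d - EPOCH).days

def from_epoch_days(days: int) -> date:
    # Converting back requires the calendar: here, Proleptic Gregorian.
    return EPOCH + timedelta(days=days)

print(to_epoch_days(date(1970, 1, 1)))    # 0
print(to_epoch_days(date(1969, 12, 31)))  # -1 (negative: before the epoch)
print(from_epoch_days(1))                 # 1970-01-02
```

Field extraction (year, month, day) happens only on the way back from the counter to a calendar date, which is exactly why the choice of calendar matters.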
So what is the main difference between local dates and timestamps? A date is by nature local; it doesn’t depend on a time zone. But a timestamp, especially Spark’s timestamp, is basically an instant: it defines a concrete time instant on Earth. How it’s converted to a local timestamp depends on the SQL config, which can be specified in two different ways. You can specify it as a zone offset, or you can specify it as a region ID, like America/Los_Angeles. Also, you can use aliases for zone offsets, like UTC or Z. To convert a region ID to a zone offset, Spark uses the capabilities, the APIs, of the underlying libraries, basically the JDK. Resolving time zones has many, many issues. For example, if you specify a time zone as an offset, there is no problem, but a problem appears when you set a region ID. In this example, you can see Java 7 returns 8 hours, but Java 8 returns 7 hours, 52 minutes, 58 seconds. Why? Because Java 8 is more precise: before November 18th, 1883, each city in the United States had its own offset. That’s why we see the difference. So, by switching to the new Java 8 API, Spark 3.0 became more precise. Also, a problem can happen when the clocks switch, for example from winter time to summer time or from summer time to winter time. There is also a difference here, because in the case of an overlap Spark 3.0 takes the summer time, while Spark 2.4 takes the winter time. If you want to avoid these problems, you can set the time zone as an offset. Now, if we compare Spark’s timestamp with the SQL timestamp, what is the difference? The difference is that Spark’s timestamp is basically a timestamp with session time zone.
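The pre-1883 local-mean-time offset mentioned above can be reproduced with modern IANA time zone data. Python’s zoneinfo module is backed by the same tz database the JDK’s java.time uses, so it reports the same precise offsets (assuming the system has tz data installed):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

la = ZoneInfo("America/Los_Angeles")

# Before 1883-11-18, the tz database records Los Angeles local mean
# time: an offset of -7:52:58 rather than a round -8:00. This is the
# distinction that makes Java 8 (and hence Spark 3.0) more precise.
old = datetime(1880, 1, 1, tzinfo=la)
print(old.utcoffset())  # -7:52:58, printed as "-1 day, 16:07:02"

# After standard time zones were introduced, the offset is a round -8:00.
new = datetime(1900, 1, 1, tzinfo=la)
print(new.utcoffset())  # -8:00:00, printed as "-1 day, 16:00:00"
```

Setting the session time zone to a fixed offset like "-08:00" sidesteps both this historical subtlety and the daylight-saving overlap behavior.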
That means you have a column of timestamps and one time zone, which is applied to all values inside the column. This is different from the SQL TIMESTAMP WITH TIME ZONE, because according to the SQL standard the time zone is applied per value, and values in the column can have different time zones. And it’s completely different from TIMESTAMP WITHOUT TIME ZONE, because that one doesn’t have a time zone at all. Here, on this slide, you can see how to construct it; basically, it has the same ways as for dates, as we discussed before. And this slide is about parallelization, how to construct a timestamp column from collections. An important aspect when you construct it: you should take into account which time zone is used. Usually, if you construct it from a Python collection, Python takes the system time zone, and it can be different from the Spark session time zone. The same goes for Scala collections. The ranges are almost the same as for dates, with some exceptions; for example, I would highlight the year 2037. Starting from this year, there is a difference between Spark 2.4 and 3.0: 2.4 has some problems resolving time zones, and you can see it’s a JDK bug, which Spark 3.0 doesn’t have. As for the internal view: Spark stores timestamps as an offset from the epoch, stored as eight bytes. Again, it’s pretty similar to dates. Collecting timestamps has similar behavior too; the only difference is that instead of LocalDate you get the Instant type, and you can switch the behavior using the spark.sql.datetime.java8API.enabled config. Yeah. Thank you very much.
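The eight-byte internal view of timestamps (a signed count of microseconds since the epoch) can be sketched the same way as the day counter for dates. The helper name is my own:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_micros(ts: datetime) -> int:
    # A timestamp as a signed count of microseconds since the epoch,
    # mirroring Spark's 8-byte internal representation of TimestampType.
    delta = ts - EPOCH
    return (delta.days * 86_400_000_000
            + delta.seconds * 1_000_000
            + delta.microseconds)

print(to_micros(datetime(1970, 1, 1, tzinfo=timezone.utc)))                    # 0
print(to_micros(datetime(1970, 1, 1, 0, 0, 1, 500000, tzinfo=timezone.utc)))   # 1500000
print(to_micros(datetime(1969, 12, 31, 23, 59, 59, tzinfo=timezone.utc)))      # -1000000
```

Since the stored value is an instant, the session time zone only comes into play when the instant is rendered as local year/month/day/hour fields, not in the stored number itself.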
I am a software engineer in the open source team at Databricks Inc. Since 2017, I have contributed to Apache Spark, in particular to the JSON/CSV datasources, date-time APIs, datasource v2 commands, etc. At Databricks, I have worked in a few departments, from support L1/L2 to engineering. Before Databricks, I developed apps on top of Apache Spark at Huawei, Fyber GmbH and Cisco Systems.