Apache Spark is a very popular tool for processing structured and unstructured data. When it comes to processing structured data, it supports many basic data types, like integer, long, double, string, etc. Spark also supports more complex data types, like the
Timestamp, which are often difficult for developers to understand. In this blog post, we take a deep dive into the Date and Timestamp types to help you fully understand their behavior and how to avoid some common issues. In summary, this blog covers four parts:
- The definition of the Date type and the associated calendar. It also covers the calendar switch in Spark 3.0.
- The definition of the Timestamp type and how it relates to time zones. It also explains the detail of time zone offset resolution, and the subtle behavior changes in the new time API in Java 8, which is used by Spark 3.0.
- The common APIs to construct date and timestamp values in Spark.
- The common pitfalls and best practices to collect date and timestamp objects on the Spark driver.
Date and calendar
The definition of a
Date is very simple: It’s a combination of the year, month and day fields, like (year=2012, month=12, day=31). However, the values of the year, month and day fields have constraints, so that the date value is a valid day in the real world. For example, the value of month must be from 1 to 12, the value of day must be from 1 to 28/29/30/31 (depending on the year and month), and so on.
These constraints are defined by one of many possible calendars. Some of them are only used in specific regions, like the Lunar calendar. Some of them are only used in history, like the Julian calendar. At this point, the Gregorian calendar is the de facto international standard and is used almost everywhere in the world for civil purposes. It was introduced in 1582 and is extended to support dates before 1582 as well. This extended calendar is called the Proleptic Gregorian calendar.
Starting from version 3.0, Spark uses the Proleptic Gregorian calendar, which is already being used by other data systems like pandas, R and Apache Arrow. Before Spark 3.0, it used a combination of the Julian and Gregorian calendar: For dates before 1582, the Julian calendar was used, for dates after 1582 the Gregorian calendar was used. This is inherited from the legacy
java.sql.Date API, which was superseded in Java 8 by
java.time.LocalDate, which uses the Proleptic Gregorian calendar as well.
Notably, the Date type does not consider time zones.
Timestamp and time zone
Timestamp type extends the Date type with new fields: hour, minute, second (which can have a fractional part) and together with a global (session scoped) time zone. It defines a concrete time instant on Earth. For example, (year=2012, month=12, day=31, hour=23, minute=59, second=59.123456) with session timezone UTC+01:00. When writing timestamp values out to non-text data sources like Parquet, the values are just instants (like timestamp in UTC) that have no time zone information. If you write and read a timestamp value with different session timezone, you may see different values of the hour/minute/second fields, but they are actually the same concrete time instant.
The hour, minute and second fields have standard ranges: 0–23 for hours and 0–59 for minutes and seconds. Spark supports fractional seconds with up to microsecond precision. The valid range for fractions is from 0 to 999,999 microseconds.
At any concrete instant, we can observe many different values of wall clocks, depending on time zone.
And conversely, any value on wall clocks can represent many different time instants. The time zone offset allows us to unambiguously bind a local timestamp to a time instant. Usually, time zone offsets are defined as offsets in hours from Greenwich Mean Time (GMT) or UTC+0 (Coordinated Universal Time). Such a representation of time zone information eliminates ambiguity, but it is inconvenient for end users. Users prefer to point out a location around the globe such as
This additional level of abstraction from zone offsets makes life easier but brings its own problems. For example, we now have to maintain a special time zone database to map time zone names to offsets. Since Spark runs on the JVM, it delegates the mapping to the Java standard library, which loads data from the Internet Assigned Numbers Authority Time Zone Database (IANA TZDB). Furthermore, the mapping mechanism in Java’s standard library has some nuances that influence Spark’s behavior. We focus on some of these nuances below.
Since Java 8, the JDK has exposed a new API for date-time manipulation and time zone offset resolution, and Spark migrated to this new API in version 3.0. Although the mapping of time zone names to offsets has the same source, IANA TZDB, it is implemented differently in Java 8 and higher versus Java 7.
As an example, let’s take a look at a timestamp before the year 1883 in the
America/Los_Angeles time zone:
1883-11-10 00:00:00. This year stands out from others because on November 18, 1883, all North American railroads switched to a new standard time system that henceforth governed their timetables.
Using the Java 7 time API, we can obtain time zone offset at the local timestamp as -08:00:
scala> java.time.ZoneId.systemDefault res0: java.time.ZoneId = America/Los_Angeles scala> java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 60.0 res1: Double = 8.0
Java 8 API functions return a different result:
scala> java.time.ZoneId.of("America/Los_Angeles") .getRules.getOffset(java.time.LocalDateTime.parse("1883-11-10T00:00:00")) res2: java.time.ZoneOffset = -07:52:58
Prior to November 18, 1883, time of day was a local matter, and most cities and towns used some form of local solar time, maintained by a well-known clock (on a church steeple, for example, or in a jeweler’s window). That’s why we see such a strange time zone offset.
The example demonstrates that Java 8 functions are more precise and take into account historical data from IANA TZDB. After switching to the Java 8 time API, Spark 3.0 benefited from the improvement automatically and became more precise in how it resolves time zone offsets.
As we mentioned earlier, Spark 3.0 also switched to the Proleptic Gregorian calendar for the date type. The same is true for the timestamp type. The ISO SQL:2016 standard declares the valid range for timestamps is from
0001-01-01 00:00:00 to
9999-12-31 23:59:59.999999. Spark 3.0 fully conforms to the standard and supports all timestamps in this range. Comparing to Spark 2.4 and earlier, we should highlight the following sub-ranges:
0001-01-01 00:00:00..1582-10-03 23:59:59.999999. Spark 2.4 uses the Julian calendar and doesn’t conform to the standard. Spark 3.0 fixes the issue and applies the Proleptic Gregorian calendar in internal operations on timestamps such as getting year, month, day, etc. Due to different calendars, some dates that exist in Spark 2.4 don’t exist in Spark 3.0. For example, 1000-02-29 is not a valid date because 1000 isn’t a leap year in the Gregorian calendar. Also, Spark 2.4 resolves time zone name to zone offsets incorrectly for this timestamp range.
1582-10-04 00:00:00..1582-10-14 23:59:59.999999. This is a valid range of local timestamps in Spark 3.0, in contrast to Spark 2.4 where such timestamps didn’t exist.
1582-10-15 00:00:00..1899-12-31 23:59:59.999999. Spark 3.0 resolves time zone offsets correctly using historical data from IANA TZDB. Compared to Spark 3.0, Spark 2.4 might resolve zone offsets from time zone names incorrectly in some cases, as we showed above in the example.
1900-01-01 00:00:00..2036-12-31 23:59:59.999999. Both Spark 3.0 and Spark 2.4 conform to the ANSI SQL standard and use Gregorian calendar in date-time operations such as getting the day of the month.
2037-01-01 00:00:00..9999-12-31 23:59:59.999999. Spark 2.4 can resolve time zone offsets and in particular daylight saving time offsets incorrectly because of the JDK bug #8073446. Spark 3.0 does not suffer from this defect.
One more aspect of mapping time zone names to offsets is overlapping of local timestamps that can happen due to daylight saving time (DST) or switching to another standard time zone offset. For instance, on
3 November 2019, 02:00:00 clocks were turned backward 1 hour to
01:00:00. The local timestamp
2019-11-03 01:30:00 America/Los_Angeles can be mapped either to
2019-11-03 01:30:00 UTC-08:00 or
2019-11-03 01:30:00 UTC-07:00. If you don’t specify the offset and just set the time zone name (e.g.,
'2019-11-03 01:30:00 America/Los_Angeles'), Spark 3.0 will take the earlier offset, typically corresponding to “summer.” The behavior diverges from Spark 2.4 which takes the “winter” offset. In the case of a gap, where clocks jump forward, there is no valid offset. For a typical one-hour daylight saving time change, Spark will move such timestamps to the next valid timestamp corresponding to “summer” time.
As we can see from the examples above, the mapping of time zone names to offsets is ambiguous, and it is not one to one. In the cases when it is possible, we would recommend specifying exact time zone offsets when making timestamps, for example
'2019-11-03 01:30:00 UTC-07:00'.
Let’s move away from zone name to offset mapping, and look at the ANSI SQL standard. It defines two types of timestamps:
TIMESTAMP WITHOUT TIME ZONEor
TIMESTAMP– Local timestamp as (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND). These kinds of timestamps are not bound to any time zone, and actually are wall clock timestamps.
TIMESTAMP WITH TIME ZONE– Zoned timestamp as (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, TIMEZONE_HOUR, TIMEZONE_MINUTE). The timestamps represent an instant in the UTC time zone + a time zone offset (in hours and minutes) associated with each value.
The time zone offset of a
TIMESTAMP WITH TIME ZONE does not affect the physical point in time that the timestamp represents, as that is fully represented by the UTC time instant given by the other timestamp components. Instead, the time zone offset only affects the default behavior of a timestamp value for display, date/time component extraction (e.g.
EXTRACT), and other operations that require knowing a time zone, such as adding months to a timestamp.
Spark SQL defines the timestamp type as
TIMESTAMP WITH SESSION TIME ZONE, which is a combination of the fields (
SESSION TZ) where the
SECOND field identify a time instant in the UTC time zone, and where
SESSION TZ is taken from the SQL config
spark.sql.session.timeZone. The session time zone can be set as:
- Zone offset
'(+|-)HH:mm'. This form allows us to define a physical point in time unambiguously.
- Time zone name in the form of region ID
'area/city', such as
'America/Los_Angeles'. This form of time zone info suffers from some of the problems that we described above like overlapping of local timestamps. However, each UTC time instant is unambiguously associated with one time zone offset for any region ID, and as a result, each timestamp with a region ID based time zone can be unambiguously converted to a timestamp with a zone offset.
By default, the session time zone is set to the default time zone of the Java virtual machine.
TIMESTAMP WITH SESSION TIME ZONE is different from:
TIMESTAMP WITHOUT TIME ZONE, because a value of this type can map to multiple physical time instants, but any value of
TIMESTAMP WITH SESSION TIME ZONEis a concrete physical time instant. The SQL type can be emulated by using one fixed time zone offset across all sessions, for instance UTC+0. In that case, we could consider timestamps at UTC as local timestamps.
TIMESTAMP WITH TIME ZONE, because according to the SQL standard column values of the type can have different time zone offsets. That is not supported by Spark SQL.
We should notice that timestamps that are associated with a global (session scoped) time zone are not something newly invented by Spark SQL. RDBMSs such as Oracle provide a similar type for timestamps too: TIMESTAMP WITH LOCAL TIME ZONE.
Constructing dates and timestamps
Spark SQL provides a few methods for constructing date and timestamp values:
- Default constructors without parameters:
- From other primitive Spark SQL types, such as
- From external types like Python
datetimeor Java classes
- Deserialization from data sources CSV, JSON, Avro, Parquet, ORC or others.
MAKE_DATE introduced in Spark 3.0 takes three parameters:
MONTH of the year, and
DAY in the month and makes a
DATE value. All input parameters are implicitly converted to the
INT type whenever possible. The function checks that the resulting dates are valid dates in the Proleptic Gregorian calendar, otherwise it returns
NULL. For example in PySpark:
>>> spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)], ... ['Y', 'M', 'D']).createTempView('YMD') >>> df = sql('select make_date(Y, M, D) as date from YMD') >>> df.printSchema() root |-- date: date (nullable = true)
To print DataFrame content, let’s call the
show() action, which converts dates to strings on executors and transfers the strings to the driver to output them on the console:
>>> df.show() +-----------+ | date| +-----------+ | 2020-06-26| | null| |-0044-01-01| +-----------+
Similarly, we can make timestamp values via the
MAKE_TIMESTAMP functions. Like
MAKE_DATE, it performs the same validation for date fields, and additionally accepts time fields
MINUTE (0-59) and
SECOND has the type
Decimal(precision = 8, scale = 6) because seconds can be passed with the fractional part up to microsecond precision. For example in PySpark:
>>> df = spark.createDataFrame([(2020, 6, 28, 10, 31, 30.123456), ... (1582, 10, 10, 0, 1, 2.0001), (2019, 2, 29, 9, 29, 1.0)], ... ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND']) >>> df.show() +----+-----+---+----+------+---------+ |YEAR|MONTH|DAY|HOUR|MINUTE| SECOND| +----+-----+---+----+------+---------+ |2020| 6| 28| 10| 31|30.123456| |1582| 10| 10| 0| 1| 2.0001| |2019| 2| 29| 9| 29| 1.0| +----+-----+---+----+------+---------+ >>> ts = df.selectExpr("make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND) as MAKE_TIMESTAMP") >>> ts.printSchema() root |-- MAKE_TIMESTAMP: timestamp (nullable = true)
As we did for dates, let’s print the content of the
ts DataFrame using the
show() action. In a similar way, show() converts timestamps to strings but now it takes into account the session time zone defined by the SQL config
spark.sql.session.timeZone. We will see that in the following examples.
>>> ts.show(truncate=False) +--------------------------+ |MAKE_TIMESTAMP | +--------------------------+ |2020-06-28 10:31:30.123456| |1582-10-10 00:01:02.0001 | |null | +--------------------------+
Spark cannot create the last timestamp because this date is not valid: 2019 is not a leap year.
You might notice that we didn’t provide any time zone information in the example above. In that case, Spark takes a time zone from the SQL configuration
spark.sql.session.timeZone and applies it to function invocations. You can also pick a different time zone by passing it as the last parameter of
MAKE_TIMESTAMP. Here is an example in PySpark:
>>> df = spark.createDataFrame([(2020, 6, 28, 10, 31, 30, 'UTC'), ... (1582, 10, 10, 0, 1, 2, 'America/Los_Angeles'), ... (2019, 2, 28, 9, 29, 1, 'Europe/Moscow')], ... ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'TZ']) >>> df = df.selectExpr('make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, TZ) as MAKE_TIMESTAMP') >>> df = df.selectExpr("date_format(MAKE_TIMESTAMP, 'yyyy-MM-dd HH:mm:SS VV') AS TIMESTAMP_STRING") >>> df.show(truncate=False) +---------------------------------+ |TIMESTAMP_STRING | +---------------------------------+ |2020-06-28 13:31:00 Europe/Moscow| |1582-10-10 10:24:00 Europe/Moscow| |2019-02-28 09:29:00 Europe/Moscow| +---------------------------------+
As the example demonstrates, Spark takes into account the specified time zones but adjusts all local timestamps to the session time zone. The original time zones passed to the
MAKE_TIMESTAMP function will be lost because the
TIMESTAMP WITH SESSION TIME ZONE type assumes that all values belong to one time zone, and it doesn’t even store a time zone per every value. According to the definition of the
TIMESTAMP WITH SESSION TIME ZONE, Spark stores local timestamps in the UTC time zone, and uses the session time zone while extracting date-time fields or converting the timestamps to strings.
Also, timestamps can be constructed from the
LONG type via casting. If a
LONG column contains the number of seconds since the epoch 1970-01-01 00:00:00Z, it can be cast to Spark SQL’s
spark-sql> select CAST(-123456789 AS TIMESTAMP); 1966-02-02 05:26:51
Unfortunately, this approach doesn’t allow us to specify the fractional part of seconds. In the future, Spark SQL will provide special functions to make timestamps from seconds, milliseconds and microseconds since the epoch:
Another way is to construct dates and timestamps from values of the
STRING type. We can make literals using special keywords:
spark-sql> select timestamp '2020-06-28 22:17:33.123456 Europe/Amsterdam', date '2020-07-01'; 2020-06-28 23:17:33.123456 2020-07-01
or via casting that we can apply for all values in a column:
spark-sql> select cast('2020-06-28 22:17:33.123456 Europe/Amsterdam' as timestamp), cast('2020-07-01' as date); 2020-06-28 23:17:33.123456 2020-07-01
The input timestamp strings are interpreted as local timestamps in the specified time zone or in the session time zone if a time zone is omitted in the input string. Strings with unusual patterns can be converted to timestamp using the
to_timestamp() function. The supported patterns are described in Datetime Patterns for Formatting and Parsing:
spark-sql> select to_timestamp('28/6/2020 22.17.33', 'dd/M/yyyy HH.mm.ss'); 2020-06-28 22:17:33
The function behaves similarly to
CAST if you don’t specify any pattern.
For usability, Spark SQL recognizes special string values in all methods above that accept a string and return a timestamp and date:
- epoch is an alias for date ‘1970-01-01’ or timestamp
- now is the current timestamp or date at the session time zone. Within a single query it always produces the same result.
- today is the beginning of the current date for the
TIMESTAMPtype or just current date for the
- tomorrow is the beginning of the next day for timestamps or just the next day for the
- yesterday is the day before current one or its beginning for the
spark-sql> select timestamp 'yesterday', timestamp 'today', timestamp 'now', timestamp 'tomorrow'; 2020-06-27 00:00:00 2020-06-28 00:00:00 2020-06-28 23:07:07.18 2020-06-29 00:00:00 spark-sql> select date 'yesterday', date 'today', date 'now', date 'tomorrow'; 2020-06-27 2020-06-28 2020-06-28 2020-06-29
One of Spark’s great features is creating
Datasets from existing collections of external objects at the driver side, and creating columns of corresponding types. Spark converts instances of external types to semantically equivalent internal representations. PySpark allows to create a
TIMESTAMP columns from Python collections, for instance:
>>> import datetime >>> df = spark.createDataFrame([(datetime.datetime(2020, 7, 1, 0, 0, 0), ... datetime.date(2020, 7, 1))], ['timestamp', 'date']) >>> df.show() +-------------------+----------+ | timestamp| date| +-------------------+----------+ |2020-07-01 00:00:00|2020-07-01| +-------------------+----------+
PySpark converts Python’s datetime objects to internal Spark SQL representations at the driver side using the system time zone, which can be different from Spark’s session time zone settings
spark.sql.session.timeZone. The internal values don’t contain information about the original time zone. Future operations over the parallelized dates and timestamps value will take into account only Spark SQL sessions time zone according to the
TIMESTAMP WITH SESSION TIME ZONE type definition.
In a similar way as we demonstrated above for Python collections, Spark recognizes the following types as external date-time types in Java/Scala APIs:
- java.sql.Date and java.time.LocalDate as external types for Spark SQL’s DATE type
- java.sql.Timestamp and java.time.Instant for the TIMESTAMP type.
There is a difference between
java.time.* types. The
java.time.Instant were added in Java 8, and the types are based on the Proleptic Gregorian calendar — the same calendar that is used by Spark from version 3.0. The
java.sql.Timestamp have another calendar underneath — the hybrid calendar (Julian + Gregorian since
1582-10-15), which is the same as the legacy calendar used by Spark versions before 3.0. Due to different calendar systems, Spark has to perform additional operations during conversions to internal Spark SQL representations, and rebase input dates/timestamp from one calendar to another. The rebase operation has a little overhead for modern timestamps after the year 1900, and it can be more significant for old timestamps.
The example below shows making timestamps from Scala collections. In the first example, we construct a
java.sql.Timestamp object from a string. The
valueOf method interprets the input strings as a local timestamp in the default JVM time zone which can be different from Spark’s session time zone. If you need to construct instances of
java.sql.Date in specific time zone, we recommend to have a look at
java.text.SimpleDateFormat (and its method setTimeZone) or
scala> Seq(java.sql.Timestamp.valueOf("2020-06-29 22:41:30"), new java.sql.Timestamp(0)).toDF("ts").show(false) +-------------------+ |ts | +-------------------+ |2020-06-29 22:41:30| |1970-01-01 03:00:00| +-------------------+ scala> Seq(java.time.Instant.ofEpochSecond(-12219261484L), java.time.Instant.EPOCH).toDF("ts").show +-------------------+ | ts| +-------------------+ |1582-10-15 11:12:13| |1970-01-01 03:00:00| +-------------------+
Similarly, we can make a DATE column from collections of
java.LocalDate. Parallelization of
java.LocalDate instances is fully independent of either Spark’s session time zone or JVM default time zone, but we cannot say the same about parallelization of
java.sql.Date instances. There are nuances:
java.sql.Dateinstances represent local dates at the default JVM time zone on the driver
- For correct conversions to Spark SQL values, the default JVM time zone on the driver and executors must be the same.
scala> Seq(java.time.LocalDate.of(2020, 2, 29), java.time.LocalDate.now).toDF("date").show +----------+ | date| +----------+ |2020-02-29| |2020-06-29| +----------+
To avoid any calendar and time zone related issues, we recommend Java 8 types
java.LocalDate/Instant as external types in parallelization of Java/Scala collections of timestamps or dates.
Collecting dates and timestamps
The reverse operation for parallelization is collecting dates and timestamps from executors back to the driver and returning a collection of external types. For example above, we can pull the DataFrame back to the driver via the
>>> df.collect() [Row(timestamp=datetime.datetime(2020, 7, 1, 0, 0), date=datetime.date(2020, 7, 1))]
Spark transfers internal values of dates and timestamps columns as time instants in the UTC time zone from executors to the driver, and performs conversions to Python datetime objects in the system time zone at the driver, not using Spark SQL session time zone.
collect() is different from the
show() action described in the previous section.
show() uses the session time zone while converting timestamps to strings, and collects the resulted strings on the driver.
In Java and Scala APIs, Spark performs the following conversions by default:
- Spark SQL’s
DATEvalues are converted to instances of
- Timestamps are converted to instances of
Both conversions are performed in the default JVM time zone on the driver. In this way, to have the same date-time fields that we can get via
getHour(), etc. and via Spark SQL functions
HOUR, the default JVM time zone on the driver and the session time zone on executors should be the same.
Similarly to making dates/timestamps from
java.sql.Date/Timestamp, Spark 3.0 performs rebasing from the Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian). This operation is almost free for modern dates (after the year 1582) and timestamps (after the year 1900), but it could bring some overhead for ancient dates and timestamps.
We can avoid such calendar-related issues, and ask Spark to return
java.time types, which were added since Java 8. If we set the SQL config
spark.sql.datetime.java8API.enabled to true, the
Dataset.collect() action will return:
java.time.LocalDatefor Spark SQL’s
java.time.Instantfor Spark SQL’s
Now the conversions don’t suffer from the calendar-related issues because Java 8 types and Spark SQL 3.0 are both based on the Proleptic Gregorian calendar. The
collect() action doesn’t depend on the default JVM time zone any more. The timestamp conversions don’t depend on time zone at all. Regarding to date conversion, it uses the session time zone from the SQL config
spark.sql.session.timeZone. For example, let’s look at a Dataset with
TIMESTAMP columns, set the default JVM time zone to Europe/Moscow, but the session time zone to
scala> java.util.TimeZone.getDefault res1: java.util.TimeZone = sun.util.calendar.ZoneInfo[id="Europe/Moscow",...] scala> spark.conf.get("spark.sql.session.timeZone") res2: String = America/Los_Angeles scala> df.show +-------------------+----------+ | timestamp| date| +-------------------+----------+ |2020-07-01 00:00:00|2020-07-01| +-------------------+----------+
show() action prints the timestamp at the session time
America/Los_Angeles, but if we collect the Dataset, it will be converted to
java.sql.Timestamp and printed at
Europe/Moscow by the
scala> df.collect() res16: Array[org.apache.spark.sql.Row] = Array([2020-07-01 10:00:00.0,2020-07-01]) scala> df.collect()(0).getAs[java.sql.Timestamp](0).toString res18: java.sql.Timestamp = 2020-07-01 10:00:00.0
Actually, the local timestamp 2020-07-01 00:00:00 is 2020-07-01T07:00:00Z at UTC. We can observe that if we enable Java 8 API and collect the Dataset:
scala> df.collect() res27: Array[org.apache.spark.sql.Row] = Array([2020-07-01T07:00:00Z,2020-07-01])
java.time.Instant object can be converted to any local timestamp later independently from the global JVM time zone. This is one of the advantages of
java.sql.Timestamp. The former one requires changing the global JVM setting, which influences other timestamps on the same JVM. Therefore, if your applications process dates or timestamps in different time zones, and the applications should not clash with each other while collecting data to the driver via Java/Scala
Dataset.collect() API, we recommend switching to Java 8 API using the SQL config
In this blog post, we described Spark SQL
TIMESTAMP types. We showed how to construct date and timestamp columns from other primitive Spark SQL types and external Java types, and how to collect date and timestamp columns back to the driver as external Java types. Since version 3.0, Spark switched from the hybrid calendar, which combines Julian and Gregorian calendars, to the Proleptic Gregorian calendar (see SPARK-26651 for more details). This allowed Spark to eliminate many issues such as we demonstrated earlier. For backward compatibility with previous versions, Spark still returns timestamps and dates in the hybrid calendar (
java.sql.Timestamp) from the collect like actions. To avoid calendar and time zone resolution issues when using the Java/Scala’s collect actions, Java 8 API can be enabled via the SQL config
spark.sql.datetime.java8API.enabled. Try it out today free on Databricks as part of our Databricks Runtime 7.0.
Free 2nd Edition includes updates on Spark 3.0, including the new Python type hints for Pandas UDFs, new date/time implementation, etc.