Skip to content
Home ยป Spark 3.0: Solving the “dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z” error

Spark 3.0: Solving the “dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z” error

  • by
  • 2 min read

In the past couple of weeks, I’ve been working on a project which users Spark pools in Azure Synapse. However, this appears to be a general Spark issue. I was unable to write to delta lake using Spark because I received the following error.

You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+’s Proleptic Gregorian calendar.
See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to ‘LEGACY’ to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to ‘CORRECTED’ to read the datetime values as it is.

First of all, what causes this? Apparently Spark 3.0 has issues reading very old dates (before the year 1582) and timestamps (before 1900). This is due to Spark 3.0 using the Proleptic Gregorian calendar instead of the hybrid Gregorian/Julian calendar. To solve this, there are two things you should do.

How to fix reading data

To be able to read the data into memory, you should update your spark configuration as follows.

spark.conf.set('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED')

If you’re adjusting the setting in Synapse, you should set this specific setting in the Spark configuration file, and reload your Spark pool.

How to fix writing data

It’s not because you can now read the data into memory, that you’ll be able to write the data. For example, I couldn’t write the data to delta lake, as long as it contained these erroneous dates.

To fix this, you can run the following Spark script. It will loop over all date columns and changes weird date values to ‘1900-01-01’.

    date_cols = [item[0] for item in sdf.dtypes if item[1].startswith('date')]
    for date_col in date_cols:
        sdf = sdf
			.withColumn(date_col, 
				F.when(
					F.col(date_col) <= '1900-01-01', 
					F.to_date(F.lit('1900-01-01'), 'yyyy-MM-dd'))
				.otherwise(F.col(date_col)))

Good luck!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published.