It happened too many times before I write this blog post. Oftentimes, when I read Excel (xls or xlsx) files into R, I encounter this strange phenomenon where dates are converted to a 5 number integer. Here is how to fix it.
For example, 02/01/2017 (dd/MM/yyyy) would be converted to 42737. It’s not a normal epoch, because epoch dates count the number of seconds since 01/01/1970. By adding extra trailing zeros to the integer, you will still get weird results.
What’s happening is that Excel is actually counting the days since 01/01/1900, with that day being day 1. Microsoft confirms that the designers of Lotus 1-2-3 were unaware that 1900 wasn’t a leap year. The bug was purposefully implemented in Excel to ensure backward compatibility.
This bug has already caused many errors on Microsoft, Sony and Google services.
Now that we know this, we need to add an extra day to account for that.
as.Date(42737, origin="1899-12-30")
On a final note: on a mac, the reference year is 01/01/1904.
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction into R, and less into the Data Science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
Great success!