868 reads

How Wrong Timestamp Formatter Can Steal a Few Hours of Your Time

by Ilia ZaitsevNovember 30th, 2019

Too Long; Didn't Read

There is a whole lot of methods to represent date and time in computers. You can store it as Unix epoch, string, with or without a timezone, and so on. Transforming it into something more human-friendly, something like a year, month, day, etc. The wrong time formatter could still still be your time. The bug spotted by PySpark is a mistake that took a few hours to figure out. It is useful when you need to display time in AM/PM format but not something one would like to see during data analysis.

People Mentioned

featured image - How Wrong Timestamp Formatter Can Steal a Few Hours of Your Time

There is a whole lot of methods to represent date and time in computers. You can store it as Unix epoch, string, with or without a timezone, and so on. No doubt, transforming from one of these many representations into another could be error-prone! That's precisely a kind of issue that has happened to me recently and took a few hours to figure out. Go ahead to read this short story about how the wrong time formatter could still your time.

Time Since Epoch

I am sure that most of you are familiar with this way to represent time: as the number of seconds since 1970. Usually, it looks like this.

1546297300.400974
1546297344.968518
1546297402.914051
1546297466.029219

Or, as an integer, probably a bit truncated to exclude milliseconds.

Now let's pretend that we have a dataset where time is represented using this format. Also, we know that there are 24 hours of data in total, and the dataset is pretty large, so we would like to process is with PySpark.Though convenient for machines, time since epoch format is not convenient to deal with for humans. You would like to transform it into something more human-friendly, something like a year, month, day, etc. For this purpose, you can convert the timestamp into a string with some convenient formatter. Which formatter you're going to use? How about the one shown below? (Let's say we can ignore the timezone, and don't need milliseconds for our purposes).

fmt = "YYYY-MM-dd hh:mm:ss"

We can apply it to our dataset as follows.

col = f.col('ts').cast(TimestampType())
formatter = f.date_format(col, fmt)
data = data.withColumn('datetime', formatter)

Ok great! Now you save the transformed dataset into some persistent storage and drop the original Unix timestamp column. So far, so good.

Convert Single Column into Many

Sometime later, you pick up the dataset prepared on the previous step and decide to perform a bit of Machine Learning. You load the dataset with a nice human-friendly date and time column and want to split it into several features that are convenient to send into a model's training algorithm, like a year, month, hour, etc.

data = data.withColumn('datetime', formatter)
data = data.withColumn('year', f.year('datetime'))
data = data.withColumn('month', f.month('datetime'))
data = data.withColumn('day', f.dayofmonth('datetime'))
data = data.withColumn('hour', f.hour('datetime'))
data = data.withColumn('minute', f.minute('datetime'))

Done! But see, what is going on with the hours?

+----------+-------------------+----+-----+---+----+------+
|        ts|           datetime|year|month|day|hour|minute|
+----------+-------------------+----+-----+---+----+------+
|1546300800|2019-01-01 12:00:00|2019|    1|  1|  12|     0|
|1546304400|2019-01-01 01:00:00|2019|    1|  1|   1|     0|
|1546308000|2019-01-01 02:00:00|2019|    1|  1|   2|     0|
|1546311600|2019-01-01 03:00:00|2019|    1|  1|   3|     0|
|1546315200|2019-01-01 04:00:00|2019|    1|  1|   4|     0|
|1546318800|2019-01-01 05:00:00|2019|    1|  1|   5|     0|
|1546322400|2019-01-01 06:00:00|2019|    1|  1|   6|     0|
|1546326000|2019-01-01 07:00:00|2019|    1|  1|   7|     0|
|1546329600|2019-01-01 08:00:00|2019|    1|  1|   8|     0|
|1546333200|2019-01-01 09:00:00|2019|    1|  1|   9|     0|
|1546336800|2019-01-01 10:00:00|2019|    1|  1|  10|     0|
|1546340400|2019-01-01 11:00:00|2019|    1|  1|  11|     0|
|1546344000|2019-01-01 12:00:00|2019|    1|  1|  12|     0|
|1546347600|2019-01-01 01:00:00|2019|    1|  1|   1|     0|
|1546351200|2019-01-01 02:00:00|2019|    1|  1|   2|     0|
|1546354800|2019-01-01 03:00:00|2019|    1|  1|   3|     0|
|1546358400|2019-01-01 04:00:00|2019|    1|  1|   4|     0|
|1546362000|2019-01-01 05:00:00|2019|    1|  1|   5|     0|
|1546365600|2019-01-01 06:00:00|2019|    1|  1|   6|     0|
|1546369200|2019-01-01 07:00:00|2019|    1|  1|   7|     0|
+----------+-------------------+----+-----+---+----+------+

Why does our hours' range contain only twelve hours? You know that the original dataset had the full range, from 0 to 23. How could it be that now it is reduced to a smaller range of values? It seems like something is wrong with timestamp parsing.

The Bug

I believe that an attentive reader has already spotted my mistake. (And probably laughed a lot). Check the formatting string we use one more time.

fmt = "YYYY-MM-dd hh:mm:ss"

Do you see it? Right, these two lowercase letters h. Their meaning is not what you could expect. This time formatter returns you time in o'clock format, i.e., in the range from one to twelve! It is useful when you need to display your time in AM/PM format but not something one would like to see during data analysis.

Beware that the lowercase h in the timestamp means time o'clock, while the uppercase H is from 0 to 23.

The solution is simple: replace these misleading lowercase letters with the uppercase H, and you'll get your 0..23 range, as the following snippet shows.

fmt = "YYYY-MM-dd HH:mm:ss"
col = f.col('ts').cast(TimestampType())
formatter = f.date_format(col, fmt)
data = data.withColumn('datetime', formatter)
data = data.withColumn('year', f.year('datetime'))
data = data.withColumn('month', f.month('datetime'))
data = data.withColumn('day', f.dayofmonth('datetime'))
data = data.withColumn('hour', f.hour('datetime'))
data = data.withColumn('minute', f.minute('datetime'))
data.show()

Now the output looks as it was expected!

+----------+-------------------+----+-----+---+----+------+
|        ts|           datetime|year|month|day|hour|minute|
+----------+-------------------+----+-----+---+----+------+
|1546300800|2019-01-01 00:00:00|2019|    1|  1|   0|     0|
|1546304400|2019-01-01 01:00:00|2019|    1|  1|   1|     0|
|1546308000|2019-01-01 02:00:00|2019|    1|  1|   2|     0|
|1546311600|2019-01-01 03:00:00|2019|    1|  1|   3|     0|
|1546315200|2019-01-01 04:00:00|2019|    1|  1|   4|     0|
|1546318800|2019-01-01 05:00:00|2019|    1|  1|   5|     0|
|1546322400|2019-01-01 06:00:00|2019|    1|  1|   6|     0|
|1546326000|2019-01-01 07:00:00|2019|    1|  1|   7|     0|
|1546329600|2019-01-01 08:00:00|2019|    1|  1|   8|     0|
|1546333200|2019-01-01 09:00:00|2019|    1|  1|   9|     0|
|1546336800|2019-01-01 10:00:00|2019|    1|  1|  10|     0|
|1546340400|2019-01-01 11:00:00|2019|    1|  1|  11|     0|
|1546344000|2019-01-01 12:00:00|2019|    1|  1|  12|     0|
|1546347600|2019-01-01 13:00:00|2019|    1|  1|  13|     0|
|1546351200|2019-01-01 14:00:00|2019|    1|  1|  14|     0|
|1546354800|2019-01-01 15:00:00|2019|    1|  1|  15|     0|
|1546358400|2019-01-01 16:00:00|2019|    1|  1|  16|     0|
|1546362000|2019-01-01 17:00:00|2019|    1|  1|  17|     0|
|1546365600|2019-01-01 18:00:00|2019|    1|  1|  18|     0|
|1546369200|2019-01-01 19:00:00|2019|    1|  1|  19|     0|
|1546372800|2019-01-01 20:00:00|2019|    1|  1|  20|     0|
|1546376400|2019-01-01 21:00:00|2019|    1|  1|  21|     0|
|1546380000|2019-01-01 22:00:00|2019|    1|  1|  22|     0|
|1546383600|2019-01-01 23:00:00|2019|    1|  1|  23|     0|
+----------+-------------------+----+-----+---+----+------+

Even better approach would be to keep the original timestamp in the dataset so you always have the "raw" data at hand, and can apply any manipulation you want without getting into troubles with parsing strings back into proper time objects.

Such a small thing like the wrong time formatter string could cost you a few hours of debugging. Sure enough, you would better spend these hours doing something more useful and interesting. Watch your time formatting and back to the exciting projects!

If you're interested in some other posts where I write about Machine Learning and Python language in general, or you would like to reach me out, check out where the references to other stories and projects are listed.

References