Working with UNIX timestamps in Python
So, like many others, I thought I knew how to work with UNIX timestamps in Python. After all, I pride myself on being pretty well-versed in my one favorite language!
However, I, also like many others, started using MongoDB recently. MongoDB really forces you to think about timezones, so I argued to myself that this is the equivalent of going from str to unicode: more work and little immediate payoff, but a lot more correct, and in the end it saves multiple headaches.
So MongoDB, like many others, stores Epoch offsets in 64-bit signed integers (milliseconds since the Epoch). It doesn't store any timezones or anything (note: timestamps without definite timezones are called naïve).
Why should you care? What's the point, you ask? Where is the culmination? Well, as it turns out, there are subtleties. Very subtle subtleties. Here's how we receive messages on a site I develop:
- Incoming HTTP/SMSC/whatever call with message data,
- timezone is inferred depending on transport (let's say it's CET),
- either use the current UTC time, or infer from message data the timestamp and convert to UTC,
- store this UTC timestamp.
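A minimal sketch of the normalization steps above, not the site's actual code. The helper name `to_utc` is made up for this example, and CET is modelled as a fixed UTC+1 offset purely for illustration (which ignores summer time).

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET as a fixed UTC+1 offset. A real deployment would use a
# proper timezone database so that summer time is handled.
CET = timezone(timedelta(hours=1), "CET")


def to_utc(local_dt=None, transport_tz=CET):
    """Return an aware UTC datetime for an incoming message (hypothetical helper)."""
    if local_dt is None:
        # No timestamp in the message data: use the current UTC time.
        return datetime.now(timezone.utc)
    # Timestamp taken from the message data: attach the inferred zone,
    # then convert to UTC.
    return local_dt.replace(tzinfo=transport_tz).astimezone(timezone.utc)


stored = to_utc(datetime(2012, 3, 1, 13, 30))
print(stored.isoformat())  # 2012-03-01T12:30:00+00:00
```

The value that gets stored is always UTC, whatever the transport's timezone was.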
The important part: this works fine. There is absolutely nothing wrong with this process. What fails, of course, is the part you'd least expect to fail - our pagination, which is done with Epoch offsets:
- Pageful of messages is requested,
- find the last message in the dataset,
- take whatever attribute we sorted by on this particular message,
- convert that timestamp to an integer offset from Epoch.
How on earth could this fail? It's such a dead-simple task. Let's pseudocode it!
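    # Why isn't this part of the standard library?
    import time

    def dt2unix(dt):
        # Seconds since the Epoch, with microseconds as the fractional part
        return time.mktime(dt.timetuple()) + (dt.microsecond / 10.0 ** 6)

    def next_page_offset(message_set, sort="created"):
        # Newest first, so the last element holds the page's lowest sort value
        vals = sorted(message_set, key=lambda m: m[sort], reverse=True)
        return dt2unix(vals[-1][sort])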
Seems fairly straightforward, right? Take the lowest value for the key by which the entire dataset was sorted and make use of the fact that the next page's max(m[sort]) is less than this page's min(m[sort]).
As it turns out, no - this will behave in very odd ways on some machines. Yes, some machines - which ones will become evident soon.
The error is best explained by converting the Epoch offsets into datetime objects again:
    import datetime

    def unix2dt(offset):
        return datetime.datetime.fromtimestamp(float(offset))
We should expect unix2dt(dt2unix(dt)) == dt to hold, and it does! So what's the fuss about? Well...
    >>> print(dt)
    1970-01-01 00:00:00+00:00
    >>> unix2dt(dt2unix(dt))
    datetime.datetime(1970, 1, 1, 0, 0)
    >>> dt2unix(dt)
    -3600.0
Uh-oh. Shouldn't this give zero? Well, no. The answer is that datetime.datetime.fromtimestamp and time.mktime both work not with naïve timestamps, but with local time: they assume the date and time fields are expressed in the machine's own timezone.
So, the time functions in this case are compensating for the local timezone, which is CET! Hardly something you want them to be doing, I'd argue, for something like an offset (since it will lead to the same offset occurring multiple times during DST adjustment) - but hey.
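To see the compensation in isolation, here is a small Unix-only sketch that feeds the same Epoch fields to time.mktime under two different process timezones. The helper `mktime_in_zone` is made up for this example, and it assumes a platform where time.tzset() is available (so not Windows).

```python
import os
import time

# The fields of 1970-01-01 00:00:00 (a Thursday), with no DST in effect.
EPOCH_FIELDS = (1970, 1, 1, 0, 0, 0, 3, 1, 0)


def mktime_in_zone(tz_string):
    """Run time.mktime on the same fields with the process timezone set to tz_string."""
    previous = os.environ.get("TZ")
    os.environ["TZ"] = tz_string
    time.tzset()
    try:
        return time.mktime(EPOCH_FIELDS)
    finally:
        # Restore whatever timezone the process had before.
        if previous is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = previous
        time.tzset()


# POSIX TZ strings invert the sign: "CET-1" means one hour ahead of UTC.
print(mktime_in_zone("UTC0"))   # 0.0
print(mktime_in_zone("CET-1"))  # -3600.0
```

Same input, different offset, depending only on how the machine is configured. That is exactly the "some machines" behaviour from earlier.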
We now know what is wrong, but how do we make them stop? One way is to just use UTC as the timezone. Server owners should really do this anyway, but timezones exist and they serve a good purpose.
This is perhaps the second issue with Python's standard library and timestamps: it nowhere mentions how to do it without adjustment, except in the online HTML documentation. Personally I like using pydoc as a reference...
Converting from a UNIX timestamp into a non-compensated datetime is fairly easy: just use utcfromtimestamp and there we are (not sure why it says UTC though; this could really be any offset that shouldn't be compensated).
Converting to a UNIX timestamp without compensating is a bit less obvious, even to seasoned Pythonistas, because what you're looking for is actually calendar.timegm.
So we can now rewrite our two converter functions to look like this:
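    import datetime
    from calendar import timegm

    def dt2unix_utc(dt):
        # timegm() reads the time tuple as UTC, so dt must already be in UTC
        return timegm(dt.timetuple()) + (dt.microsecond / 10.0 ** 6)

    def unix2dt_utc(offset):
        # Returns a naïve datetime whose fields are in UTC
        return datetime.datetime.utcfromtimestamp(offset)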
Again with the *_utc thing; they should probably be called *_global or something. I later extended this to use the very excellent pytz library to set the UTC tzinfo.
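For illustration, here is roughly what a tzinfo-aware variant might look like. This is a stand-in, not the pytz-based code mentioned above: it uses the standard library's datetime.timezone.utc (Python 3) instead of pytz, and the `_aware` function names are made up for this sketch.

```python
import datetime
from calendar import timegm

UTC = datetime.timezone.utc


def dt2unix_aware(dt):
    """Aware datetime -> Epoch offset, independent of the machine's timezone."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError("expected a timezone-aware datetime")
    # utctimetuple() normalises to UTC first, so any zone is accepted.
    return timegm(dt.utctimetuple()) + dt.microsecond / 10.0 ** 6


def unix2dt_aware(offset):
    """Epoch offset -> aware datetime carrying an explicit UTC tzinfo."""
    return datetime.datetime.fromtimestamp(offset, tz=UTC)


epoch = datetime.datetime(1970, 1, 1, tzinfo=UTC)
print(dt2unix_aware(epoch))       # 0.0
print(unix2dt_aware(0) == epoch)  # True
```

Passing tz= to fromtimestamp is also what the Python documentation recommends in place of utcfromtimestamp, which is deprecated as of Python 3.12.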
So, that will be all then! The complexity of this really speaks for why Python should incorporate a set of functions for doing these things. I'd suggest making it obnoxiously clear that a compensation can be made:
- from_epoch(offset, compensate=False)
- to_epoch(dt, compensate=False)
If you pass a naïve datetime to these and tell them not to compensate, I think an exception should be raised. The point in compensating, I presume, is to make UNIX timestamps comparable across timezones. Ah well.
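A rough sketch of how such a pair could behave. This is only one reading of the proposal: neither function exists in the standard library, and the choice of ValueError and of returning an aware UTC datetime from from_epoch are assumptions of this example.

```python
import datetime
import time
from calendar import timegm


def to_epoch(dt, compensate=False):
    """Datetime -> Epoch offset (hypothetical API, not part of Python)."""
    fraction = dt.microsecond / 10.0 ** 6
    if compensate:
        # Compensating: read the fields as machine-local time, like mktime().
        return time.mktime(dt.timetuple()) + fraction
    if dt.tzinfo is None or dt.utcoffset() is None:
        # Without a timezone there is nothing to anchor the offset to.
        raise ValueError("naive datetime requires compensate=True")
    return timegm(dt.utctimetuple()) + fraction


def from_epoch(offset, compensate=False):
    """Epoch offset -> datetime (hypothetical API, not part of Python)."""
    if compensate:
        # Naive datetime in machine-local time, as fromtimestamp() does by default.
        return datetime.datetime.fromtimestamp(offset)
    return datetime.datetime.fromtimestamp(offset, tz=datetime.timezone.utc)


print(from_epoch(0))            # 1970-01-01 00:00:00+00:00
print(to_epoch(from_epoch(0)))  # 0.0
```

With the flag left at its default, the result no longer depends on where the server happens to live.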