Django, Unicode and Caching
So yeah, I was trying to store data serialized with Python's marshal module in memcached via Django's cache support, and I got this error:
  230. data = cache.get(key)
File "django/core/cache/backends/memcached.py" in get
  30. return smart_unicode(val)
File "django/utils/encoding.py" in smart_unicode
  44. return force_unicode(s, encoding, strings_only, errors)
File "django/utils/encoding.py" in force_unicode
  92. raise DjangoUnicodeDecodeError(s, *e.args)
In force_unicode. What the hell? I'm retrieving data from memcached, and basically Django tries to protect me from the oh-so-confusing world of encodings, and at the same time assumes nobody will ever store binary data in a cache.
Well, this is really a Python problem :)
The issue here is that you have some binary data that you need to treat as a chunk of bytes. But Python 2.x doesn't have a separate data type for that (its str type doubles as the byte container), and so you end up putting it into a string with marshal.dumps() and sending the string to memcached.
Django, meanwhile, has a policy of always using Unicode strings everywhere, since mixing bytestrings and Unicode strings is the path to madness. So when Django sees a string coming back out of the cache, it wants to make sure it's a Unicode string, and doesn't have any way to know that it's really non-string-data-masquerading-as-a-string.
The solution is about ten lines of code and some careful use of cache keys. Write a subclass of the memcached cache backend, and override get(), and decide on a prefix or suffix you'll use for keys that are binary-values-hiding-in-strings. Then have get() skip the Unicode conversion when you're fetching one of those keys.
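A rough sketch of that idea follows. To keep it runnable on its own it does not import Django: TextOnlyBackend is a made-up stand-in for the real memcached backend, the "bin:" prefix is an arbitrary choice, and it is written for Python 3, where bytes is a distinct type. Treat it as an illustration of the approach, not as the backend's actual API.

```python
import marshal

BINARY_PREFIX = "bin:"  # hypothetical key prefix marking binary values


class TextOnlyBackend:
    """Stand-in for Django's memcached backend (an assumption, not the real class).

    Like the get() in the traceback, it always converts bytes to text.
    """

    def __init__(self):
        self._store = {}  # a dict plays the role of the memcached server

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        val = self._store.get(key)
        if val is None:
            return default
        if isinstance(val, bytes):
            return val.decode("utf-8")  # mimics smart_unicode(val)
        return val


class BinaryAwareBackend(TextOnlyBackend):
    """Skips the text conversion for keys that carry the binary prefix."""

    def get(self, key, default=None):
        if key.startswith(BINARY_PREFIX):
            val = self._store.get(key)
            return default if val is None else val
        return super().get(key, default)


cache = BinaryAwareBackend()
cache.set("bin:session", marshal.dumps({"user": 1}))
cache.set("greeting", b"hello")

session = marshal.loads(cache.get("bin:session"))  # raw bytes come back untouched
greeting = cache.get("greeting")                   # ordinary keys are still decoded
```

The catch is that every caller has to remember the prefix convention: a binary value stored under an unprefixed key will still hit the decode step.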
Re: James Bennett,
I know, but IMO a cache like memcached stores byte strings, not text strings.
Why can't Django set a flag in memcached instead, saying "this is a Unicode string", so it knows on get? This is what the popular memcached libraries do for other datatypes, such as integers and longs.
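To make the flag idea concrete, here is a small sketch. The memcached protocol does store a client-defined flags value next to each item; everything else here is invented for illustration: the FlaggedCache class, the flag values, and the dict standing in for the server. It is not how Django or any particular client library implements it.

```python
FLAG_TEXT = 1 << 0  # hypothetical flag values, not those of any real client
FLAG_INT = 1 << 1


class FlaggedCache:
    """Toy cache that records the original type in a per-item flags field."""

    def __init__(self):
        self._store = {}  # key -> (flags, raw bytes), standing in for memcached

    def set(self, key, value):
        if isinstance(value, str):
            self._store[key] = (FLAG_TEXT, value.encode("utf-8"))
        elif isinstance(value, int):
            self._store[key] = (FLAG_INT, str(int(value)).encode("ascii"))
        elif isinstance(value, bytes):
            self._store[key] = (0, value)  # no flag set: plain bytes
        else:
            raise TypeError("unsupported type: %r" % type(value))

    def get(self, key, default=None):
        if key not in self._store:
            return default
        flags, raw = self._store[key]
        if flags & FLAG_TEXT:
            return raw.decode("utf-8")  # decode only what was stored as text
        if flags & FLAG_INT:
            return int(raw.decode("ascii"))
        return raw  # unflagged values come back as the bytes that went in


flagged = FlaggedCache()
flagged.set("name", "Zoë")
flagged.set("count", 3)
flagged.set("blob", b"\xff\x00\x80")
```

With this scheme get() never has to guess: text comes back as text, and binary data comes back byte for byte.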
Also, a solution I once used in another context where a Unicode string was required (a JSON data format) was to simply decode the data as ISO-8859-1 (which never fails, since every byte value maps to a character), then encode it back to ISO-8859-1 on the receiving side.
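For what it's worth, the ISO-8859-1 round trip looks roughly like this in Python 3 syntax. The payload is just an example value; the only calls involved are marshal and the standard codecs.

```python
import marshal

payload = marshal.dumps([1, 2.5, "three"])  # arbitrary binary data

# Sending side: every byte 0x00-0xFF maps to exactly one code point,
# so this decode cannot raise.
as_text = payload.decode("iso-8859-1")

# Receiving side: encoding with the same codec gives the original bytes back.
restored = as_text.encode("iso-8859-1")
data = marshal.loads(restored)
```

One cost to keep in mind: if that text is later written out as UTF-8, every byte above 0x7F takes two bytes on the wire.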
