Base64 encoding in URLs

Some time ago I was tasked with making integer IDs in URLs be base64 encoded, so they look less sequential and are more compact (for larger ID numbers, obviously.)


Python has no utility to do this in the standard library, but obviously it does have the tools for it.


First off, you don't want to be encoding the integer's decimal representation. That's not going to save you a lot of bytes at all.


What you do want to encode is the most compact representation available -- and that's, shockingly, the binary representation of the integer: its raw bytes, which is effectively base 256. I'm not going to go into details on how these work (because honestly, I expect my readers to know this - if not, Google ASAP.)
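
To put rough numbers on that, here is a small sketch comparing the two approaches. The ID value is made up for illustration, and I'm using the standard urlsafe_b64encode just to measure output length.

```python
import struct
from base64 import urlsafe_b64encode

ident = 123456789  # hypothetical ID, chosen only for illustration

# Encoding the decimal digits: nine ASCII bytes in, twelve characters out.
as_text = urlsafe_b64encode(str(ident).encode("ascii"))

# Encoding the packed 32-bit value: four bytes in, eight characters out
# (two of which are "=" padding).
as_binary = urlsafe_b64encode(struct.pack("<I", ident))

print(len(as_text), len(as_binary))
```

The decimal route actually makes the ID longer than the nine digits you started with, while the binary route already comes out shorter, even before any trimming.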


So that gives us the first key part of it: the struct module. It's a great tool when working with binary data in Python, even if you aren't really decoding C structs.


The format I use is "<I", which means "a single unsigned 32-bit integer in little endian". In retrospect I should probably have used big endian, as it's the de facto endianness used in inter-machine communication - but seriously, it matters very little which you choose in this case. Be sure to specify one though, because the default is actually whatever your architecture uses.
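
If the two byte orders are hard to picture, this minimal sketch packs the same value both ways and prints the bytes as hex. The value 1337 (0x0539) is just an example.

```python
import struct

value = 1337  # 0x0539

little = struct.pack("<I", value)  # low-order byte first
big = struct.pack(">I", value)     # high-order byte first

print(little.hex())  # 39050000
print(big.hex())     # 00000539
```

Either way you get exactly four bytes; the only difference is which end the unused zero bytes land on.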


But, as you might know, 32-bit integers are... Well, 32-bit integers - always four bytes in length, even when the value stored only uses the least significant byte. That's wasteful: for values below 256, only a single byte is needed; for below 65536, two bytes; for below 16777216, three bytes.


To remedy this, I get the packed representation as a Python byte string and strip zero bytes from the right. That works because little endian puts the unused high-order bytes last.
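
Here is a quick sketch of the trimming step on its own. The helper name pack_short is made up for this example.

```python
import struct

def pack_short(value):
    # Hypothetical helper: pack as little endian, then drop the trailing
    # zero bytes (the unused high-order bytes).
    return struct.pack("<I", value).rstrip(b"\0")

for n in (255, 256, 65535, 65536, 16777216):
    print(n, len(pack_short(n)))
```

The lengths come out as 1, 2, 2, 3 and 4 bytes. One thing to be aware of: zero packs down to an empty byte string. With big endian the zero bytes would sit on the left, so you would strip from that side instead.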


Next up is encoding with base64 - but with URL-safe characters. The convention here seems to be to replace the unsafe + and / by - and _. The base64 module actually has two functions for this, urlsafe_b64{en,de}code.
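
To see the substitution in action, this sketch encodes three bytes picked specifically so that the standard alphabet produces both + and /. It also shows that passing "-_" as the alternative characters to b64encode gives the same result as the dedicated URL-safe function.

```python
from base64 import b64encode, urlsafe_b64encode

raw = b"\xfb\xff\xfe"  # bytes chosen to hit both special characters

standard = b64encode(raw)         # b'+//+'
url_safe = urlsafe_b64encode(raw) # b'-__-'
with_alt = b64encode(raw, b"-_")  # same as url_safe

print(standard, url_safe, with_alt)
```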


A funny quirk with base64 is that the output is padded to a multiple of four characters whenever the input length isn't a multiple of three bytes. This is done with equals signs, and they're actually unnecessary: you can infer how much padding is missing from the length of the string, adding = until it's a multiple of four. So to save bytes, I strip the padding when encoding and put it back when decoding.
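
Putting the padding back can be sketched like this. The helper name pad is an assumption for this example; the trick is that -len(text) % 4 is the number of characters missing up to the next multiple of four, and it is zero when nothing is missing.

```python
from base64 import urlsafe_b64decode

def pad(text):
    # Hypothetical helper: append "=" until the length is a multiple of four.
    return text + "=" * (-len(text) % 4)

print(pad("OQU"))                     # OQU=
print(pad("OQ"))                      # OQ==
print(pad("AAAA"))                    # AAAA (already a multiple of four)
print(urlsafe_b64decode(pad("OQU")))  # the two bytes 0x39 0x05
```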


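"""Base64 adapted for URLs.

>>> urlb64.encode(b"Hello kitty")
'SGVsbG8ga2l0dHk'
>>> urlb64.encode_int(1337)
'OQU'
>>> urlb64.decode('SGVsbG8ga2l0dHk')
b'Hello kitty'
>>> urlb64.decode_int('OQU')
1337
"""

import struct
from base64 import b64encode, b64decode

def encode(value):
    # URL-safe alphabet ("-" and "_" instead of "+" and "/"), padding stripped.
    return b64encode(value, b"-_").rstrip(b"=").decode("ascii")

def decode(value):
    # Restore the stripped padding: pad up to the next multiple of four.
    return b64decode(value + "=" * (-len(value) % 4), b"-_")

def encode_int(value):
    # Little endian puts the high-order bytes last, so the unused zero
    # bytes can be stripped from the right.
    return encode(struct.pack("<I", value).rstrip(b"\0"))

def decode_int(value):
    value = decode(value)
    # Put back the zero bytes that encode_int removed.
    value += b"\0" * (4 - len(value))
    return struct.unpack("<I", value)[0]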
