How do you Handle Job Failures, really?

Keith Rarick, the author of beanstalkd, wrote a blog post detailing how he thinks we should handle failures in a queue-based worker-type-of-thing.

Now, I'm not one to bitch, but I think the man has overlooked some things.

Let me knock this up a notch and explain quickly how beanstalkd itself handles failures. First, a worker reserves a job. Then, the worker deletes that job to indicate doneness. If the worker notices it can't perform that job, but that another worker might, it must release that job.

Lastly, if the job goes to shits, you bury the job. It then ends up in the so-called buried queue. This is a little confusing, because namespaces in beanstalkd are called tubes, but every tube can have buried jobs, so buriedness becomes some sort of binary meta namespace. Ah well, I digress.

Then, the neat part here is that beanstalkd lets you kick up jobs again. You tell it "yo, kick up five jobs" (really you can and must only specify the number of jobs, not sure why) and they reappear as regular jobs in the queue.
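To make the lifecycle above concrete, here is a toy in-memory model of a single tube. It is only a sketch: ToyTube and its method names are made up for illustration and are not a beanstalkd client API, and it ignores priorities, delays and timeouts entirely.

```python
from collections import deque


class ToyTube:
    """In-memory stand-in for one beanstalkd tube (illustration only)."""

    def __init__(self):
        self.ready = deque()   # jobs waiting for a worker
        self.reserved = {}     # job id -> body, currently held by a worker
        self.buried = deque()  # failed jobs parked until someone kicks them
        self._next_id = 1

    def put(self, body):
        job_id = self._next_id
        self._next_id += 1
        self.ready.append((job_id, body))
        return job_id

    def reserve(self):
        # Hand the oldest ready job to a worker.
        job_id, body = self.ready.popleft()
        self.reserved[job_id] = body
        return job_id, body

    def delete(self, job_id):
        # The job is done; forget it entirely.
        del self.reserved[job_id]

    def release(self, job_id):
        # This worker gives up, but another worker may succeed.
        self.ready.append((job_id, self.reserved.pop(job_id)))

    def bury(self, job_id):
        # The job failed; park it out of the workers' way.
        self.buried.append((job_id, self.reserved.pop(job_id)))

    def kick(self, bound):
        # Move up to `bound` buried jobs back to ready; only a count is given.
        kicked = 0
        while self.buried and kicked < bound:
            self.ready.append(self.buried.popleft())
            kicked += 1
        return kicked
```

Note that kick takes nothing but a count, which mirrors the limitation mentioned above: you can't pick which buried jobs come back.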

Obviously, not all jobs can be kicked up again. Well, they can, but it doesn't always make sense. You can work around that with deadlines and so on, but it's one thing to keep in mind - for example, taking a GPS sampling of a cellphone because of some user-triggered event becomes meaningless after 15 minutes, since the cellphone could've moved in that time.

You can't set an expiry on the actual bury command, but that's just sugar in a way. I'd advise people to always include some kind of "created" datum in their job descriptions.
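A minimal sketch of that "created" datum, assuming jobs are serialized as JSON. The make_job and is_stale helpers and the field names are hypothetical; any encoding that carries a timestamp would do.

```python
import json
import time


def make_job(payload, now=None):
    """Serialize a job body that carries its own creation time."""
    created = time.time() if now is None else now
    return json.dumps({"created": created, "payload": payload})


def is_stale(body, max_age_seconds, now=None):
    """Return True if the job is older than max_age_seconds."""
    now = time.time() if now is None else now
    job = json.loads(body)
    return now - job["created"] > max_age_seconds
```

A worker that reserves a kicked job could call is_stale(body, 15 * 60) first and simply delete the job when it returns True, which covers the GPS example above.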

A much more salient issue with the concept of burying is the recently-introduced binlog.

The binlog works like any other binlog: a new job comes in, log it; a job is reserved, log it; and so on. When a job is finally deleted from the queue, it no longer needs to have a log record, and when all the log records in a binlog partition are "unnecessary", the whole 10MB partition can be freed. This is how the binlog doesn't grow to over 9000 gigabytes.

Except that it does. The problem is that jobs are logged when they get buried, too. This might make sense at first glance, and it does: buried jobs, if any, should surely be persisted to disk.

Well, consider this scenario: you pump 10MB of data through beanstalkd every minute, and one job fails. Every other job is marked as done. Net result? You're filling the partition that holds the beanstalkd binlog at a rate of 10 MB/minute.
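Here is a toy model of the partition rule described above. It is not beanstalkd's actual code: the function name, the job counts, and the idea that one minute of traffic fills exactly one partition are all assumptions made for illustration. It suggests that a single buried job pins one partition, while one buried job per partition's worth of traffic keeps every partition alive, which gives the 10 MB/minute growth.

```python
PARTITION_MB = 10  # partition size mentioned in the post


def binlog_mb_held(minutes, jobs_per_minute, fails_every):
    """Return the MB of toy binlog still held after `minutes` of traffic.

    Every `fails_every`-th job is buried (0 means no job ever fails); all
    other jobs are deleted right away. One minute fills one partition.
    """
    partitions = []  # each entry: ids of jobs whose records are still needed
    job_id = 0
    for _ in range(minutes):
        live = set()
        for _ in range(jobs_per_minute):
            job_id += 1
            live.add(job_id)      # the put is logged
            if fails_every and job_id % fails_every == 0:
                continue          # buried: its record must be kept
            live.discard(job_id)  # deleted: its record is unnecessary
        partitions.append(live)
        # Free every partition that has no needed records left.
        partitions = [p for p in partitions if p]
    return len(partitions) * PARTITION_MB
```

With 1000 jobs a minute for an hour, no failures leaves nothing behind, one failure in total leaves a single partition, and one failure per minute leaves all sixty.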

For me, this has resulted in various horror scenarios on production machines because not only does it fill up the binlog on disk, it makes the OOM killer trigger happy. That's right: the binlog exists in memory as well, and it can grow to ridiculous proportions.

So, what's the fix for this obvious misfeature? Simple: store the buried jobs in a separate binlog. They should've been in the first place if you ask me, since they're a separate namespace anyway.

What did we wind up doing to remedy the situation? We downgraded to a version of beanstalkd that didn't have binlogs.

Python "fish" module announced

Some people have already highlighted my fish module, but I hadn't yet really finished it.

Short introduction: the module animates a fish (or any other ASCII art), and you want to see the screencast below.

fish on PyPI

SQLAlchemy and dbshell

I found Django's dbshell an invaluable command when I developed with Django, and have been longing for an equivalent.

Not finding anything satisfactory on Google for "SQLAlchemy dbshell", I wrote this piece of code for my Werkzeug-based projects' scripts:

import os

from sqlalchemy.engine.url import make_url

def url2pg_opts(url, db_opt=None, password_opt=None):
    """Map SQLAlchemy engine or URL *url* to database connection CLI options

    If *db_opt* is None, the database name is appended to the returned list as
    an argument. If it is a string, that option is used instead. If false, the
    database name is ignored.

    If *password_opt* is set, a potential password will be set using that
    option.
    """
    # Accept an engine as well as a URL: engines carry their URL as .url
    if hasattr(url, "url"):
        url = url.url
    url = make_url(url)
    connect_opts = []
    def set_opt(opt, val):
        # Only emit an option when the URL actually has a value for it
        if val:
            connect_opts.extend((opt, str(val)))
    # Note: url.host is not mapped, so the client falls back to its default host
    set_opt("-U", url.username)
    if password_opt:
        set_opt(password_opt, url.password)
    set_opt("-p", url.port)
    if db_opt:
        set_opt(db_opt, url.database)
    elif db_opt is None and url.database:
        connect_opts.append(url.database)
    return connect_opts

# make_app() is the application factory of the surrounding project.

def action_dbshell():
    bin = "psql"
    args = [bin] + url2pg_opts(make_app().db_engine)
    os.execvp(bin, args)

def action_dbdump(file=("f", "")):
    if not file:
        print("specify a dump file with -f or --file")
        return
    bin = "pg_dump"
    args = [bin, "-F", "c", "-C", "-O", "-x", "-f", str(file)]
    # Connection options; the database name goes last as a plain argument
    args.extend(url2pg_opts(make_app().db_engine))
    os.execvp(bin, args)

def action_dbrestore(file=("f", "")):
    if not file:
        print("specify a dump file with -f or --file")
        return
    bin = "pg_restore"
    args = [bin, "-F", "c", "-x"]
    args.extend(url2pg_opts(make_app().db_engine, db_opt="-d"))
    # The dump file to read is pg_restore's positional argument
    args.append(str(file))
    os.execvp(bin, args)

It isn't very beautiful code, and it'll only work for PostgreSQL, but it's a solid base on which to write something a little smarter that works for more than PostgreSQL.

I would do it if it wasn't for the fact that I want to be able to reuse the function for dumping and restoring.

Anyway, I hope this snippet helps somebody out someday.
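As a starting point for the "something a little smarter" mentioned above, one option is a lookup table from backend name to CLI flags. This is only a sketch under assumptions: url2cli and CLI_FLAGS are hypothetical names, it parses the URL with Python 3's urllib.parse rather than SQLAlchemy's make_url so that it stands alone, it covers just the psql and mysql clients, and it ignores passwords.

```python
from urllib.parse import urlsplit

# Per-backend CLI binary and its flags for user, host and port.
CLI_FLAGS = {
    "postgresql": ("psql", "-U", "-h", "-p"),
    "mysql": ("mysql", "-u", "-h", "-P"),
}


def url2cli(url):
    """Map a database URL to an argv list for that backend's shell."""
    parts = urlsplit(url)
    backend = parts.scheme.split("+")[0]  # drop a "+driver" suffix
    if backend not in CLI_FLAGS:
        raise ValueError("no CLI mapping for %r" % backend)
    binary, user_flag, host_flag, port_flag = CLI_FLAGS[backend]
    argv = [binary]
    for flag, value in ((user_flag, parts.username),
                        (host_flag, parts.hostname),
                        (port_flag, parts.port)):
        if value:
            argv.extend((flag, str(value)))
    # Both clients take the database name as a trailing plain argument.
    database = parts.path.lstrip("/")
    if database:
        argv.append(database)
    return argv
```

The resulting list can be handed to os.execvp(argv[0], argv) the same way the actions above do.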

What's so bad about explicit introspection?

Something that I've noticed programmers do, myself, colleagues and open-source coders included, is to structure programs so they are "testable".

This, I posit, is silly.

Phrased another way, what we do is we structure our programs to be implicitly introspectable. For example, I found myself writing something like this not long ago:

def search(request):
    results = search_get_data(request.args["q"])
    return render(results)

def search_get_data(query):
    return ask_yellow_pages(query)

This is an obvious case of what I'm talking about: implicit introspectability. What I did was to separate the actual logic and the presentation of the result.

While that might sound like a good idea in theory, in practice it results in silly and hard-to-follow abstractions (though not always, of course).

What I suggest is explicit introspectability: rather than structuring for testing, structure for clarity and allow testing through other explicit means. It could perhaps look something like this:

from foo.utils import local

def search(request):
    results = ask_yellow_pages(request.args["q"])
    local.emit("results", results)  # Where emit is a no-op outside tests.
    return render(results)

I suspect many people will shout "madness!" at this point, because of the "clutter". If so, then consider how the above view probably looks in actual production code:

def search(request):
    query = parse_query(request.args["q"])
    results = ask_yellow_pages(query)
    logger.debug("searched %r: %d results", query, len(results))
    return render(results)

Why is it permissible to "clutter" code with debug log calls and the likes, but not testing calls? Why do we dance around this problem with mock objects and overly convoluted abstraction layers?

I'm not saying this is a technique to be applied in every situation, but I am indeed saying that there are times and places where a simple system like this could really help make good, rigorous tests possible.
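For what it's worth, such an emit helper doesn't need much machinery. Here is one possible sketch; the Local class, its capture method and the stand-in search function are all hypothetical, and this version is neither thread-local nor tied to any framework.

```python
from contextlib import contextmanager


class Local:
    """Holds an optional list that collects emitted values during a test."""

    def __init__(self):
        self._captured = None  # None means "not testing": emit is a no-op

    def emit(self, name, value):
        if self._captured is not None:
            self._captured.append((name, value))

    @contextmanager
    def capture(self):
        # Record emits inside the with-block, then restore the previous state.
        previous, self._captured = self._captured, []
        try:
            yield self._captured
        finally:
            self._captured = previous


local = Local()


def search(query):
    # Stand-in for the view: a fake lookup instead of ask_yellow_pages.
    results = [word for word in ("plumber", "planner", "baker") if query in word]
    local.emit("results", results)
    return "%d results" % len(results)
```

A test would wrap the call in "with local.capture() as seen:" and then assert on seen, while production code pays only for one attribute check per emit.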

The best counter-argument I see is the so-called observer effect. Simply put, if the emit function in our case were to malfunction or otherwise alter the flow of logic, tests and actual code would have a disconnect.

There's a point to be had there, but is it really a practical problem? And haven't we already breached this barrier with mock objects, fixture data and whatnot?

I should note I'm a novice when it comes to testing practices - I've never bothered to try actual TDD or BDD or BDSM or what-have-you-driven-development in a real project.

So, to conclude, I'd love to hear other people's thoughts on this. Is this just a senseless taboo among programmers? Do people already do it?
