Making A Spam Filter

It started out as an itch. An itch I had to scratch. A publishing platform I use forwarded a good deal of comment spam to my personal e-mail, so I decided to do something about it. I shot off an e-mail.

Years passed. The spam comments continued trickling in to my mailbox. One day, almost as if by coincidence, I found myself working for this very publishing platform. It was finally time to scratch the itch.

Considerations were considered. Thoughts were thought. Plans were planned. The project that would come to be known as the "War on Spam" was afoot. The goals were simple enough:

  • Learning: it's useless to try to come up with rules that match all the spam. Instead, the system must learn from user guidance.
  • Assisted: while hand-written rules are out, there is still a point to giving hints about features (like a very long body, the number of links, time of day, entry age, etc.).
  • Debuggable: it has to answer the question "why is this comment spam?" Statistics are essential.
  • Resilient: spammers know about self-learning classifiers and try to cheat them. We mustn't be fooled.

We chose a so-called naive Bayes classifier. It uses Bayesian probabilities, a curious invention: it can answer the question "what is the probability of the sun shining given that the lawn is dry?" by looking at the probability that the lawn is dry; the probability that the sun is shining; and the probability that, given the sun is shining, the lawn is dry.
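Spelled out, that is just Bayes' rule: P(sun | dry) = P(dry | sun) · P(sun) / P(dry). A toy version in Python, with numbers made up purely for illustration:

p_sun = 0.6            # probability the sun is shining
p_dry = 0.5            # probability the lawn is dry
p_dry_given_sun = 0.7  # probability the lawn is dry, given sunshine

# Bayes' rule: P(sun | dry) = P(dry | sun) * P(sun) / P(dry)
p_sun_given_dry = p_dry_given_sun * p_sun / p_dry
print "P(sun | dry) = %.2f" % p_sun_given_dry  # prints 0.84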

I'm satisfied with the result. In the end I integrated SpamBayes, a popular open-source spam filtering solution.

There are of course kinks to work out; for one, there's a strong language bias in the training sample: almost all the spam is in English, and the ham in Swedish. I'm not too worried though.

Go ahead and try it out, make a comment!


Basic Git integration with Google App Engine

So we at sendapatch.se use Git a lot for our productions. It's great.

One vital detail you need to know in App Engine development is what’s up there, … y'know, up in the cloud.

The solution is as obvious as it is useful, and tedious enough to write that I'll share it. For your consideration: a script that updates App Engine from the current HEAD and bookmarks it in a branch named uploaded-<version>.

#!/bin/bash

# updates appengine from current HEAD and puts tree in uploaded-<version>

BASE=.
PYTHON=python2.7
APPENGINE=$BASE/google_appengine
APP=$BASE/app
VERSION="$(grep '^version: ' "$APP/app.yaml" | cut -c 10-)"
BRANCH="uploaded-$VERSION"

echo "Creating snapshot for $VERSION" >&2

TREE="$(git write-tree)"
COMMIT="$(git commit-tree $TREE -p HEAD <<COMMIT
Upload $VERSION to Google App Engine
COMMIT
)"
echo "commit $COMMIT"
git branch -f "$BRANCH" "$COMMIT" || exit

exec "$PYTHON" "$APPENGINE/appcfg.py" update app "$@"

pylibmc 1.2.3 release candidate

A new pylibmc is in the works, and I would love to get some eyes on it. The release candidate is available in the Google Group thread:

pylibmc 1.2.3 release thread


Ten Ways to Solve DNS Problems (or: the web is amazing)

So I wrote about my woes with DNS, bemoaning how our VPS provider GleSYS's DNS servers were not performing well enough. As usual with the web, I was blown away by the feedback; not only did I get over a dozen tips on what to do, GleSYS themselves chimed in to say they've fixed the problem.

Either that's a PR move on their part, or their technicians are very attentive. I'd like to think the latter. So without further ado, here are the ten ways in which to solve the case of the slow DNS look-up:

There are of course pros and cons to every single one of these options above, and I'll just quickly address some obvious questions.

First up, BIND. As much as I love ISC software, BIND feels a little too heavy-duty for a one-off thing like this.

djbdns is, I'm sure, quality software too; here the problem is deployment. For djbdns, "integrating with the OS" means "write your own rc replacement and shove it down people's throats". I refer of course to the bane that is daemontools. I gave it a shot with qmail, never ever again.

As for OpenDNS and Google Public DNS, I'd have to benchmark them over a week or so to know what to think of them. However, I'd much prefer to do business with people who can be held accountable for downtime.

By far the most interesting of them is Unbound, because of what it says on the box: a lightweight caching DNS server.

For now it looks like GleSYS have fixed things on their end; if this becomes a problem again, it might be better to change VPS provider.


GleSYS, Y U NO DNS?

... or why DNS lookups are a dangerous thing.

At my current employer we specialize in making campaigns, and this particular one is a Facebook Canvas type of thing, meaning we talk to the Facebook API.

It turns out though, one day after launching the campaign, that the local DNS resolver is sometimes unable to resolve the name facebook.com or graph.facebook.com in a timely fashion.

Looking into the matter I wrote a script for benchmarking the performance of socket.gethostbyaddr(), for your convenience as well as future reference:

#!/usr/bin/env python2.6

import sys, time, socket

ts = []
def test_host(h):
    # gethostbyaddr() resolves the name and then reverse-looks it up,
    # so this times the full DNS round-trip.
    t0 = time.time()
    try:
        socket.gethostbyaddr(h)
    except socket.error:
        print "resolve failed", repr(h)
    ts.append(time.time() - t0)

def avg(L): return sum(L)/float(len(L))
def med(L):
    L=list(sorted(L))
    if len(L)&1:
        return L[int(len(L)/2)]
    else:
        return (L[int(len(L)/2)-1]+L[int(len(L)/2)])/2.0

t0 = time.time()
test_host("facebook.com")
test_host("www.facebook.com")
test_host("graph.facebook.com")
test_host("api.facebook.com")
test_host("api-read.facebook.com")
test_host("api-video.facebook.com")
print "started %.2f, completed in %.2f" % (t0, time.time() - t0)
print "slowest %.4f, fastest %.4f" % (max(ts), min(ts))
print "median %.4f, average %.4f" % (med(ts), avg(ts))

We use GleSYS, a common provider in Sweden, for our VPS needs. Guess what their DNS performance looks like? Sometimes it takes up to 40 seconds for them to resolve facebook.com, when two seconds earlier they could answer the query in under 1 ms.

For now I just chucked the relevant hostnames into /etc/hosts, so: I could use a tip on a lightweight recursive DNS server! (Not BIND or djbdns.)


Force a Git branch to remain merged with master

At my workplace, we decided we should have two branches that are automatically rolled out to the development and production servers respectively, so I set out to ensure that developers first make sure the master branch works. I thought the end result would be useful to others, so here it is:

#!/usr/bin/env python2.6

# assert that updating refs/heads/dev or refs/heads/prod is not possible
# without first putting that commit into the ancestry of refs/heads/master.

import sys
import subprocess

master_ref = "refs/heads/master"
checked_refs = ("refs/heads/dev", "refs/heads/prod")

def git_merge_base(a, b):
    "find the earliest common ancestor of a, b"
    args = ["git", "merge-base", a, b]
    p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=sys.stderr)
    if p.wait() != 0:
        sys.stderr.write("git-merge-base exited %d\n" % p.returncode)
        sys.exit(128)
    return p.stdout.read().strip()

def check(cid):
    base = git_merge_base(cid, master_ref)
    if base != cid:
        sys.stderr.write("%s is not an ancestor of %s\n"
                         "%s diverges at %s\n"
                         % (cid, master_ref, cid, base))
        sys.exit(1)
    return True

if __name__ == "__main__":
    for line in sys.stdin:
        (old, new, refname) = line.strip().split(" ", 2)
        if refname in checked_refs:
            check(new)

Warts of Python: Ternary Expressions

The problem with Python's ternary operator is that it breaks up the two contrasting values on two opposite sides of the expression.

To show how I ended up working around it, let's look at an example that made me opt for a hack.

form.data["type"] = "business" if request.form.get("business") else "private"

And the hack,

form.data["type"] = ("private", "business")[bool(request.form.get("business"))]

The reason is obvious: the above consolidates the literal data to one side of the expression, making it easier to follow the code. (Apart from the occasional guy who doesn't know about the perverted powers of indexing by boolean values!)

The only time the Python ternary works is with very simple conditions; no, strike that: with very short conditions. Complexity has nothing to do with it.

business = bool(request.form.get("business"))
form.data["type"] = "business" if business else "private"
# or, equivalently:
get = request.form.get
form.data["type"] = "business" if get("business") else "private"

So I submit that the BDFL made a mistake. The ternary expression in Python sucks.


Google App Engine disregards Accept-Encoding

$ curl -H 'Accept-Encoding: gzip' -A 'Random/5.0' http://url |file -     
/dev/stdin: ASCII text, with very long lines, with no line terminators
$ curl -H 'Accept-Encoding: gzip' -A 'Random/5.0 gzip' http://url |file -
/dev/stdin: gzip compressed data, max compression

Google, in their infinite wisdom, have decided that the Accept-Encoding header is essentially useless.

Why, I don't know, but if you deploy on Google App Engine and rightfully expect their boundary proxies to compress your data for you, then you have to make sure your clients' User-Agent is either whitelisted or contains the magic four bytes gzip. (So if your client happens to be named "tagzipper" for example, Google will gzip for you.)
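For what it's worth, a client that plays along can look something like this (a minimal sketch with urllib2; the URL and User-Agent string are made up):

import gzip
import urllib2
from StringIO import StringIO

req = urllib2.Request("http://your-app.appspot.com/",  # made-up URL
                      headers={"Accept-Encoding": "gzip",
                               "User-Agent": "MyClient/1.0 gzip"})
resp = urllib2.urlopen(req)
body = resp.read()
if resp.info().get("Content-Encoding") == "gzip":
    # App Engine compressed the response; unpack it
    body = gzip.GzipFile(fileobj=StringIO(body)).read()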

In their defense this is documented behavior, but it doesn't make it any less quirky. As it turns out, we hit the billed quota we set because our clients aren't including "gzip" in their UA strings... D'oh.

Might be good to know!


Working with UNIX timestamps in Python

So, like many others, I thought I knew how to work with UNIX timestamps in Python. After all, I pride myself on being pretty well-versed in my one favorite language!

However, I, like many others, recently started using MongoDB. MongoDB really forces you to think about timezones, so I argued to myself that this is the equivalent of going from str to unicode: more work and little payoff, but a lot more correct, and in the end it saves multiple headaches.

So MongoDB, like many others, stores timestamps as Epoch offsets in 64-bit integers. It doesn't store any timezone information (note: timestamps without definite timezones are called naïve.)

Why should you care? What's the point, you ask? Where is the culmination? Well as it turns out, there are subtleties. Very subtle subtleties. Here's how we receive messages on a site I develop:

  1. Incoming HTTP/SMSC/whatever call with message data,
  2. timezone is inferred depending on transport (let's say it's CET),
  3. either use the current UTC time, or infer from message data the timestamp and convert to UTC,
  4. store this UTC timestamp.

The important part: this works fine. There is absolutely nothing wrong with this process. What fails, of course, is the part you'd least expect to fail - our pagination, which is done with Epoch offsets:

  1. Pageful of messages is requested,
  2. find the last message in the dataset,
  3. take whatever attribute we sorted by on this particular message,
  4. convert that timestamp to an integer offset from Epoch.

How on earth could this fail? It's such a dead-simple task. Let's pseudocode it!

# Why isn't this part of the standard library?
import time
def dt2unix(dt):
    return time.mktime(dt.timetuple()) + (dt.microsecond / 10.0 ** 6)

def next_page_offset(message_set, sort="created"):
    vals = sorted(message_set, key=lambda m: m[sort], reverse=True)
    # the last message on the page carries the lowest value of the sort key
    return dt2unix(vals[-1][sort])

Seems fairly straightforward, right? Take the lowest value for the key by which the entire dataset was sorted, and make use of the fact that the next page's max(m[sort]) is less than this page's min(m[sort]).

As it turns out, no - this will behave in very odd ways on some machines. Yes, some machines - which ones will become evident soon.

The error is best explained by converting the Epoch offsets into datetime objects again:

def unix2dt(offset):
    return datetime.datetime.fromtimestamp(float(offset))

We should expect unix2dt(dt2unix(dt)) == dt to hold, and it does! So what's the fuss about? Well...

>>> print dt
1970-01-01 00:00:00+00:00
>>> unix2dt(dt2unix(dt))
datetime.datetime(1970, 1, 1, 0, 0)
>>> dt2unix(dt)
-3600.0

Uh-oh. Shouldn't this give zero..? Well, no. The answer is that datetime.datetime.fromtimestamp and time.mktime both work not with naïve timestamps, but with local time.

So, the time functions in this case are compensating for the local timezone, which is CET! Hardly something you want them to be doing, I'd argue, for something like an offset (since it will lead to the same offset occurring multiple times during DST adjustment) - but hey.

We now know what is wrong, but how do we make them stop? One way is to just use UTC as the server timezone. Server owners should really do this anyway, but timezones exist and they exist for a reason.

This is perhaps the second issue with Python's standard library and timestamps: nowhere does it mention how to do this without adjustment, except in the online HTML documentation. Personally, I like using pydoc as a reference...

Converting from a UNIX timestamp into a non-compensated datetime is fairly easy: just use utcfromtimestamp and there we are (not sure why it says UTC though; this could really be any offset that shouldn't be compensated.)

Converting to a UNIX timestamp without compensating is a bit less obvious, even to seasoned Pythonistas because what you're looking for is actually calendar.timegm.

So we can now rewrite our two converter functions to look like this:

import datetime
from calendar import timegm

def dt2unix_utc(dt):
    return timegm(dt.timetuple()) + (dt.microsecond / 10.0 ** 6)

def unix2dt_utc(offset):
    return datetime.datetime.utcfromtimestamp(offset)

Again with the *_utc naming; it should probably be *_global or something. I elaborated on this by using the very excellent pytz library to set the UTC tzinfo.
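Roughly, that elaboration amounts to something like this (a sketch, not the exact code I ended up with; the function name is made up):

import datetime
import pytz

def unix2dt_utc_aware(offset):
    # like unix2dt_utc, but returns a timezone-aware datetime in UTC
    return datetime.datetime.utcfromtimestamp(offset).replace(tzinfo=pytz.utc)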

So, that will be all then! The complexity of this really speaks for why Python should incorporate a set of functions for doing these things. I'd suggest making it obnoxiously clear that a compensation can be made:

  • from_epoch(offset, compensate=False)
  • to_epoch(dt, compensate=False)

If you pass a naïve datetime to these and tell them not to compensate, I think an exception should be raised. The point in compensating, I presume, is to make UNIX timestamps comparable across timezones. Ah well.
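For illustration, the proposed pair could look something like this; just a sketch of the suggestion above (minus the naïve-datetime check), not anything that exists in the standard library:

import time
import datetime
from calendar import timegm

def to_epoch(dt, compensate=False):
    # compensate=True goes through local time (mktime); False avoids adjustment
    if compensate:
        seconds = time.mktime(dt.timetuple())
    else:
        seconds = timegm(dt.timetuple())
    return seconds + dt.microsecond / 10.0 ** 6

def from_epoch(offset, compensate=False):
    # compensate=True converts to local time; False avoids adjustment
    if compensate:
        return datetime.datetime.fromtimestamp(offset)
    return datetime.datetime.utcfromtimestamp(offset)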


Werkzeug & Reloading

I got tired of Werkzeug's runserver being so slow to reload (it polls once every second) combined with VMware's "shared folder" thing being so slow to refresh. It sometimes took up to 3-4 seconds to reload. Clearly not doable for a guy with over nine thousand words per minute!!1

So, like any responsible sendapatch.se member, I decided to take matters into my own hands and send a patch.

First, I modified Werkzeug's serving.py so it listens to SIGHUP for reloading itself.
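The patch itself isn't reproduced here, but the general idea is tiny. A sketch of the mechanism (not the actual Werkzeug change):

import os
import signal
import sys

def _reload(signum, frame):
    # re-exec the current process in place, picking up any changed code
    os.execv(sys.executable, [sys.executable] + sys.argv)

signal.signal(signal.SIGHUP, _reload)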

Then, I run this in a shell on my Mac OS X machine:

while until_changed **/*.py **/*.html **/*.jst 2>/dev/null; do
  python - <<PY
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.sendto("PING\n", ("VMWARE_IMAGE", 12345))
s.close()
PY
  date
done

until_changed is a utility I wrote that uses Apple's FSEvents APIs from Python. It exits as soon as any of the given arguments is changed.

Then, on my VMware Linux image, I run this:

socat udp-recv:12345 stdout | \
(while read; do
     pkill -HUP -f 'dev.py runserver'
 done)

That made it reload very quickly. In fact, so quickly that the filesystem on the Linux guest OS didn't have time to refresh before Werkzeug loaded the code again, so it ended up reloading twice, and was still just as slow: once for my immediate hook, once for the detected filesystem changes.

So that was clearly untenable. A non-solution if you will. Scratch the SIGHUP stuff. (I still have the changes if anybody is interested.)

Without NFS, rsync or something to that effect, this would have been an exercise in pointlessness. But fear not: setting up the OS X side to do rsync is trivial. See:

rsync -ave ssh ./ rsync://lericson@VMWARE_IMAGE:devel/src/

Slap this into a while loop similar to the ones above, and you've got yourself a simple and effective synchronized directory. (The reason I chose to use rsync's own protocol is that it has very little initial negotiation -> snappiness.)

Next up: Werkzeug's runserver needs pyinotify support, to reload near-instantaneously as the rsync completes.

Update: So I made inotify support for Werkzeug, and Ronacher is going to merge it any minute now. Totally awesome.


Google and Magnets

thepiratebay.org and other torrent trackers have been taken down in an international action against piracy. I'm not going to comment on the events themselves, but what struck me as funny is as follows:

The Pirate Bay, and others, have long been using so-called magnet links - essentially just the torrent info hash in a URL, often coupled with a human-readable display name.

Now, here is the conundrum: Google are known to index the Web, and have indexed thepiratebay.org and other listings of magnet links. Google are also known to provide their cached version of content they've indexed. At least according to a Swedish court, linking to the means by which intellectual property can be acquired is no different from distributing the IP directly.

As thepiratebay.org is down today, I did use Google to acquire content. So I think Google really, in every sense possible, are at least as enabling as TPB itself is -- of course, Google doesn't condone this behavior, so therein lies a difference.

At any rate, I wouldn't be surprised if Google started censoring out magnet links or otherwise preventing this kind of abuse of their cache. But for now it's surprisingly effective. :-)


simples3 reaches the magic 1.0 release

Good news, everyone! Today I finally got around to releasing the big one point oh for one of my libraries, simples3.

simples3 is a dead-simple interface to Amazon S3, the storage service. The API is Pythonic and aims to stick to your memory like iron filings to a strong electromagnet.

>>> bucket = simples3.S3Bucket("foo", access_key="abc", secret_key="def")
>>> bucket.put("myfile.txt", "Hello world, file!")
>>> bucket.get("myfile.txt").read()
'Hello world, file!'

Then there's the mapping-like interface, which is perhaps the easiest to remember:

>>> bucket["myfile.txt"] = "Hello world, file!"
>>> bucket["myfile.txt"].read()
'Hello world, file!'
>>> "myfile.txt" in bucket
True
>>> del bucket["myfile.txt"]

For a more extensive usage example, see the documentation. The project is available on Github, and simples3 1.0 is available on PyPI.


Reading a network Seismometer from Python

So as I've written earlier, I've been involved in parts of the making of an iPhone app called Seismometer, which is quite simply what its name says.

Now, the interesting thing about this Seismometer app is the network protocol: you can use your iPhone as an input device to your computer or other handheld.

I've made a small Python library to help interpret that protocol in a simple manner.


(Screencast on youtube.com)

As you can see in the screencast above, the accelerometer lets you know which side is facing up or down in all three axes of the device.

Using the library is simple, just a matter of:

import rattler

for meas in rattler.measurements():
    print meas

This could be used to make a Wiimote or something, I don't know. That's where I want your creativity.

I have a couple of gift codes for the iPhone app; leave a comment here and I'll see what I can do.

The protocol is open, see http://yellowagents.com/seismo_protocol.html for more details on that.

The Python code is of course open-source too, and is available on Github.


How do you Handle Job Failures, really?

Keith Rarick, the author of beanstalkd, wrote a blog post detailing how he thinks we should handle failures in a queue-based worker-type-of-thing.

Now, I'm not one to bitch, but I think the man has overlooked some things.

Let me knock this up a notch and explain quickly how beanstalkd itself handles failures. First, a worker reserves a job. Then, a worker deletes that job to indicate doneness. If the worker noticed it couldn't perform that job, but that another worker might, it must release that job.

Lastly, if the job goes to shits, you bury the job. It then ends up in the so-called buried queue. This is a little confusing, because namespaces in beanstalkd are called tubes, but every tube can have buried jobs, so buriedness becomes some sort of binary meta namespace. Ah well, I digress.

Then, the neat part here is that beanstalkd lets you kick up jobs again. You tell it "yo, kick up five jobs" (really you can and must only specify the number of jobs, not sure why) and they reappear as regular jobs in the queue.
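To make the flow concrete, here's roughly what that lifecycle looks like from Python with the beanstalkc client (my choice of client here, purely for illustration; handle and TemporaryFailure are stand-ins):

import beanstalkc

class TemporaryFailure(Exception):
    """Stand-in: raised when another worker might still manage the job."""

def handle(body):
    """Stand-in for the actual work."""

conn = beanstalkc.Connection(host="localhost", port=11300)

job = conn.reserve()      # take a job off the queue
try:
    handle(job.body)
except TemporaryFailure:
    job.release()         # put it back for someone else
except Exception:
    job.bury()            # park it in the buried queue for inspection
else:
    job.delete()          # done

conn.kick(5)              # later: "yo, kick up five jobs"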

Obviously, not all jobs can be kicked up again. Well, they can, but it doesn't always make sense. You can work around that with deadlines and so on, but it's one thing to keep in mind - for example, taking a GPS sampling of a cellphone because of some user-triggered event becomes meaningless after 15 minutes, since the cellphone could've moved in that time.

You can't set an expiry on the actual bury command, but that's just sugar in a way. I'd advise people to always include some kind of "created" datum in their job descriptions.

A much more salient issue with the concept of burying is the recently-introduced binlog.

The binlog works like any other binlog: new jobs come in, log it; job reserved, log it; and so on. When a job is finally deleted from the queue, it no longer needs to have a log record, and when all the log records in a binlog partition are "unnecessary", the whole 10 MB partition can be freed. This is how the binlog doesn't grow to over 9000 gigabytes.

Except that it does. The problem is that jobs are logged when they get buried, too. This might make sense at first glance, and it does: buried jobs, if any, should surely be persisted to disk.

Well, consider this scenario: you pump 10 MB of data through beanstalkd every minute, and one job fails. Every other job is marked as done. Net result? You're filling the partition that holds the beanstalkd binlog at a rate of 10 MB/minute.

For me, this has resulted in various horror scenarios on production machines, because not only does it fill up the binlog on disk, it makes the OOM killer trigger-happy. That's right: the binlog exists in memory as well, and it can grow to ridiculous proportions.

So, what's the fix for this obvious misfeature? Simple: store the buried jobs in a separate binlog. They should have been separate in the first place if you ask me, since they're a separate namespace anyway.

What did we wind up doing to remedy the situation? We downgraded to a version of beanstalkd that didn't have binlogs.


Python "fish" module announced

Some people have already highlighted my fish module, but I hadn't really finished it yet.

Short introduction: the module animates a fish (or any other ASCII art), and you'll want to see the screencast below.
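In its simplest form, usage is something like this (from memory, so double-check against the PyPI page):

import time
import fish

for i in range(100):   # stand-in for whatever loop your program already runs
    time.sleep(0.1)
    fish.animate()     # advance the fish one step across the terminal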

fish on PyPI


SQLAlchemy and dbshell

I found Django's manage.py dbshell an invaluable command when I developed with Django, and have been longing for an equivalent.

Not finding anything satisfactory on Google for "SQLAlchemy dbshell", I wrote this piece of code for my Werkzeug-based projects' manage.py scripts:

import os

from sqlalchemy.engine.url import make_url

def url2pg_opts(url, db_opt=None, password_opt=None):
    """Map SQLAlchemy engine or URL *url* to database connection CLI options

    If *db_opt* is None, the database name is appended to the returned list as
    an argument. If it is a string, that option is used instead. If false, the
    database name is ignored.

    If *password_opt* is set, a potential password will be set using that
    option.
    """
    if hasattr(url, "url"):
        url = url.url
    url = make_url(url)
    connect_opts = []
    def set_opt(opt, val):
        if val:
            connect_opts.extend((opt, str(val)))
    set_opt("-U", url.username)
    if password_opt:
        set_opt(password_opt, url.password)
    set_opt("-h", url.host)
    set_opt("-p", url.port)
    if db_opt:
        set_opt(db_opt, url.database)
    elif db_opt is None and url.database:
        connect_opts.append(url.database)
    return connect_opts

def action_dbshell():
    bin = "psql"
    args = [bin] + url2pg_opts(make_app().db_engine)
    os.execvp(bin, args)

def action_dbdump(file=("f", "")):
    if not file:
        print "specify a dump file with -f or --file"
        return
    bin = "pg_dump"
    args = [bin, "-F", "c", "-C", "-O", "-x", "-f", str(file)]
    args.extend(url2pg_opts(make_app().db_engine))
    os.execvp(bin, args)

def action_dbrestore(file=("f", "")):
    if not file:
        print "specify a dump file with -f or --file"
        return
    bin = "pg_restore"
    args = [bin, "-F", "c", "-x"]
    args.extend(url2pg_opts(make_app().db_engine, db_opt="-d"))
    args.append(str(file))
    os.execvp(bin, args)
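For a made-up URL, the helper produces argument lists along these lines:

>>> url2pg_opts("postgresql://user:secret@localhost:5432/mydb")
['-U', 'user', '-h', 'localhost', '-p', '5432', 'mydb']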

It isn't very beautiful code, and it'll only work for PostgreSQL, but it's a solid base on which to write something a little smarter that works for more than PostgreSQL.

I would do it if it wasn't for the fact that I want to be able to reuse the function for dumping and restoring.

Anyway, hope this snippet might come to help somebody out someday.


What's so bad about explicit introspection?

Something that I've noticed programmers do, myself, colleagues and open-source coders included, is to structure programs so they are "testable".

This, I posit, is silly.

Phrased another way, what we do is we structure our programs to be implicitly introspectable. For example, I found myself writing something like this not long ago:

def search(request):
    results = search_get_data(request.args["q"])
    return render(results)

def search_get_data(query):
    return ask_yellow_pages(query)

This is an obvious case of what I'm talking about: implicit introspectability. What I did was to separate the actual logic and the presentation of the result.

While that might sound like a good idea in theory, in practice it results in silly and hard-to-follow abstractions (though not always, of course).

What I suggest is explicit introspectability: rather than structuring for testing, structure for clarity and allow testing through other explicit means. It could perhaps look something like this:

from foo.utils import local

def search(request):
    results = ask_yellow_pages(request.args["q"])
    local.emit("results", results)  # Where emit is a no-op outside tests.
    return render(results)
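What local.emit actually is was left open above; one boring way to get a no-op-by-default hook is a tiny listener registry (the class below is made up for illustration):

class Local(object):
    """A registry of listeners; emit() is a no-op unless a test listens."""

    def __init__(self):
        self.listeners = {}

    def listen(self, name, listener):
        self.listeners.setdefault(name, []).append(listener)

    def emit(self, name, value):
        for listener in self.listeners.get(name, ()):
            listener(value)

local = Local()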

I suspect many people will shout "madness!" at this point, because of the "clutter". If so, then consider how the above view probably looks in actual production code:

def search(request):
    query = parse_query(request.args["q"])
    results = ask_yellow_pages(query)
    logger.debug("searched %r: %d results", query, len(results))
    return render(results)

Why is it permissible to "clutter" code with debug log calls and the likes, but not testing calls? Why do we dance around this problem with mock objects and overly convoluted abstraction layers?

I'm not saying this is a technique to be applied in every situation, but I am indeed saying that there are times and places where a simple system like this could really help make good, rigorous tests possible.

The best counter-argument I see is the so-called observer effect. Simply put, if the emit function in our case were to malfunction or otherwise alter the flow of logic, tests and actual code would have a disconnect.

There's a point to be had there, but is it really a practical problem? And haven't we already breached this barrier with mock objects, fixture data and whatnot?

I should note I'm a novice when it comes to testing practices - I've never bothered to try actual TDD or BDD or BDSM or what-have-you-driven-development in a real project.

So, to conclude, I'd love to hear other people's thoughts on this. Is this just a senseless taboo among programmers? Do people already do it?


Reconstructing a module from a Python bytecode cache file

So, sometimes you lose a configuration file, and it happens to have a .pyc version that still exists.

>>> import dis, marshal
>>> f = open("imgapi/conf.pyc", "rb")
>>> f.seek(8)  # skip the magic number and the timestamp
>>> o = marshal.load(f)
>>> o
<code object <module> at 0x7f84ff7f8c60, file ".../imgapi/conf.py", line 1>
>>> dis.dis(o)
  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (('*',))
              6 IMPORT_NAME              0 (imgapi.conf_defaults)
              9 IMPORT_STAR

  2          10 LOAD_CONST               2 ('gevent')
             13 STORE_NAME               1 (url_fetcher)

  3          16 LOAD_CONST               6 (('.yo-toad.se:8000', '.yo-dev.se'))
             19 STORE_NAME               2 (rewrite_host)
             22 LOAD_CONST               5 (None)
             25 RETURN_VALUE
>>>

And from this, one can see what this (rather short) configuration file used to say:

from imgapi.conf_defaults import *

url_fetcher = "gevent"
rewrite_host = (".yo-toad.se:8000", ".yo-dev.se")
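The steps above wrap up into a small helper for next time (Python 2 only: the 8-byte header it skips is the magic number plus the mtime):

import dis
import marshal

def disassemble_pyc(path):
    # print the disassembly of a .pyc's top-level code object
    with open(path, "rb") as f:
        f.seek(8)              # skip magic number + timestamp
        code = marshal.load(f)
    dis.dis(code)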

Now for some delicious food.


simples3 1.0-alpha

In keeping with my new year's resolution to release one-dot-ohs of my two mature projects, I now give you the 1.0 alpha release of simples3.

In case you missed it, simples3 is a simple S3 interface that has no dependencies other than Python, and has a daughter project named gaes3, which makes it work on Google App Engine (although some people have noted that simples3 should work on its own just fine.)

Without further ado:

simples3 1.0-alpha


Return Statements and Parentheses

This was originally posted on my old blog on the 24th of February, 2008

For a few minutes I've been trying to come up with an argument as to why you shouldn't use parentheses around your return statements in Python, and I've got it. Consider:

>>> def myfunc():
...     return(1)
... 
>>> myfunc()
1

Above is the classic example of people treating the return statement as a function. This could go bad; consider:

>>> def myfunc():
...     return()
... 
>>> myfunc()
()

An empty parenthesis pair is the literal for the empty tuple in Python, so return() returns () rather than None, which I find is a good reason beyond aesthetics.

