Making A Spam Filter
It started out as an itch. An itch I had to scratch. A publishing platform I use forwarded a good deal of comment spam to my personal e-mail, so I decided to do something about it. I shot off an e-mail.
Years passed. The spam comments continued trickling into my mailbox. One day, almost as if by coincidence, I found myself working for this very publishing platform. It was finally time to scratch the itch.
Considerations were considered. Thoughts were thought. Plans were planned. The project that would come to be known as the "War on Spam" was afoot. The goals were simple enough:
- Learning: it's useless to try to come up with rules to match all the spam. Instead the system must learn from user guidance.
- Assisted: while rule writing is out, there is still a point to giving the system hints about features (such as a very long body, the number of links, the time of day, the entry's age, etc.).
- Debuggable: it has to answer the question "why is this comment spam?" Statistics are essential.
- Resilient: spammers know about self-learning classifiers and try to cheat them. We mustn't be fooled.
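To make the "Assisted" goal a bit more concrete, here is a minimal sketch of how such hints could be turned into pseudo-tokens that a learning classifier sees alongside the ordinary words. The function name `hint_tokens`, the `hint:` prefix, and every threshold and bucket size are assumptions made up for illustration, not what the real system used.

```python
import re
from datetime import datetime

# Assumption: anything that looks like an http(s) URL counts as a link.
LINK_RE = re.compile(r"https?://", re.IGNORECASE)


def hint_tokens(body, posted_at, entry_published_at):
    """Turn coarse properties of a comment into pseudo-tokens."""
    tokens = []
    # Very long bodies get a single marker token (2000 is an arbitrary cut-off).
    if len(body) > 2000:
        tokens.append("hint:long-body")
    # Number of links, capped so that 7 and 70 links look the same.
    links = len(LINK_RE.findall(body))
    tokens.append("hint:links:%d" % min(links, 5))
    # Time of day, in six-hour buckets (0 to 3).
    tokens.append("hint:hour-bucket:%d" % (posted_at.hour // 6))
    # Age of the entry being commented on, in whole days, clamped to 0..30.
    age_days = (posted_at - entry_published_at).days
    tokens.append("hint:entry-age-days:%d" % min(max(age_days, 0), 30))
    return tokens


if __name__ == "__main__":
    print(hint_tokens("buy http://a.example now",
                      datetime(2010, 1, 10, 13, 0),
                      datetime(2010, 1, 1, 12, 0)))
```

Bucketing keeps the number of distinct hint tokens small, so the classifier gets enough examples of each one to learn from.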
We chose a so-called naive Bayes classifier. It uses Bayesian probabilities, a curious invention. It can answer the question "what is the probability that the sun is shining, given that the lawn is dry?" by looking at three things: the probability that the lawn is dry; the probability that the sun is shining; and the probability that the lawn is dry, given that the sun is shining.
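In symbols, the sun-and-lawn question is Bayes' rule: P(sun | dry) = P(dry | sun) × P(sun) / P(dry). The sketch below first plugs made-up numbers into that rule, then shows a textbook multinomial naive Bayes over comment tokens with add-one smoothing. It is an illustration only: the class and method names are hypothetical, and it is not a description of how SpamBayes works internally. The `explain` method is one way to approach the "why is this comment spam?" question.

```python
import math
from collections import Counter


def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) computed from P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b


class NaiveBayes(object):
    """Textbook multinomial naive Bayes for two labels, 'spam' and 'ham'."""

    def __init__(self):
        self.docs = Counter()  # number of training comments per label
        self.tokens = {"spam": Counter(), "ham": Counter()}

    def train(self, tokens, label):
        self.docs[label] += 1
        self.tokens[label].update(tokens)

    def _likelihood(self, token, label):
        # Add-one smoothing; the extra +1 reserves a slot for unseen tokens.
        vocabulary = set(self.tokens["spam"]) | set(self.tokens["ham"])
        total = sum(self.tokens[label].values())
        return (self.tokens[label][token] + 1.0) / (total + len(vocabulary) + 1.0)

    def _log_odds(self, token):
        return math.log(self._likelihood(token, "spam") /
                        self._likelihood(token, "ham"))

    def explain(self, tokens):
        """Return (token, log-odds) pairs, strongest evidence first.

        Positive scores point towards spam, negative ones towards ham.
        """
        pairs = [(t, self._log_odds(t)) for t in tokens]
        return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)

    def spam_probability(self, tokens):
        prior = math.log((self.docs["spam"] + 1.0) / (self.docs["ham"] + 1.0))
        log_odds = prior + sum(score for _, score in self.explain(tokens))
        # Clamp so that math.exp cannot overflow on very long comments.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Made-up numbers: P(dry | sun) = 0.9, P(sun) = 0.5, P(dry) = 0.6.
    print(bayes(0.9, 0.5, 0.6))

    nb = NaiveBayes()
    nb.train(["cheap", "pills", "cheap"], "spam")
    nb.train(["tack", "bra", "inlagg"], "ham")
    print(nb.spam_probability(["cheap", "pills"]))
    print(nb.explain(["cheap", "pills"]))
```

The "naive" part is the assumption that tokens are independent of each other given the label, which is what lets the per-token scores simply be added up.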
I'm satisfied with the result. In the end I integrated SpamBayes, a popular open-source spam filtering solution.
There are of course kinks to work out; for one, there's a strong language bias in the training sample. Almost all the spam is in English, and almost all the ham is in Swedish. I'm not too worried though.
Go ahead and try it out, make a comment!
Basic Git integration with Google App Engine
So we at sendapatch.se use Git a lot for our productions. It's great.
One vital detail you need to know in App Engine development is what’s actually deployed up there… y'know, up in the cloud.
The solution is as obvious as it is useful, and tedious to write. For your consideration: a script that updates App Engine from the current HEAD and bookmarks it in a branch named uploaded-<version>.
#!/bin/bash
# updates appengine from current HEAD and puts tree in uploaded-<version>
BASE=.
PYTHON=python2.7
APPENGINE=$BASE/google_appengine
APP=$BASE/app

# Everything after "version: " (nine characters) in app.yaml.
VERSION="$(grep '^version: ' "$APP/app.yaml" | cut -c 10-)"
BRANCH="uploaded-$VERSION"

echo "Creating snapshot for $VERSION" >&2
# Write the index out as a tree and wrap it in a commit whose parent is HEAD.
TREE="$(git write-tree)"
COMMIT="$(git commit-tree "$TREE" -p HEAD <<COMMIT
Upload $VERSION to Google App Engine
COMMIT
)"
echo "commit $COMMIT"

# Point the bookmark branch at the snapshot; bail out if that fails.
git branch -f "$BRANCH" "$COMMIT" || exit
exec "$PYTHON" "$APPENGINE/appcfg.py" update "$APP" "$@"
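For anyone who would rather drive this from Python, the version-to-branch step of the script can be sketched as below. The function name `uploaded_branch` and the sample app.yaml contents are made up for illustration. Unlike the `grep | cut` pipeline, this sketch only looks at the first `version: ` line and strips surrounding whitespace.

```python
import re


def uploaded_branch(app_yaml_text):
    """Return 'uploaded-<version>' for the version declared in app.yaml text."""
    match = re.search(r"^version: (.*)$", app_yaml_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'version: ' line found")
    return "uploaded-" + match.group(1).strip()


if __name__ == "__main__":
    # Hypothetical app.yaml contents.
    sample = "application: myapp\nversion: 3\nruntime: python\n"
    print(uploaded_branch(sample))
```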
pylibmc 1.2.3 release candidate
A new pylibmc is in the works, and I would love it if this could get some attention. The release is available on the Google Group thread:
Ten Ways to Solve DNS Problems (or: the web is amazing)
So I wrote about my woes with DNS, bemoaning how our VPS provider GleSYS's DNS servers were not performing well enough. As usual with the web, I was blown away by the feedback; not only did I get over a dozen tips on what to do, GleSYS themselves chimed in to say they've fixed the problem.
Either that's a PR move on their part, or their technicians are very attentive. I'd like to think the latter. So without further ado, here are the ten ways in which to solve the case of the slow DNS look-up:
- OpenDNS
- Google Public DNS
- BIND
- djbdns
- Unbound
- Deadwood/MaraDNS
- PowerDNS
- dnsmasq
- Twisted Names
- blog about it
- pray for rain
There are of course pros and cons to every one of the options above, and I'll just quickly address some obvious questions.
First up, BIND. As much as I love ISC software, BIND feels a little too heavy-duty for a one-off thing like this.
djbdns is, I'm sure, quality software too; here the problem is deployment. For djbdns, "integrating with the OS" means "write your own rc replacement and shove it down people's throats". I refer of course to the bane that is daemontools. I gave it a shot with qmail; never ever again.
As for OpenDNS and Google Public DNS, I'd have to benchmark them over a week or so to know what to think of them. However, I'd much prefer to do business with people who will be accountable for downtime.
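A benchmark like that could start from something as small as the sketch below, which times look-ups through whatever resolver the machine is currently configured to use. The function names `time_lookups` and `summarize` are made up for illustration. Note the limits: `socket.gethostbyname` goes through the system resolver, so comparing OpenDNS with Google Public DNS this way would mean switching the system configuration between runs, and repeated look-ups of the same name may be answered from a cache along the way.

```python
import socket
import time


def time_lookups(hostnames, resolve=socket.gethostbyname, repeat=3,
                 clock=time.time):
    """Return {hostname: [seconds, ...]} for `repeat` look-ups of each name."""
    timings = {}
    for name in hostnames:
        samples = []
        for _ in range(repeat):
            started = clock()
            try:
                resolve(name)
            except socket.error:
                # A failed look-up still costs time; record it like any other.
                pass
            samples.append(clock() - started)
        timings[name] = samples
    return timings


def summarize(samples):
    """Return (middle sample, worst sample) of a list of durations."""
    ordered = sorted(samples)
    return ordered[len(ordered) // 2], ordered[-1]


if __name__ == "__main__":
    for name, samples in time_lookups(["example.com"]).items():
        middle, worst = summarize(samples)
        print("%s: middle %.3fs, worst %.3fs" % (name, middle, worst))
```

Run from cron over a week and logged to a file, numbers like these would at least show whether the slowness is constant or comes in bursts.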
By far the most interesting of them is Unbound, because of what it says on the box: a lightweight caching DNS server.
For now it looks like GleSYS have fixed things on their end; if this becomes a problem again, it might be better to change VPS provider.