Making A Spam Filter

It started out as an itch. An itch I had to scratch. A publishing platform I use forwarded a good deal of comment spam to my personal e-mail, so I decided to do something about it. I shot off an e-mail.

Years passed. The spam comments continued trickling into my mailbox. One day, almost as if by coincidence, I found myself working for this very publishing platform. It was finally time to scratch the itch.

Considerations were considered. Thoughts were thought. Plans were planned. The project that would come to be known as the "War on Spam" was afoot. The goals were simple enough:

  • Learning: it's useless to try to come up with rules to match all the spam. Instead the system must learn from user guidance.
  • Assisted: while rule writing is out, there is still a point to giving hints about features (such as a very long body, number of links, time of day, entry age, etc).
  • Debuggable: it has to answer the question "why is this comment spam?" Statistics are essential.
  • Resilient: spammers know about self-learning classifiers and try to cheat them. We mustn't be fooled.

We chose a so-called naive Bayes classifier. It uses Bayesian probabilities, a curious invention. It can answer the question "what is the probability of the sun shining given that the lawn is dry?" by looking at three things: the probability that the lawn is dry; the probability that the sun is shining; and the probability that, given the sun is shining, the lawn is dry.
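To make that concrete, here is a minimal Python sketch. It is an illustration only, not the code we ran and not how SpamBayes works internally. The probabilities for the sun and the lawn are made up, and `bayes`, `TinyNaiveBayes` and its methods are hypothetical names. The `explain` method is a nod to the "why is this comment spam?" goal.

```python
import math
from collections import Counter


def bayes(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Made-up numbers: P(dry | sun) = 0.9, P(sun) = 0.5, P(dry) = 0.6.
p_sun_given_dry = bayes(0.9, 0.5, 0.6)  # about 0.75


class TinyNaiveBayes:
    """Toy word-count classifier (hypothetical, for illustration).

    Needs at least one spam and one ham example before scoring.
    """

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = Counter()

    def train(self, text, label):
        self.docs[label] += 1
        self.words[label].update(text.lower().split())

    def _log_likelihood(self, word, label):
        # Laplace smoothing so unseen words never get probability zero.
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total = sum(self.words[label].values())
        return math.log((self.words[label][word] + 1) / (total + vocab))

    def _evidence(self, word):
        # Log-odds contributed by one word; positive means "more spammy".
        return self._log_likelihood(word, "spam") - self._log_likelihood(word, "ham")

    def explain(self, text):
        """Per-word evidence, to answer 'why is this comment spam?'."""
        return {w: self._evidence(w) for w in set(text.lower().split())}

    def spam_probability(self, text):
        # Prior odds from the training counts, then add up word evidence.
        log_odds = math.log(self.docs["spam"] / self.docs["ham"])
        log_odds += sum(self._evidence(w) for w in text.lower().split())
        return 1 / (1 + math.exp(-log_odds))
```

The "naive" part is the sum in `spam_probability`: every word is treated as independent evidence, which is wrong but works surprisingly well. Hints like link count or body length could be fed in as extra pseudo-words.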

I'm satisfied with the result. In the end I integrated SpamBayes, a popular open-source spam filtering solution.

There are of course kinks to work out. For one, there's a strong language bias in the training sample: almost all the spam is in English, and almost all the ham is in Swedish. I'm not too worried, though.

Go ahead and try it out, make a comment!


Got Things Done

People seem so stressed out today, yet we live in the era of automation where you can actually buy a pair of shoes on your morning commute while taking care of business. Odd, isn't it?

Think about three things you did prior to reading this. How did you end them? Chances are you didn't end them properly: "alright, that's 90% of the job - I'll deal with the rest tomorrow." In the words of Ronan Keating, what if tomorrow never comes?

That last 10% is often harder than the 90% you think you already did. Why that is, and the psychology behind it, is an interesting question in itself, but let's not get caught up in details.

Get things done and be done with them. Learn to love the word closure, "the act or process of closing something" or "a sense of resolution or conclusion at the end of an artistic work."

As they say, when one door closes another one opens. This is true of your employment, marital status and other big things in life but also the small ones.

When your agenda is without end you'll be prone to cut corners in a meaningless struggle against an enemy you'll never beat.

How will you ever have a productive day if what you have on your desk is always leftovers from yesterday? In the words of Reinhold Niebuhr,

God, give us grace to accept with serenity
the things that cannot be changed, courage
to change the things which should be changed,
and the wisdom to distinguish the one from the other.

Close things. Be done with them. Say goodbye.


Code editor?

It's interesting that we think of code as this human-to-machine language.

Sometimes we remind ourselves that yes, somebody is going to have to read this in a while.

Think of that. What you write is... text, like a book!

What people do in book publishing is relevant to programming. We should have editors too, somebody to go through our code and say "nah, this needs work."

I've always been of the mind that programming is inherently a two-man job. This is perhaps the reason.


Vacation

The company is on vacation until the second week of August.

Take care!


Basic Git integration with Google App Engine

So we at sendapatch.se use Git a lot for our productions. It's great.

One vital detail you need to know in App Engine development is what’s up there, … y'know, up in the cloud.

The solution is as obvious as it is useful, and tedious to write. For your consideration: a script that updates App Engine from the current HEAD and bookmarks it in a branch named uploaded-<version>.

#!/bin/bash

# updates appengine from current HEAD and puts tree in uploaded-<version>

BASE=.
PYTHON=python2.7
APPENGINE=$BASE/google_appengine
APP=$BASE/app
VERSION="$(grep '^version: ' "$APP/app.yaml" | cut -c 10-)"
BRANCH="uploaded-$VERSION"

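echo "Creating snapshot for $VERSION" >&2

# write-tree snapshots the index, so stage or commit everything first
TREE="$(git write-tree)" || exit
# commit that tree with HEAD as its parent, without moving HEAD
COMMIT="$(git commit-tree "$TREE" -p HEAD <<COMMIT
Upload $VERSION to Google App Engine
COMMIT
)" || exit
echo "commit $COMMIT"
# (re)point the bookmark branch at the snapshot
git branch -f "$BRANCH" "$COMMIT" || exit

exec "$PYTHON" "$APPENGINE/appcfg.py" update "$APP" "$@"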

pylibmc 1.2.3

pylibmc 1.2.3 on PyPI

An incremental release has been born!

  • greater test coverage
  • bug fixes and clean-up
  • performance enhancements
  • portability improvements

Why upgrade? Because you want to.

pylibmc 1.3.0 is around the corner, with new features! Come hang out in #sendapatch and say what you would want in the new version.


Your Mountain

We all have to climb our own mountains to get the things we want, so it must be instructive to look at people who actually do climb.

  • Stop and rest
  • Get good shoes
  • Use safety rope
  • It's lonely at the top


The Value in Software

This world of producers we live in is horrible. “PRODUCE!” says society, “it’s the only way to make money.” Wait. Isn’t money just quantifying value? What do I need money for if value is where it’s at?

Value is meaning to the world; it’s information encoded in a common language. Creating it is just a matter of translating some valuable information into a useful language.

A language is not simply that which can be expressed with letters and punctuation – for you see there is spoken language, body language, mathematical language, academic language, financial language, political language, computer language, jargon language, musical language, visual language – the list never ends.

Below, the secret algorithm, the three steps to creating value:

  1. learn some valuable information – “there are 249 people here, two train tracks, the train for Uppsala departs at 13:51, two people jogging towards the second track look stressed out”
  2. distill to its meaning – why? how? etc.
  3. encode in a language – “most people get onto trains in time, some don’t”

At the end of the process is a product, but it isn’t the product that is important. It’s the meaning! Reiterate this process, starting out with knowing that people need to get on trains in time – so they need to know when trains depart. Value!

There is no way to learn a language without using it, and what you create in a language is a candidate to becoming part of the language. The implication is that a language adapts to whatever it is used for.

Linguists have known this for a long time, and so have software developers – “open-source software” is the name of a global language for programming, spearheaded by GitHub and Bitbucket, who create value by enabling its communication and to a large degree its existence. Learning that “Linus Torvalds uses e-mail to manage patches for Linux” distills to “developers need to collaborate.”

The software industry is concerned with making programs to create value in other languages. Facebook is concerned with social language, where checkins at fabulous places with fabulous people consuming fabulous products is the name of the game. Twitter is concerned with news language, EXTRA EXTRA, shout loud, shout immediately, shout often.

What valuable information do you have, and what can you distill from it? Leave a comment below the break!


Annoyance #1023

People who reply in chat ending with a comma,


Introducing the Sleepytime Clock

Sleepytime Clock is meant to help you figure out when to set the alarm in order to match your sleeping cycles. Most people will know by now that one wants to wake up in the early phases of the sleep cycle, and this is my analog take on the problem.
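The arithmetic behind a clock like this is short enough to sketch in Python. The numbers are assumptions rather than anything specified here: a sleep cycle of roughly 90 minutes and about 14 minutes to fall asleep. `wake_times` is a made-up name for illustration.

```python
from datetime import datetime, timedelta

CYCLE = timedelta(minutes=90)        # assumed length of one sleep cycle
FALL_ASLEEP = timedelta(minutes=14)  # assumed time needed to doze off


def wake_times(bedtime, cycles=range(3, 7)):
    """Candidate alarm times that land right after a whole number of cycles."""
    return [bedtime + FALL_ASLEEP + n * CYCLE for n in cycles]


# Going to bed at 23:00 gives alarms after 3, 4, 5 and 6 full cycles.
for alarm in wake_times(datetime(2024, 1, 1, 23, 0)):
    print(alarm.strftime("%H:%M"))
```

Real cycles vary from person to person and night to night, so treat the output as a starting point for experiments, not a prescription.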

Inspiration comes from sleepyti.me.


Get it done

Procrastinating too much? Start out small, just a corner, go for it.


pylibmc 1.2.3 release candidate

A new pylibmc is in the works and I would love it if I could get some attention to this. The release is available on the Google Group thread:

pylibmc 1.2.3 release thread


So you're a joker?

I read an article by some guy called SM on the subject of jokers. He says the world is full of jokers - people who talk a lot but do little.

I am a fuck-up at my current workplace - I handle sick leaves poorly, I show up for work five minutes late rather than five minutes early; I am a fuck-up at house chores - I rarely do the dishes, laundry is everywhere, cleaning is the last thing I think about; I sometimes fuck up with friends - I miss out on keeping in touch, I borrow money and forget about it, I hit on some poor guy's ex, the list goes on.

I am not a fuck-up in my true nature, in fact I'm probably more of an over-zealous Asperger kid inside. I don't give up before it's too late, and I find a way when I need to. I move heaven and earth, as SM puts it.

At first the logic seems counter-intuitive, but really it's an age-old problem: you have an infinite set of chores, and a limited rate of chore churning. How do you balance the workload; what do you do well, half-assed and not at all? More often than not, there is a conflict of interest between the various aspects of life. You have to call the shots.

The todo list is the only way to avoid being a joker. You will have to defer tasks. That's just reality. You will sometimes defer tasks up to a point where you realize, "ah man wish I was going to do this but I'm not." That's not being a joker, that's just you being rational.

So while I agree that it's a good thing to go into tunnel vision mode and just churn out a product in no time, it's also not a viable lifestyle. SM makes it seem as if the only way to live is 150% speed all the time and get rich.

Call me complicated, but I want more out of life than that. If what it takes to make piles of money is complete tunnel vision, then I shall have none of it. Let me sit smug-faced in my middle-class bed and enjoy life before it passes me by.


Ten Ways to Solve DNS Problems (or: the web is amazing)

So I wrote about my woes with DNS, bemoaning how our VPS provider GleSYS's DNS servers were not performing well enough. As usual with the web, I was blown away by the feedback; not only did I get over a dozen tips on what to do, GleSYS themselves chimed in to say they've fixed the problem.

Either that's a PR move on their part, or their technicians are very attentive. I'd like to think the latter. So without further ado, here are the ten ways in which to solve the case of the slow DNS look-up:

There are of course pros and cons to every one of the options above, and I'll just quickly address some obvious questions.

First up, BIND. As much as I love ISC software, BIND feels a little too heavy-duty for a one-off thing like this.

djbdns is, I'm sure, quality software too; here the problem is deployment. For djbdns, "integrating with the OS" means "write your own rc replacement and shove it down people's throats". I refer of course to the bane that is daemontools. I gave it a shot with qmail, never ever again.

As for OpenDNS and Google Public DNS, I'd have to benchmark them over a week or so to know what to think of them. However, I'd much prefer to do business with people who will be accountable for downtime.

By far the most interesting of them is Unbound, because of what it says on the box: a lightweight caching DNS server.

For now it looks like GleSYS have fixed things on their end; if this becomes a problem again, it might be better to change VPS provider.


GleSYS, Y U NO DNS?

... or why DNS lookups are a dangerous thing.

At my current employer we specialize in making campaigns, and this particular one is a Facebook Canvas type of thing, meaning we talk to the Facebook API.

It turns out though, one day after launching the campaign, that the local DNS resolver is sometimes unable to resolve the name facebook.com or graph.facebook.com in a timely fashion.

Looking into the matter I wrote a script for benchmarking the performance of socket.gethostbyaddr(), for your convenience as well as future reference:

#!/usr/bin/env python2.6

import sys, time, socket

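ts = []  # one wall-clock duration per lookup, in seconds
def test_host(h):
    t0 = time.time()
    try:
        socket.gethostbyaddr(h)
    except socket.error:
        # herror and gaierror both derive from socket.error
        print "resolve failed", repr(h)
    # failed lookups are timed too, since slow failures matter as well
    ts.append(time.time() - t0)

def avg(L): return sum(L)/float(len(L))
def med(L):
    L = sorted(L)
    n = len(L)
    if n & 1:
        return L[n // 2]
    else:
        return (L[n // 2 - 1] + L[n // 2]) / 2.0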

t0 = time.time()
test_host("facebook.com")
test_host("www.facebook.com")
test_host("graph.facebook.com")
test_host("api.facebook.com")
test_host("api-read.facebook.com")
test_host("api-video.facebook.com")
print "started %.2f, completed in %.2f" % (t0, time.time() - t0)
print "slowest %.4f, fastest %.4f" % (max(ts), min(ts))
print "median %.4f, average %.4f" % (med(ts), avg(ts))

We use GleSYS, a common provider in Sweden, for our VPS needs. Guess what their DNS performance looks like? Sometimes it takes up to 40 seconds for them to resolve facebook.com, when two seconds earlier they could answer the query in under 1ms.
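Since the trouble is intermittent, a single run of the script above says little. A rough Python 3 sketch of sampling the same lookups over several rounds could look like this. Note that it swaps in socket.getaddrinfo() for gethostbyaddr(), and that the function names, round count and pause are all made up for illustration.

```python
import socket
import statistics
import time


def time_lookup(host, resolve=socket.getaddrinfo):
    """Return seconds spent resolving host, or None if resolution failed."""
    start = time.perf_counter()
    try:
        resolve(host, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return time.perf_counter() - start


def sample(hosts, rounds=5, pause=1.0, resolve=socket.getaddrinfo):
    """Time every host once per round, pausing between rounds."""
    timings = {host: [] for host in hosts}
    failures = {host: 0 for host in hosts}
    for i in range(rounds):
        for host in hosts:
            elapsed = time_lookup(host, resolve)
            if elapsed is None:
                failures[host] += 1
            else:
                timings[host].append(elapsed)
        if i < rounds - 1:
            time.sleep(pause)
    return timings, failures


def summarize(times):
    """Median and worst case; the worst case is what exposes a flaky resolver."""
    return {"median": statistics.median(times), "slowest": max(times)}
```

Something like sample(["facebook.com", "graph.facebook.com"], rounds=60, pause=2.0) would then cover a couple of minutes. Lookups go through the system resolver, so any local cache will flatter the numbers.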

For now I just chucked the relevant hostnames into /etc/hosts, so: I could use a tip on a lightweight recursive DNS server! (Not BIND or djbdns.)



Ludvig
