Simple and efficient spam prevention techniques

Originally posted in my old blog at My Opera

I've previously outlined some alternative methods for CAPTCHA/spambot prevention in Different kinds of CAPTCHA.

Josh Clark recently posted Seven Habits of Highly Effective Spambot Hunters which gives even more good methods for preventing spam.

But with spambots gaining more and more features, what can we do to effectively prevent them, while still keeping our methods usable by most people and easy to code for us?

General usability issues

Josh's post shows some smart ways of preventing spam, but some of them require the user's browser to support JavaScript. People who surf with browsers that lack JS support, or who disable JS, are a small percentage of users these days, but in my opinion they have an equal right to post comments to my blog or anyone else's.

While fine for most browsers, hiding fake form fields with CSS poses a small problem for users of minority browsers, not all of which support CSS. Adding “Please leave this field empty” or something similar near the field is OK, but it's not ideal. At least I would find it kind of weird to be asked to leave a field empty without knowing what the field is for.

Issues from semantic HTML

Writing semantic HTML is gaining popularity these days. This brings some issues to spam prevention: obfuscating form field names, as Josh mentions, is a good way to confuse spambots, but if the spambots are programmed to understand “semantic forms”, the obfuscation loses its point. The spambot can just look at the contents of the field's label, which tell it what the field is for.

Programmer's point of view

While useful, some methods, like using Akismet or obfuscating form fields with timestamps, can take a while to get working properly. For bigger projects that might not be a problem, but you might not always want to spend much time on spam prevention when there are simpler ways to do it.

Spambot tech

Spambots are getting better and better at rendering pages. Some can even understand CSS and some JavaScript, so if you have a fake form field and it's hidden with CSS display:none, the bot could detect that and leave the field empty.

A spambot called XRunner has things like these in its feature list:

E-mail activation protection, meaning that the bot can automatically check your email for confirmation mails
Java-script protection. It can bypass some JavaScript-protected forms
A built-in proprietary “Question-answer” system. It talks with itself, making the spam look like real comments from many IPs using proxies.
Software can perform registration at forums. It can register accounts on forums or blogs.

This is just one bot and if this one can do things like these, what can the rest do?

Answering all this

So what can you do to prevent spam, while keeping it usable, semantic and easy to code?

First, we need to find the best prevention methods that are also very simple to implement. Then we combine all of them to prevent spam.

A simple question (1+1=?, What comes after A in the alphabet, etc.)
Fake form fields
Header checks
Field value checks with obfuscated field names

The first two work quite well alone, so if we combine them, we should get quite a strong defense. Adding the last two, both server-side checks, should give us a very solid spam-stopping wall. All four are also user-friendly and simple to code.
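
As a rough illustration, the first three checks could be sketched in Python like this. The field names (answer, website2), the question and the header rules are my own assumptions for the example, not part of any particular library. The header check here only requires a User-Agent and rejects a Referer that points to another site, since some legitimate browsers and privacy tools send no Referer at all.

```python
from urllib.parse import urlparse

# Hypothetical names and values, chosen only for this sketch.
HONEYPOT_FIELD = "website2"        # fake field; humans should leave it empty
QUESTION = "What is 1 + 1?"        # shown next to the "answer" field
ACCEPTED_ANSWERS = {"2", "two"}


def passes_question(form: dict) -> bool:
    """True if the visitor answered the simple question correctly."""
    answer = form.get("answer", "").strip().lower()
    return answer in ACCEPTED_ANSWERS


def passes_honeypot(form: dict) -> bool:
    """True if the fake field is missing or was left empty."""
    return form.get(HONEYPOT_FIELD, "").strip() == ""


def passes_header_check(headers: dict, site_host: str) -> bool:
    """True if a User-Agent is present and any Referer points to our own host."""
    if not headers.get("User-Agent", "").strip():
        return False
    referer = headers.get("Referer", "")
    if referer and urlparse(referer).netloc != site_host:
        return False
    return True


def looks_human(form: dict, headers: dict, site_host: str) -> bool:
    """Combine all three checks; failing any one of them rejects the post."""
    return (
        passes_question(form)
        and passes_honeypot(form)
        and passes_header_check(headers, site_host)
    )
```

A form handler would call looks_human() with the posted fields and the request headers before saving anything, and silently drop or bounce the submission when it returns False.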

I mentioned the first three in my previous post about CAPTCHA, so check it for some more details.

The last one can be very useful: if you're not giving a spambot obvious clues about what the fields' contents should be, it might randomly fill them with whatever values it has been given, or in the best case, it could completely ignore the form.
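
One possible way to obfuscate field names, sketched in Python, is to derive them from a server-side secret and a per-form token such as the timestamp at which the form was rendered. SECRET_KEY, obfuscate() and decode_form() are hypothetical names made up for this example. Keep in mind that a readable label next to each field can still tell a clever bot what the field is for.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret, never sent to the client


def obfuscate(field: str, form_token: str) -> str:
    """Derive an opaque field name from the real name and a per-form token."""
    message = f"{form_token}:{field}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return "f" + digest[:16]  # leading letter so the name never starts with a digit


def decode_form(posted: dict, form_token: str, fields: list) -> dict:
    """Map the posted obfuscated names back to the real field names."""
    return {
        field: posted.get(obfuscate(field, form_token), "")
        for field in fields
    }
```

When rendering the form you would use obfuscate("name", token) as the input's name attribute and send the token along in a hidden field; on submit, decode_form() recovers the real fields, and anything posted under the plain names "name" or "email" can be treated as suspicious.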

When the bot doesn't know which values belong in which fields, it may put a URL in the name field or do some other funny stuff. This is when you step in with your server-side scripts and check the fields.
Check for invalid values, overly long values, URLs, email addresses, BB code or HTML in fields which should not accept them. If you spot invalid data in a field, you can completely ignore the submission and not post it. There's always a possibility that a real user accidentally filled in bad data, so you could also redirect them back to the form and ask them to fill in the fields again with proper data. Spambots won't understand instructions written in plain English that well, so it should be safe to do that.
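
The field checks described above might look something like this in Python. The rules per field, the length limits and the regular expressions are deliberately crude assumptions for illustration; tune them to what your own form should accept.

```python
import re

# Rough patterns, assumed for this sketch; they are not exhaustive.
PATTERNS = {
    "url": re.compile(r"(https?://|www\.)", re.IGNORECASE),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    # HTML tags, or common BB code tags such as [url], [img], [b]
    "markup": re.compile(
        r"<[a-z/!][^>]*>|\[/?(url|img|b|i|u|quote|code)\b[^\]]*\]",
        re.IGNORECASE,
    ),
}

# Hypothetical rules: maximum length and the kinds of content each field rejects.
RULES = {
    "name": {"max_len": 50, "forbid": ("url", "email", "markup")},
    "email": {"max_len": 100, "forbid": ("url", "markup")},
    "comment": {"max_len": 2000, "forbid": ("markup",)},
}


def find_problems(form: dict) -> dict:
    """Return {field: [issues]} for every field that breaks its rules."""
    problems = {}
    for field, rule in RULES.items():
        value = form.get(field, "")
        issues = []
        if len(value) > rule["max_len"]:
            issues.append("too long")
        for kind in rule["forbid"]:
            if PATTERNS[kind].search(value):
                issues.append(f"contains {kind}")
        if issues:
            problems[field] = issues
    return problems
```

An empty result means the data looks fine. Otherwise you can either discard the submission or send the visitor back to the form with the listed issues shown next to each field.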

With these simple-to-implement features, your forms should be relatively spam-free. They should also be quite simple to turn into an easy-to-use class/library for your server-side language of choice, for future use in other projects.