Apache and PHP vs. the Spambots
09/22/2000
by Edward Piou
This article is about stopping bad robots - specifically, spambots. A "spambot" is a program that visits a website, collects as many email addresses as it can from the site, and (generally) moves on to another site, where it will collect more email addresses. People run these programs to get lists of addresses which they'll either spam themselves, or sell to other people to spam.
To stop spambots using the methods listed here, you'll need the Apache web server with mod_rewrite, and PHP.
The methods can actually be pretty easily adapted for Perl instead of PHP. If you're using a webserver other than Apache, you'll need to figure out how to adapt the little tricks we use here for your own server software.
There are two steps involved in dealing with any spambots that come to your site: detecting them, then neutralizing them.
I first started thinking about detecting spambots when I attended a tutorial given by Lincoln Stein, Perl guru. One of the things he talked about was figuring out which of the "users" visiting his website were actually robots, and which of those robots were "bad" robots - robots that didn't check his robots.txt file to determine which pages they weren't allowed to access.
Stein wrote a Perl program, Robo-Cop, which went through his web server logs to identify robots and bad robots. The source code of the program is available online. Here is some sample output from a modified version of the program I used to run (don't worry, we'll get to PHP soon):
| Client | Robot | Hits | Interval | Hit_Percent | Index |
| 216.112.23.11: htdig/3.0.7 (andrew@contigo.com) | yes | 7954 | 0.65 | 9.03 | 13.91 |
| 10.0.0.1: AVSearch-3.0(EoExchange/Liberty) | yes | 2029 | 20.52 | 2.30 | 0.11 |
| 10.0.0.2: PLSpider/V1.0 | no | 1987 | 0.61 | 2.26 | 3.70 |
| 10.0.0.3: Microsoft Internet Explorer/4.40.426 (Windows 95) | no | 1311 | 7.16 | 1.49 | 0.21 |
| 10.0.0.4: Slurp/2.0-BigOwlWeekly (spider@aeneid.com; http://www.inktomi.com/slurp.html) | yes | 985 | 32.28 | 1.12 | 0.03 |
| 10.0.0.5: AVSearch-3.0(liberty/libertycrawl) | yes | 606 | 17.81 | 0.69 | 0.04 |
What the columns mean:
The program worked pretty well at first; but as I fed it larger server logs, it started taking too much memory and system resources (I was on a low-powered machine). Plus, if I wanted to actually do something about bad bots, I had to wait until the program ran every six hours (through a cron job), go to the output for the program, look for high Index values, and figure out if I wanted to neutralize the bot. This meant a spambot could be scouring my pages for 6 hours (or longer, if I didn't bother checking or was asleep) before I did anything about it.
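Judging from the sample rows, Index appears to be Hit_Percent divided by Interval, which is why a busy client with short gaps between hits scores high. The sketch below checks that reading against the figures in the table. The relationship is inferred from the sample output, not taken from the Robo-Cop source, the function name robot_index is made up for illustration, and the code is written for a current PHP release rather than PHP3.

```php
<?php
// Assumption: Index = Hit_Percent / Interval. This is inferred from the
// sample rows in the table, not taken from the Robo-Cop source code.
function robot_index(float $hitPercent, float $interval): float
{
    return $hitPercent / $interval;
}

// Rows from the sample output: client => [Hit_Percent, Interval, reported Index]
$rows = [
    'htdig/3.0.7'   => [9.03, 0.65, 13.91],
    'AVSearch-3.0'  => [2.30, 20.52, 0.11],
    'PLSpider/V1.0' => [2.26, 0.61, 3.70],
    'MSIE 4.40.426' => [1.49, 7.16, 0.21],
];

// Print the computed value next to the reported one for comparison.
foreach ($rows as $client => [$hitPercent, $interval, $reported]) {
    printf("%-14s computed %6.2f, reported %6.2f\n",
        $client, robot_index($hitPercent, $interval), $reported);
}
```

The computed values agree with the reported ones to within rounding of the two-decimal inputs, so sorting clients by this ratio should surface the same fast, heavy hitters the Index column does.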
Using PHP and robots.txt, I decided to be more proactive. To detect bad robots (almost) immediately, I would create web pages which neither people nor "good" robots would visit, and have my webserver send me email whenever anyone (presumably a bad robot) visited the page.
When a robot visits a website, it is supposed to check for a "robots.txt" file before doing anything else (at http://www.ahref.com/robots.txt for www.ahref.com, for example). The robots.txt file tells the robot what pages on the site it can access. If a robot disobeys the directives in robots.txt, either by not checking it in the first place or checking it then ignoring what it says, it's a bad robot.
To tell robots not to visit any pages on your site, the file should say:
User-agent: *
Disallow: /
That is, all user-agents (*) are disallowed from visiting anything on the site (anything under /).
But you probably want some robots - for example, search engine robots - to traverse your site. To set a trap for a bad robot, put something like the following in your robots.txt file:
User-agent: *
Disallow: /int/
Disallow: /inttoo/
This tells robots not to go into the /int/ or /inttoo/ sections on your website. (Choose another word if you actually have valid content in such a directory on your site.) So no good robots will go there.
You don't want normal users to go there, either; so don't put any obvious links to that directory on your web pages. But to lure in bad robots, put an invisible link on your front page (and possibly elsewhere), around a single-pixel transparent gif, leading to a page in the first disallowed directory:
<a href="/int/x.html"><img src="pixel.gif" border="0"></a>
Normal users shouldn't go there, because the link is invisible; and good robots won't go, because it's disallowed. So anything that does follow the link will be a bad robot. Make sure that the page you link to in the disallowed directory is PHP-parsed (my server is set to parse .html files with PHP), because it's supposed to notify you of unwelcome visitors.
To get PHP to alert you when the trap page is hit, just put the following code in the page (substitute your own domain name for ???.com):
<?PHP
$TO = "webmaster@???.com";
$FROM = "robo-watch@???.com";
$SUBJECT = "???.com alert: bad robot";
$mess = "A bad robot hit $SCRIPT_URI.\n";
$mess .= "Address is $REMOTE_ADDR\n";
$mess .= "Agent is $HTTP_USER_AGENT\n";
mail($TO, $SUBJECT, $mess, "From: $FROM");
readfile("http://???.com/inttoo/y.html");
?>
As you can imagine, whenever someone visits the page, an email message will go to webmaster@???.com, with details on the IP address and user agent used. (Keep in mind that the spambot-writer determines what the name of the user agent is; often, they'll claim to be a "normal" browser, like "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)", when they're not.)
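The code above relies on PHP3-style global variables such as $REMOTE_ADDR. On later PHP releases, where those globals are no longer set, the same values are read from the $_SERVER array. A rough sketch of the alert under that assumption follows; build_alert is a hypothetical helper name, the example.com addresses are placeholders, REQUEST_URI (the path only) stands in for $SCRIPT_URI, and the mail() call is commented out so the snippet can be tried harmlessly from the command line.

```php
<?php
// Sketch only: build_alert() is a hypothetical helper, not part of the
// original article. It takes the request data as an array so it can be
// exercised without a web server.
function build_alert(array $server): string
{
    // Fall back to "unknown" when a value is missing (e.g. no User-Agent header).
    $uri   = $server['REQUEST_URI'] ?? 'unknown';
    $addr  = $server['REMOTE_ADDR'] ?? 'unknown';
    $agent = $server['HTTP_USER_AGENT'] ?? 'unknown';

    $mess  = "A bad robot hit $uri.\n";
    $mess .= "Address is $addr\n";
    $mess .= "Agent is $agent\n";
    return $mess;
}

// Placeholder addresses; substitute your own domain.
$to      = 'webmaster@example.com';
$from    = 'robo-watch@example.com';
$subject = 'example.com alert: bad robot';

$message = build_alert($_SERVER);
// Uncomment on the trap page itself to send the alert:
// mail($to, $subject, $message, "From: $from");
print $message;
```

Keeping the message-building in its own function makes it easy to check the wording before wiring it to mail().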
The readfile line gives the robot some text to chew on and starts the neutralization process. Notice that it references a URL in the second directory that you declared disallowed in the robots.txt file. (If you used another directory name in robots.txt, use that directory name here.) Create the index.html file for that directory, plus the y.html page that readfile fetches - you can put any text you want in them (maybe just another copy of your front page, minus its links).
So now you're getting notified whenever bad robots - including spambots - come to your site. Next, we neutralize them.
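You have three choices at this point, two good and one bad: keep the robot away from your content, keep it busy with pages that contain no email addresses, or feed it fake email addresses.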
Keeping the robot away from your content will take less of your machine's resources. Keeping it coming back for more pages - email-address-less pages - will use up your resources, but it will also use the spambot user's resources, which might make you feel good.
Feeding the robot fake email addresses is a bad idea, for two reasons. First - if you randomly generate fake email addresses, you might end up creating a real one by accident, which will be bad news for whoever owns that address. Second - spammers often use someone else's email address when sending spam; if you give the spammer 500 fake email addresses, someone else will probably get the bounces. Even if the spammer doesn't use a "valid" email address to spam from, someone's mail server will have to deal with all the bounces. Which isn't fun. So please - don't feed the spambot.
Here's where mod_rewrite comes in. Once you get the email message that a bad robot is at your site, look at the user agent that the robot is identifying itself as. If it has a distinctive user agent (something like "EmailWolf" or "WebBandit" rather than "Mozilla"), put the following lines in Apache's srm.conf file (or whichever Apache config file you like to keep such things in). This will send any users with that same user agent to /inttoo/index.html, no matter what URL they try to access on your site:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]
Be sure to substitute the bad robot's user agent for BADUSERAGENT and your website's document root for DOCROOT. A strange quirk I've noticed with mod_rewrite: it doesn't seem to like blank spaces in the value that RewriteCond looks for (the documentation for mod_rewrite doesn't use any blank spaces in example values, either). So if the user agent is "EmailWolf 4.0", don't use the full user agent on the RewriteCond line; just use everything before the blank space - "EmailWolf".
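The likely cause of the blank-space problem is the way Apache parses directives: arguments are separated by whitespace, so an unquoted space ends the pattern and whatever follows is treated as a separate argument. Quoting the pattern, or escaping the space with a backslash, should get around this. A sketch using the "EmailWolf 4.0" agent from the example, with DOCROOT still standing in for your document root:

```apache
RewriteEngine on
# Either form should match an agent string beginning "EmailWolf 4.0";
# the dot is escaped so that it matches a literal "." only.
RewriteCond %{HTTP_USER_AGENT} "^EmailWolf 4\.0" [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf\ 4\.0
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]
```

Matching on just the first word, as described above, remains the simpler option and also catches later versions of the same robot.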
If the bad 'bot doesn't have a distinctive user agent, put this instead, to block the IP address the robot is coming from:
RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]
Substitute the robot's IP address for BADIPADDRESS. If you have more than one type of robot you want to fool, your config file will look something like this:
RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS1 [OR]
RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS2 [OR]
RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT1 [OR]
RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT2
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]
This will send any requests from the IP addresses BADIPADDRESS1 or BADIPADDRESS2, or user agents BADUSERAGENT1 or BADUSERAGENT2, to /inttoo/index.html. (By the way: you'll need to restart Apache for the changes to the configuration files to "take.")
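One thing to watch with the IP rules: the value is a regular expression, so an unescaped dot matches any character, and a pattern anchored only at the start also matches longer addresses. For instance, ^10.0.0.1 would also catch 10.0.0.10 and 10.0.0.123. A tighter variant, as a sketch with made-up private addresses standing in for real ones:

```apache
RewriteEngine on
# Escaped dots and a trailing $ restrict this condition to one exact address.
RewriteCond %{REMOTE_ADDR} ^10\.0\.0\.2$ [OR]
# Dropping the last octet (and the $) deliberately blocks a whole range.
RewriteCond %{REMOTE_ADDR} ^10\.0\.1\.
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]
```

Blocking a range is worth considering when the same robot keeps returning from neighbouring addresses, at the cost of possibly catching innocent visitors on the same network.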
If you want to use up the spambot user's resources (and, unfortunately, your own), read on.
If you don't mind getting a few hundred or thousand extra hits from bad robots, then instead of creating the inttoo directory and an index.html file for it, create a file called inttoo in your top document directory (so it's accessible at http://www.???.com/inttoo) and put the following text in it:
<html>
<head><title>howdy</title></head>
<body>
<?PHP
/* This program generates a random series of URLs
to waste bad robots' time */
/* prep for random number generation;
number of links to generate is 6 to 10;
we'll force the robot to wait 10-20 seconds;
we'll have 30 random words on the page, too */
srand (time ());
$numlinks = rand (6, 10);
$numwords = 30;
$sleep_delay = rand (10, 20);
/* Set the dictionary file to a file with a
line-delimited series of words, each on one
line. My /usr/dict/words file is 45,000 words
long; you should probably copy just a thousand
words into another file and use that file. */
$dictionary_file = "/usr/dict/words";
$wlist = file ($dictionary_file);
/* highest valid index into the word list */
$lastword = sizeof ($wlist) - 1;
/* generates some random non-linked words,
so not everything on the page is a link,
which is something bots might look out for */
for ($wcount = 0; $wcount < $numwords; $wcount++) {
    /* trim() drops the newline that file() leaves on each word */
    $word = trim ($wlist[rand (0, $lastword)]);
    print "<br>$word ";
}
sleep ($sleep_delay);
/* base_url is the directory which was disallowed
in robots.txt. this generates a bunch of
random links, all into that disallowed directory */
$base_url = "/inttoo/";
for ($wcount = 0; $wcount < $numlinks; $wcount++) {
    $word = trim ($wlist[rand (0, $lastword)]);
    /* urlencode() keeps apostrophes and the like from breaking the link */
    print "<br><a href=\"$base_url" . urlencode ($word) . "\">$word</a>\n ";
}
?>
</body>
</html>
If you have multiple virtual hosts on one server, you'll need to have a copy in each document tree, or to create one copy and link to it from each document tree.
Last, but not least, add the following lines to your Apache httpd.conf file (I'm still using PHP3; change x-httpd-php3 if you're not):
<Location /inttoo>
ForceType application/x-httpd-php3
</Location>
and change the line:
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]
to:
RewriteRule ^.*$ DOCROOT/inttoo [L,T=application/x-httpd-php3]
in the srm.conf file (or wherever you put it). That should all be one line, by the way. Again, if you're using PHP4, change the line appropriately.
This will force any calls to URLs under http://www.???.com/inttoo/ to just call the program inttoo. Any robot following links in that URL-space will just keep getting randomly-generated pages, each taking 10-20 seconds to load, without any email addresses on the pages.
So now you know how to stop a spambot. You can either force it to download the same useless page each time it tries to get something from your site, or have it run through several hundred pages of links to nowhere.
If you're intent on feeding fake email addresses to the spammers, you can get Wpoison, a Perl program, and either port it to PHP or run it as-is.
You should also keep in mind that using PHP and Apache in the way I describe here isn't foolproof. If a spambot obeys the robots exclusion protocol - that is, it reads your robots.txt file and stays out of certain directories, as you direct - it won't be caught by this method; it will still look for email addresses in allowed directories. What you should do, then, is keep robots out of sections of your site which have lots of email addresses (bulletin boards, guest books, etc.). Using the robots exclusion protocol, you can actually let specific good robots (from legitimate search engines) into those areas, and keep the rest out. You can also use the Robo-Cop program I talked about earlier as a back-up to detect robots that obey your robots.txt file, but which you'd still like to keep off your site.
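Letting one named robot into an area that is closed to everyone else might look like the following sketch. GoodBot and /guestbook/ are made-up names for illustration. A robot that finds a record naming it follows that record and ignores the * record, so the trap directories are repeated there.

```text
# GoodBot may crawl the guest book but must still avoid the traps
User-agent: GoodBot
Disallow: /int/
Disallow: /inttoo/

# Everyone else is kept out of the guest book as well
User-agent: *
Disallow: /int/
Disallow: /inttoo/
Disallow: /guestbook/
```

Substitute the user-agent token that the search engine in question documents for its crawler.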
Edward Piou is an ahref.com producer and runs ep Productions, Inc., a development company based in the Washington, D.C. area.
This site copyright 1998-1999 ep Productions, Inc. Text of any articles is copyright of the author.