PLEASE note: These pages are here solely for historic purposes. New articles have not been written since 2001; many links in the index are broken; and most ahref.com email addresses will now bounce. Try visiting ep Productions, Incorporated, the web programming and development company behind this site.

Technology Guide

Apache and PHP vs. the Spambots Continued

You have three choices at this point, two good, one bad:

  1. just keep the robot away from your content
  2. keep the robot away from your content, but keep it coming back for more
  3. feed it fake email addresses

Keeping the robot away from your content will take less of your machine's resources. Keeping it coming back for more pages - email-address-less pages - will use up your resources, but it will also use the spambot user's resources, which might make you feel good.

Feeding the robot fake email addresses is a bad idea, for two reasons. First - if you randomly generate fake email addresses, you might end up creating a real one by accident, which will be bad news for whoever owns that address. Second - spammers often use someone else's email address when sending spam; if you give the spammer 500 fake email addresses, someone else will probably get the bounces. Even if the spammer doesn't use a "valid" email address to spam from, someone's mail server will have to deal with all the bounces. Which isn't fun. So please - don't feed the spambot.

Keeping the Spambot Away

Here's where mod_rewrite comes in. Once you get the email message that a bad robot is at your site, look at the user agent the robot identifies itself with. If it has a distinctive user agent (something like "EmailWolf" or "WebBandit" rather than "Mozilla"), put the following lines in Apache's srm.conf file (or whichever Apache config file you like to keep such things in). This will send any users with that same user agent to /inttoo/index.html, no matter what URL they try to access on your site:


  RewriteEngine  on
  RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT
  RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]

Be sure to substitute the bad robot's user agent for BADUSERAGENT and your website's document root for DOCROOT. A strange quirk I've noticed with mod_rewrite: it doesn't seem to like blank spaces in the value that RewriteCond looks for (the documentation for mod_rewrite doesn't use any blank spaces in example values, either). The most likely reason is that directive arguments are separated by whitespace, so a bare space ends the pattern. So if the user agent is "EmailWolf 4.0", don't use the full user agent on the RewriteCond line; just use everything before the blank space - "EmailWolf". Because the pattern starts with ^ and has no closing $, it still matches the full user agent string.
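
If you'd rather match the full user agent, version number and all, wrapping the pattern in double quotes should keep the space inside it. This is a sketch that assumes a reasonably current Apache; I haven't checked how far back quoted patterns are accepted. "EmailWolf 4.0" is only the example name from above, and DOCROOT is still a placeholder for your document root:

```apache
RewriteEngine  on
# The quotes keep the space inside the pattern;
# the backslash makes the dot match a literal "." only.
RewriteCond %{HTTP_USER_AGENT} "^EmailWolf 4\.0"
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]
```

Escaping the space with a backslash (^EmailWolf\ 4\.0) is a commonly used alternative that should behave the same way.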

If the bad 'bot doesn't have a distinctive user agent, put this instead, to block the IP address the robot is coming from:


  RewriteEngine  on
  RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS
  RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]

Substitute the robot's IP address for BADIPADDRESS. If you have more than one type of robot you want to fool, your config file will look something like this:


  RewriteEngine  on
  RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS1 [OR]
  RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS2 [OR]
  RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT1 [OR]
  RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT2
  RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]

This will send any requests from the IP addresses BADIPADDRESS1 or BADIPADDRESS2, or user agents BADUSERAGENT1 or BADUSERAGENT2, to /inttoo/index.html. (By the way: you'll need to restart Apache for the changes to the configuration files to "take.")
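
One thing to watch when filling in a real IP address: the value is a regular expression, so an unescaped dot matches any character, and a pattern with no closing $ matches anything that merely starts with it (^10.1.1.1 would also catch 10.1.1.10 through 10.1.1.199). Here is a sketch of a stricter version. The address 192.0.2.10 comes from a range reserved for documentation and only stands in for the robot's real address:

```apache
RewriteEngine  on
# Exactly one address: dots escaped, pattern closed with $
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$ [OR]
# A whole block of addresses: leave the pattern open after the last dot
RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]
```

The second condition is deliberately left open so that it covers every address beginning with 198.51.100. (also a documentation-only range).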

If you want to use up the spambot user's resources (and, unfortunately, your own), read on.

Keep That Spambot Comin'

If you don't mind getting a few hundred or thousand extra hits from bad robots, then instead of creating the inttoo directory and an index.html file for it, create a file called inttoo in your top document directory (so it's accessible at http://www.???.com/inttoo) and put the following text in it:


  <html>
  <head><title>howdy</title></head>
  <body>
  <?php
  /* This program generates a random series of URLs
  to waste bad robots' time */

  /* number of links to generate is 6 to 10;
  we'll force the robot to wait 10-20 seconds;
  we'll have 30 random words on the page, too.
  mt_rand (min, max) returns a whole number in
  that range. PHP 4.2 and later seed the generator
  automatically; on older versions, call
  mt_srand (time ()) first. */

  $numlinks = mt_rand (6, 10);
  $numwords = 30;
  $sleep_delay = mt_rand (10, 20);

  /* Set the dictionary file to a file with a
  line-delimited series of words, each on one
  line. My /usr/dict/words file is 45,000 words
  long; you should probably copy just a thousand
  words into another file and use that file. */

  $dictionary_file = "/usr/dict/words";
  $wlist = file ($dictionary_file);
  if (!$wlist) {
    /* fall back to a single word if the file can't be read */
    $wlist = array ("howdy");
  }
  $lastword = count ($wlist) - 1;

  /* generates some random non-linked words,
  so not everything on the page is a link,
  which is something bots might look out for;
  trim () drops the newline that file () leaves
  on the end of each word */

  for ($wcount = 0; $wcount < $numwords; $wcount++) {
    $word = trim ($wlist[mt_rand (0, $lastword)]);
    print "<br>" . htmlspecialchars ($word) . " ";
  }
  sleep ($sleep_delay);

  /* base_url is the directory which was disallowed
  in robots.txt. this generates a bunch of
  random links, all into that disallowed directory;
  urlencode () keeps odd characters in a word
  from breaking the link */

  $base_url = "/inttoo/";
  for ($wcount = 0; $wcount < $numlinks; $wcount++) {
    $word = trim ($wlist[mt_rand (0, $lastword)]);
    print "<br><a href=\"$base_url" . urlencode ($word) . "\">" . htmlspecialchars ($word) . "</a>\n ";
  }
  ?>
  </body>
  </html>
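
The comment in the script suggests copying about a thousand words into a smaller file. Here is one way that might be done, as a rough sketch rather than part of the original program: the function name sample_words and the file paths you pass it are made up for illustration, and it assumes PHP 5 or later run from the command line.

```php
<?php
/* Hypothetical helper: copy a random sample of words from a big
   dictionary file into a smaller one, one word per line. */

function sample_words($source, $target, $howmany = 1000)
{
    // Read one word per line, dropping newlines and empty lines.
    $words = file($source, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($words === false) {
        return 0;   // source file could not be read
    }

    // Shuffle, then keep the first $howmany words (or fewer if the
    // source file is shorter than that).
    shuffle($words);
    $sample = array_slice($words, 0, $howmany);

    file_put_contents($target, implode("\n", $sample) . "\n");
    return count($sample);
}

// Command-line use: php sample_words.php SOURCE TARGET
if (PHP_SAPI === 'cli' && isset($argv[1], $argv[2])) {
    echo sample_words($argv[1], $argv[2]), " words written\n";
}
```

Point $dictionary_file in the inttoo program at whatever target file you chose, so each request loads the short list instead of the full dictionary.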


If you have multiple virtual hosts on one server, you'll need a copy in each document tree, or one copy that is linked to from each document tree.

Last, but not least, add the following lines to your Apache httpd.conf file (I'm still using PHP3; change x-httpd-php3 if you're not):


  <Location /inttoo>
  ForceType application/x-httpd-php3
  </Location>

and change the line:

  RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]

to:

  RewriteRule ^.*$ DOCROOT/inttoo [L,T=application/x-httpd-php3]

in the srm.conf file (or wherever you put it). That should all be one line, by the way. Again, if you're using PHP4, change the line appropriately.

This will force any calls to URLs under http://www.???.com/inttoo/ to just call the program inttoo. Any robot following links in that URL-space will just keep getting randomly-generated pages, each taking 10-20 seconds to load, without any email addresses on the pages.
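
To check that the trap is working, you could request a page yourself while claiming to be the robot. The following is only a sketch: the function names are invented for illustration, it assumes PHP 5 or later with allow_url_fopen turned on, and the URL and user agent you pass in are your own. The timeout is set to 30 seconds because the trap page sleeps for 10-20 seconds before it finishes.

```php
<?php
/* Hypothetical check: fetch a URL with a chosen User-Agent header
   and print whatever the server sends back. */

function robot_context($user_agent, $timeout = 30)
{
    // Stream context that makes PHP's HTTP wrapper send our user agent.
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $user_agent,
            'timeout'    => $timeout,
        ),
    ));
}

function fetch_as_robot($url, $user_agent)
{
    // Returns the page body, or false if the request failed.
    return @file_get_contents($url, false, robot_context($user_agent));
}

// Command-line use: php fetch_as_robot.php URL USERAGENT
if (PHP_SAPI === 'cli' && isset($argv[1], $argv[2])) {
    $page = fetch_as_robot($argv[1], $argv[2]);
    echo $page === false ? "request failed\n" : $page;
}
```

If the rewrite rule matches, any URL on the site should come back as a page of random words and links into /inttoo/ rather than your real content.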


This site © 1998-1999 ep Productions, Inc. Text of any articles is copyright of the author. All rights reserved. Terms of use.