PLEASE note: These pages are here solely for historic purposes. New articles have not been written since 2001; many links in the index are broken; and most ahref.com email addresses will now bounce. Try visiting ep Productions, Incorporated, the web programming and development company behind this site.


Technology Guide

Apache and PHP vs. the Spambots Continued

robots.txt

When a robot visits a website, it is supposed to check for a "robots.txt" file before doing anything else (at http://www.ahref.com/robots.txt for www.ahref.com, for example). The robots.txt file tells the robot which pages on the site it may access. If a robot disobeys the directives in robots.txt, either by never checking the file or by checking it and then ignoring what it says, it's a bad robot.

To tell robots not to visit any pages on your site, the robots.txt file should say:


  User-agent: *
  Disallow: /

That is, all user-agents (*) are disallowed from visiting anything on the site (anything under /).

But you probably want some robots - for example, search engine robots - to traverse your site. To set a trap for a bad robot, put something like the following in your robots.txt file:


  User-agent: *
  Disallow: /int/
  Disallow: /inttoo/

This tells robots not to go into the /int/ or /inttoo/ sections on your website. (Choose another word if you actually have valid content in such a directory on your site.) So no good robots will go there.
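To make the "good robot" side of the trap concrete, here is a rough sketch of the check a well-behaved robot might run before fetching a page. The function name isDisallowed is hypothetical, and the parsing is deliberately simplified: it only honors records for all user-agents (*) and treats each Disallow value as a plain path prefix. Real crawlers handle more cases than this.

```php
<?php
// Hypothetical helper, not part of any library: returns true when $path
// falls under a Disallow rule in a record addressed to all robots (*).
function isDisallowed(string $robotsTxt, string $path): bool
{
    $appliesToAll = false;
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            // Simplification: only the wildcard record is considered.
            $appliesToAll = ($value === '*');
        } elseif ($field === 'disallow' && $appliesToAll && $value !== '') {
            // Each rule is treated as a simple path prefix.
            if (strpos($path, $value) === 0) {
                return true;
            }
        }
    }
    return false;
}

// The trap rules from the robots.txt example above.
$rules = "User-agent: *\nDisallow: /int/\nDisallow: /inttoo/\n";
var_dump(isDisallowed($rules, '/int/x.html'));  // the trap page
var_dump(isDisallowed($rules, '/index.html'));  // an ordinary page
```

A robot that runs a check like this never requests anything under /int/ or /inttoo/, which is why any hit on those directories is suspect.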

You don't want normal users to go there, either; so don't put any obvious links to that directory on your web pages. But to lure in bad robots, put an invisible link on your front page (and possibly elsewhere), around a single-pixel transparent gif, leading to a page in the first disallowed directory:


  <a href="/int/x.html"><img src="pixel.gif"
    width="1" height="1" alt="" border="0"></a>

Normal users shouldn't go there, because the link is invisible; and good robots won't go, because it's disallowed. So anything that does follow the link will be a bad robot. Make sure that the page you link to in the disallowed directory is PHP-parsed (my server is set to parse .html files with PHP), because it's supposed to notify you of unwelcome visitors.

Getting Robot Alerts From PHP

To get PHP to alert you when the trap page is hit, just put the following code in the page (substitute your own domain name for ???.com):


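  <?php
      // Request details come from $_SERVER; the bare $REMOTE_ADDR-style
      // globals only existed with register_globals, which current PHP lacks.
      $TO = "webmaster@???.com";
      $FROM = "robo-watch@???.com";
      $SUBJECT = "???.com alert: bad robot";
      // The user agent header is optional, so allow for it being absent.
      $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "(none)";
      // REQUEST_URI holds the requested path (not the full URL).
      $mess = "A bad robot hit " . $_SERVER['REQUEST_URI'] . ".\n";
      $mess .= "Address is " . $_SERVER['REMOTE_ADDR'] . "\n";
      $mess .= "Agent is $agent\n";
      mail($TO, $SUBJECT, $mess, "From: $FROM");
      // Feed the robot a page from the second disallowed directory.
      readfile("http://???.com/inttoo/y.html");
  ?>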

As you can imagine, whenever someone visits the page, an email message will go to webmaster@???.com, with details on the IP address and user agent used. (Keep in mind that the spambot-writer determines what the name of the user agent is; often, they'll claim to be a "normal" browser, like "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)", when they're not.)
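Because the user agent string is whatever the robot chooses to send, it may be worth treating it as untrusted text before it goes into the alert. The following is only a sketch under that assumption: buildAlert is a hypothetical helper, the 200-character cap is an arbitrary choice, and the sample address comes from a range reserved for documentation.

```php
<?php
// Hypothetical helper: builds the alert body from a $_SERVER-style array.
function buildAlert(array $server): string
{
    $fields = array(
        'Page'    => 'REQUEST_URI',
        'Address' => 'REMOTE_ADDR',
        'Agent'   => 'HTTP_USER_AGENT',
    );
    $body = "A bad robot hit the trap.\n";
    foreach ($fields as $label => $key) {
        // A robot may omit headers entirely, so fall back to a marker.
        $value = isset($server[$key]) ? $server[$key] : '(not sent)';
        // Strip control characters, then cap the length (arbitrary limit).
        $value = substr(preg_replace('/[\x00-\x1F\x7F]/', '', $value), 0, 200);
        $body .= "$label is $value\n";
    }
    return $body;
}

// Sample values only; in the trap page you would pass $_SERVER instead.
echo buildAlert(array(
    'REQUEST_URI'     => '/int/x.html',
    'REMOTE_ADDR'     => '192.0.2.10',
    'HTTP_USER_AGENT' => 'Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)',
));
```

In the trap page itself, the returned string could be passed as the message argument to mail().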

The readfile line gives the robot some text to chew on and starts the neutralization process. Notice that it references a URL in the second directory that you declared disallowed in the robots.txt file. (If you used another directory name in robots.txt, use that directory name here.) Create the y.html file in that directory; you can put any text you want in it (maybe just another copy of your front page, minus its links).

So now you're getting notified whenever bad robots - including spambots - come to your site. Next, we neutralize them.


This site © 1998-1999 ep Productions, Inc. Text of any articles is copyright of the author. All rights reserved. Terms of use.