Apache and PHP vs. the Spambots Continued
robots.txt
When a robot visits a website, it is supposed to check for a "robots.txt" file before doing anything else (at http://www.ahref.com/robots.txt for www.ahref.com, for example). The robots.txt file tells the robot which pages on the site it may access. If a robot disobeys the directives in robots.txt, either by not checking the file in the first place or by checking it and then ignoring what it says, it's a bad robot.
To tell robots not to visit any pages on your site, the file should say:
User-agent: *
Disallow: /
That is, all user-agents (*) are disallowed from visiting anything on the site (anything under /).
But you probably want some robots - for example, search engine robots - to traverse your site. To set a trap for a bad robot, put something like the following in your robots.txt file:
User-agent: *
Disallow: /int/
Disallow: /inttoo/
This tells robots not to go into the /int/ or /inttoo/ sections on your website. (Choose another word if you actually have valid content in such a directory on your site.) So no good robots will go there.
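To make the rule concrete, here is a minimal sketch of how a well-behaved robot might decide whether a path is off limits. It assumes plain prefix matching against the Disallow lines that apply to all user agents; the function name isDisallowed is hypothetical, and real crawlers also handle Allow lines, wildcards, and per-agent records.

```php
<?php
// Hypothetical helper: returns true if $path falls under any Disallow prefix.
// Assumes plain prefix matching only (no Allow lines, no wildcards).
function isDisallowed(string $path, array $disallowPrefixes): bool
{
    foreach ($disallowPrefixes as $prefix) {
        // An empty Disallow value means "nothing is disallowed".
        if ($prefix !== '' && strpos($path, $prefix) === 0) {
            return true;
        }
    }
    return false;
}

// The two trap directories named in the robots.txt example.
$rules = ['/int/', '/inttoo/'];

var_dump(isDisallowed('/int/x.html', $rules));    // bool(true)
var_dump(isDisallowed('/inttoo/y.html', $rules)); // bool(true)
var_dump(isDisallowed('/index.html', $rules));    // bool(false)
```

A good robot that applies a check like this never requests anything under the trap directories, which is why a hit there is a reliable signal.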
You don't want normal users to go there, either; so don't put any obvious links to that directory on your web pages. But to lure in bad robots, put an invisible link on your front page (and possibly elsewhere), around a single-pixel transparent gif, leading to a page in the first disallowed directory:
<a href="/int/x.html"><img src="pixel.gif"
width="1" height="1" alt="" border="0"></a>
Normal users shouldn't go there, because the link is invisible; and good robots won't go, because it's disallowed. So anything that does follow the link will be a bad robot. Make sure that the page you link to in the disallowed directory is PHP-parsed (my server is set to parse .html files with PHP), because it's supposed to notify you of unwelcome visitors.
Getting Robot Alerts From PHP
To get PHP to alert you when the trap page is hit, just put the following code in the page (substitute your own domain name for ???.com):
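<?php
// Substitute your own domain name for ???.com.
$TO = "webmaster@???.com";
$FROM = "robo-watch@???.com";
$SUBJECT = "???.com alert: bad robot";
// Read request details from $_SERVER; register_globals-style variables
// such as $REMOTE_ADDR were removed in PHP 5.4.
$uri = $_SERVER['SCRIPT_URI'] ?? $_SERVER['REQUEST_URI'] ?? '(unknown)';
$addr = $_SERVER['REMOTE_ADDR'] ?? '(unknown)';
$agent = $_SERVER['HTTP_USER_AGENT'] ?? '(none)';
$mess = "A bad robot hit $uri.\n";
$mess .= "Address is $addr\n";
$mess .= "Agent is $agent\n";
mail($TO, $SUBJECT, $mess, "From: $FROM");
// Fetching a URL with readfile() requires allow_url_fopen to be enabled.
readfile("http://???.com/inttoo/y.html");
?>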
As you can imagine, whenever someone visits the page, an email message will go to webmaster@???.com, with details on the IP address and user agent used. (Keep in mind that the spambot-writer determines what the name of the user agent is; often, they'll claim to be a "normal" browser, like "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)", when they're not.)
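Because the user agent string is whatever the client chooses to send, it may be worth cleaning it up before it goes into the alert. The following is a minimal sketch, not part of the original trap page: the function name cleanHeaderValue and the 200-character limit are assumptions made for illustration.

```php
<?php
// Hypothetical helper: strips control characters (including CR and LF) from
// a client-supplied value and truncates it, so a forged header cannot add
// extra lines to the alert message.
function cleanHeaderValue(string $value, int $maxLength = 200): string
{
    $value = preg_replace('/[\x00-\x1F\x7F]/', '', $value);
    return substr($value, 0, $maxLength);
}

$agent = cleanHeaderValue("Mozilla/4.0 (compatible)\r\nBcc: someone@example.com");
echo $agent, "\n"; // Mozilla/4.0 (compatible)Bcc: someone@example.com
```

In the trap page, the cleaned value would take the place of the raw user agent string when the message is assembled.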
The readfile line gives the robot some text to chew on and starts the neutralization process. Notice that it references a URL in the second directory that you declared disallowed in the robots.txt file. (If you used another directory name in robots.txt, use that directory name here.) Create the y.html file in that directory - you can put any text you want in there (maybe just another copy of your front page, minus its links).
So now you're getting notified whenever bad robots - including spambots - come to your site. Next, we neutralize them.