Build Your Own Bot Trap

With Amazon now imposing usage restrictions on its API data feed, it is more important than ever to make sure you’re not wasting resources on unwanted visitors to your sites. By ‘unwanted’ we mean search engine spiders that comb your site, indexing hundreds (if not thousands) of pages, and give you nothing in return: no page rank, no shoppers, no income. As well as eating up your product request allocation, bad spiders (also known as bots) waste your bandwidth, eat into your allotted CPU time, and could force your site to go dark, leaving no chance for sales.

This tutorial outlines a simple method to create a ‘Bot Trap’, which will allow your sites to automatically ban bad bots. This trap was originally described here, but has been modified to ignore the most popular (and useful) bots, such as Google, Yahoo and MSN/Bing.

The basic outline is this: when a bot lands on your site, it’s supposed to read the robots.txt file to learn what it can and cannot index. It then goes through your site, following the links it finds. If a bot ignores the directives in the robots.txt file, it eventually requests a ‘trap’ file that records the bot’s IP address and adds it to a banned list in your .htaccess file. From then on the bot is denied access to your site. A spider that obeys the robots.txt file never requests the trap file, and is therefore not banned.

In order for this to work, you will need the following files in your site directory:

  1. robots.txt – The file that contains the list of files and/or directories you do not want scanned. You can exclude some or all bots from some or all files. A basic overview of this file can be read here.
  2. .htaccess – This file will hold the list of banned IPs. It’s the same file used by AOM for the mod_rewrite rulesets, as well as the DirectoryIndex directive. This file is used by the Apache web server, typically on Linux/Unix hosting.
  3. bad-bots.php – You can get a copy of this file below. It’s the ‘trap’ file that records the bot’s IP address and writes it to the .htaccess file. It’s often called a honeypot file.

Here is the code for the bad-bots.php file. Download and unzip.

Bad-bots zip file
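
To give a feel for what a trap file of this kind does, here is a minimal sketch. It is not the contents of the downloadable bad-bots.php: the function names, the user-agent substrings used to spare Google, Yahoo and MSN/Bing, and the "deny from" line format are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch of a bot trap; NOT the downloadable bad-bots.php.

// Assumption: friendly crawlers are recognised by a substring of their user-agent.
function is_whitelisted_agent(string $agent): bool
{
    $allowed = ['googlebot', 'slurp', 'msnbot', 'bingbot']; // Google, Yahoo, MSN/Bing
    foreach ($allowed as $needle) {
        if (stripos($agent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Build the ban line, or return null if $ip is not a valid IP address.
// Assumption: bans are written as Apache "deny from" lines.
function build_deny_line(string $ip): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return null;
    }
    return "deny from $ip";
}

// Append a ban for $ip to $htaccess unless the agent is whitelisted,
// the IP is invalid, or the IP is already listed. Returns true if a line was added.
function trap_visitor(string $ip, string $agent, string $htaccess): bool
{
    if (is_whitelisted_agent($agent)) {
        return false;
    }
    $line = build_deny_line($ip);
    if ($line === null) {
        return false;
    }
    $current = is_file($htaccess) ? (string) file_get_contents($htaccess) : '';
    $existing = array_map('trim', preg_split('/\R/', $current));
    if (in_array($line, $existing, true)) {
        return false; // already banned
    }
    // Start on a fresh line if the file does not end with a newline.
    $prefix = ($current !== '' && substr($current, -1) !== "\n") ? "\n" : '';
    return file_put_contents($htaccess, $prefix . $line . "\n", FILE_APPEND | LOCK_EX) !== false;
}

// In a real trap page the call might look like this (left commented out here):
// trap_visitor($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'] ?? '', __DIR__ . '/.htaccess');
```

Bear in mind that a user-agent string is easy to fake, so a substring whitelist like this one only spares bots that identify themselves honestly.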

If you want to send yourself an email every time the trap catches something, paste the following code into the file, just before the final ?>:

 

$subject = 'bad-bots';
$email = 'your_email@your_site.com'; // edit accordingly
$to = $email;
// $ip, $agent, $request and $referer are expected to be set earlier in bad-bots.php
$message = 'ip: ' . $ip . "\r\n" .
    'user-agent string: ' . $agent . "\r\n" .
    'requested url: ' . $request . "\r\n" .
    'referer: ' . $referer . "\r\n"; // often is blank

// wrap long lines at 70 characters
$message = wordwrap($message, 70, "\r\n");

$headers = 'From: ' . $email . "\r\n" .
    'Reply-To: ' . $email . "\r\n" .
    'X-Mailer: PHP/' . phpversion();

mail($to, $subject, $message, $headers);

 

Make sure to change your_email@your_site.com to your actual email address.

Next, make sure you have a robots.txt file set up, telling all spiders to avoid the bad-bots.php file:

User-agent: *
Disallow: /bad-bots.php

Since bad bots tend to ignore robots.txt, they will happily disregard this bit of information.
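
For comparison, this is roughly the check a well-behaved spider makes before requesting a page. It is a simplified sketch: the function names are made up for illustration, and it only handles "User-agent: *" groups with plain Disallow prefixes (no Allow lines, wildcards or per-bot groups).

```php
<?php
// Simplified sketch of how a compliant crawler might read robots.txt.
// Function names are hypothetical; only "User-agent: *" groups are handled.

// Collect the Disallow path prefixes that apply to all crawlers.
function disallowed_prefixes(string $robotsTxt): array
{
    $prefixes = [];
    $appliesToAll = false;
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $appliesToAll = ($value === '*');
        } elseif ($field === 'disallow' && $appliesToAll && $value !== '') {
            $prefixes[] = $value;
        }
    }
    return $prefixes;
}

// True if a crawler that honours robots.txt is allowed to request $path.
function may_fetch(string $path, string $robotsTxt): bool
{
    foreach (disallowed_prefixes($robotsTxt) as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}
```

With the two-line robots.txt above, may_fetch('/bad-bots.php', ...) comes back false, so a polite spider never touches the trap. A bad bot simply skips this check.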

Finally, put a link to the bad-bots.php file somewhere in your site, such as the Site Head section under the Site tab of your AOM control panel:

Where to add the link in your AOM control panel

You can put a link almost anywhere – the Disclaimer or Powered By areas, in a custom box, etc. You can also insert it into a custom header or footer file, if you use those. A good example of a link that the bots can find, but customers won’t, would be:

<a href="bad-bots.php" style="display: none;">jangle-tacks</a>

You should now have:

  • The bad-bots.php file installed on your site
  • A robots.txt entry, telling bots not to index the bad-bots.php file
  • A customer-hidden link in your AOM site to the bad-bots.php file

At this point, the trap should be ready to go. Bad bots will ignore the robots.txt directive, find the link and follow it. The bad-bots file will add the spider’s IP address to the .htaccess file, with instructions to deny it access in future, so the next time the bot comes to your site it will not be able to get in. Over time you may build up a substantial list, as many bots use more than one IP address. Eventually you should notice a decrease in bot traffic, as more and more of them wind up beating their heads against your .htaccess. If you can access your website error logs, you’ll start to see entries like “client denied by server configuration”. This means a possibly malicious spider has been blocked.

Congratulations – your bot trap is operational.

A few final notes:

If you attempt to view the bad-bots.php file in your browser, you’ll probably wind up getting your own IP blocked. If so, you’ll need to go into your .htaccess file and remove your IP.
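
If you would rather not edit .htaccess by hand, a small helper along these lines could strip your own address back out. This is only a sketch: unban_ip() is a made-up name, and it assumes each ban is stored as a line of the form "deny from 1.2.3.4", which may not match what your copy of bad-bots.php writes.

```php
<?php
// Hypothetical helper: remove the ban line(s) for one IP from an .htaccess file.
// Assumption: bans are stored one per line as "deny from <ip>".
// Returns the number of lines removed; all other lines are kept as they were.
function unban_ip(string $ip, string $htaccess): int
{
    if (!is_file($htaccess)) {
        return 0;
    }
    $lines = file($htaccess, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        return 0;
    }
    $kept = [];
    $removed = 0;
    foreach ($lines as $line) {
        if (strcasecmp(trim($line), "deny from $ip") === 0) {
            $removed++;
        } else {
            $kept[] = $line;
        }
    }
    if ($removed > 0) {
        $contents = $kept ? implode("\n", $kept) . "\n" : '';
        file_put_contents($htaccess, $contents, LOCK_EX);
    }
    return $removed;
}
```

Run it from a separate script (not from the trap page itself), and keep a backup of your .htaccess before letting any script rewrite it.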

There are bots that do obey the robots.txt file, and still manage to trash your server resources. You will need to find them (on sites using cPanel, you may be able to use AWStats to find the IP of any bot viewing an excessive number of pages), look up the IP to determine where the bot came from, and either contact its operator to have your sites excluded, or block it yourself. Also check your Error Log for large numbers of ‘File does not exist’ entries from the same IP over and over again. This could be a bot trying to find a way into your system. Some hosts offer an ‘IP Deny’ option in cPanel, or you can manually add the IP to the ban list in your .htaccess file.
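
If you can download your error log, a short script can do the counting for you. The sketch below assumes Apache-style error log lines containing "[client 1.2.3.4]" (optionally with a port) and only looks at IPv4 addresses; count_missing_file_hits() is a name invented for this example, and your host's log format may differ.

```php
<?php
// Hypothetical helper: count 'File does not exist' log entries per client IP.
// Assumption: each line carries the client as "[client 1.2.3.4]" or "[client 1.2.3.4:port]".
// Returns an array of ip => count, busiest IP first.
function count_missing_file_hits(array $logLines): array
{
    $counts = [];
    foreach ($logLines as $line) {
        if (stripos($line, 'File does not exist') === false) {
            continue; // some other kind of error
        }
        if (preg_match('/\[client (\d{1,3}(?:\.\d{1,3}){3})(?::\d+)?\]/', $line, $m)) {
            $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
        }
    }
    arsort($counts); // sort by count, highest first, keeping the IP keys
    return $counts;
}

// Example use (path is a placeholder):
// $hits = count_missing_file_hits(file('/path/to/error_log', FILE_IGNORE_NEW_LINES));
```

Any IP near the top of the list with dozens of misses is worth looking up before you decide whether to add it to the ban list.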

Legitimate bots such as those from Google, Yahoo & MSN may gobble up a lot of pages. You can request that they scan your site less often, or use a slower crawl rate. With Google this can be done via their Webmaster Tools. Yahoo offers something similar in its Site Explorer.

17 thoughts on “Build Your Own Bot Trap”

  1. moci

    I found this error message

    Parse error: syntax error, unexpected ‘:’ in

    when I tried to open the file on my browser.

  2. jeff

    looks like the code is missing something starting the php <?php … I am getting several errors. Trying to compare to the original code from seven-3-five blog site but still not getting it to work.

    Can you validate what is displayed above?

  3. Karl

    And what can we do for a bot trap on a site built on a Microsoft OS that lacks .htaccess?

  4. jmc

    Does it work with an AOM install in a sub-directory, where there is one .htaccess in the main directory and one in the AOM install?

  5. mcarp555

    Karl: Unfortunately I do not have any answer for you. The person who originated the bot trap (credited in the post) only provided a version for Linux servers.

    jmc: It should. You might need to experiment a bit to find the right setup for your server, but I don’t see any reason why it wouldn’t work.

  6. Works like a charm. Since I am comfortable with editing the “.htaccess” file, I was able to run a test and verify that it was banning IPs that accessed this file.

  7. trolldude

    for first try maybe you should add error handling 🙂
    can be replaced if you found the errors.

    function add_badbot($text, $file_name) {
        if (is_writable($file_name)) {
            // open for appending so the existing rules are preserved
            if (!$handle = fopen($file_name, "a")) {
                print "Can't open $file_name";
                exit;
            }
            // report success only if the write actually went through
            if (fwrite($handle, $text) === false) {
                print "Can't write to $file_name";
            } else {
                print "Finished adding to file $file_name the following bot IP: $text";
            }
            fclose($handle);
        } else {
            print "The file $file_name isn't writable";
        }
    }

  8. […] Installing a bot trap can help cut down on the waste, freeing up your site for humans to shop at. It’s not a magic bullet, but more and more it becomes a vital tool to regain control of your CPU usage. Information on how to set up a bot trap can be found here. […]

  9. Hendra

    I put this: jangle-tacks in site head, but the link is not hidden (customers can see this link)

    I tried to put in disclaimer, also is not hidden, customers still can see the link.

    I haven’t tried site header or footer, but I am not using site header or footer. Can I put jangle-tacks in there even though I am not using header or footer?

    Or can I change the word: jangle-tacks to dot (.) or a blank space?

    Please help…

    Thanks…

  10. mcarp555

    If you’re not using a custom header or footer, don’t use the ‘Site Header’ or ‘Site Footer’ areas.

    Try this:

    < div style="display:none;" >< a href="YOUR LINK TO THE BAD-BOTS FILE" >whatever< /a >< /div >

    Make sure you use the correct URL to your bad-bots file, and remove the spaces I added around the ‘greater than’ and ‘less than’ brackets.

  11. I never seem to have this bad-bots problem on my AOM site, but I have some WordPress sites running on a Linux server that I have had some bad-bots problems with. Will this code work on WordPress too?

  12. Yes; there’s nothing in the script that’s particular to AOM. It can be used for any site.

  13. The file is not working: I try to open the bot trap page and do not get banned. Do I need to chmod the .htaccess file to 755?

  14. If you’ve set up the trap, then go to the page itself, you ban yourself. You’ll have to remove your own IP from the .htaccess file.

  15. Ian Shere

    I tried this on myself and it worked the first time. However, it refuses to now work again.

    I re-downloaded the bad-bots file and FTPd it to the site. Still nothing. Can’t see it’d be due to me accessing it from the same IP, but that’s all I can think of.

  16. Did you delete your IP from the .htaccess file? If so, how long ago?
