With Amazon now imposing usage restrictions on their API data feed, it is more important than ever to make sure you’re not wasting resources on unwanted visitors to your sites. By ‘unwanted’ we are, of course, referring to search engine spiders that comb your site, index hundreds (if not thousands) of pages, and give you nothing in return – no page rank, no shoppers, no income. As well as eating up your product request allocations, bad spiders (also known as bots) waste your bandwidth, eat into your allotted CPU time, and could force your site to go dark, leaving you with no chance of making sales.
This tutorial will outline a simple method to create a ‘Bot Trap’, which will allow your sites to automatically ban bad bots. This trap was originally described here, but has been modified to ignore the most popular (and useful) bots, such as Google, Yahoo and MSN/Bing.
The basic outline is this: When a bot lands on your site, it’s supposed to read the robots.txt file to learn what it can and cannot index. It then goes through your site, spidering the links it finds. If a bot ignores the directives in the robots.txt file, it eventually finds a ‘trap’ file that records the bot’s IP address and adds it to a banned list in your .htaccess file, which then denies the bot access to your site. A spider that obeys the robots.txt file never requests the trap file, and is therefore not banned.
In order for this to work, you will need the following files in your site directory:
- robots.txt – The file that contains the list of files and/or directories you do not want scanned. You can exclude some or all bots from some or all files. A basic overview of this file can be read here.
- .htaccess – This file will hold the list of banned IPs. It’s the same file used by AOM for the mod_rewrite rulesets, as well as the DirectoryIndex directive. This file is only used by Apache web servers, which are typical on Linux/Unix hosts.
- bad-bots.php – You can get a copy of this file below. It’s the ‘trap’ file that will extract the IP from the bot and pass it on to the .htaccess file. It’s often called a honeypot file.
The code for the bad-bots.php file is available below. Download it, unzip it, and upload it to your site directory.
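If you want to see what such a trap does before installing it, here is a minimal sketch along the same lines. This is an illustration only, not the actual download – the variable names, the whitelist of search-engine user agents, and the `buildDenyLine` helper are assumptions:

```php
<?php
// Illustrative sketch of a bot-trap file (not the original bad-bots.php).
// It appends an Apache "Deny from" line for the visitor's IP address
// to the .htaccess file in the same directory.

// Build the deny line for a given IP (Apache 2.2 syntax).
function buildDenyLine($ip) {
    return 'Deny from ' . $ip . "\n";
}

if (isset($_SERVER['REMOTE_ADDR'])) {
    $ip      = $_SERVER['REMOTE_ADDR'];
    $agent   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $request = isset($_SERVER['REQUEST_URI'])     ? $_SERVER['REQUEST_URI']     : '';
    $referer = isset($_SERVER['HTTP_REFERER'])    ? $_SERVER['HTTP_REFERER']    : '';

    // Skip the big search engines so Google, Yahoo and MSN/Bing never get banned.
    if (!preg_match('/googlebot|slurp|bingbot|msnbot/i', $agent)) {
        $fh = fopen(__DIR__ . '/.htaccess', 'a');
        if ($fh !== false) {
            flock($fh, LOCK_EX);            // don't clobber concurrent writes
            fwrite($fh, buildDenyLine($ip));
            flock($fh, LOCK_UN);
            fclose($fh);
        }
    }
}
?>
```

Note that a bare `Deny from` line only takes effect if your .htaccess already contains an `Order Allow,Deny` / `Allow from all` section (Apache 2.2 syntax; Apache 2.4 uses `Require not ip` instead).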
If you want to send yourself an email every time the trap catches something, paste the following code into the file, just before the final ?>:
$subject = 'bad-bots';
$email = 'your_email@your_site.com'; //edit accordingly
$to = $email;
$message ='ip: ' . $ip . "\r\n" .
'user-agent string: ' . $agent . "\r\n" .
'requested url: ' . $request . "\r\n" .
'referer: ' . $referer . "\r\n"; // often is blank
$message = wordwrap($message, 70);
$headers = 'From: ' . $email . "\r\n" .
'Reply-To: ' . $email . "\r\n" .
'X-Mailer: PHP/' . phpversion();
mail($to, $subject, $message, $headers);
Make sure to change your_email@your_site.com to your actual email address.
Next, make sure you have a robots.txt file set up, telling all spiders to avoid the bad-bots.php file:
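Assuming bad-bots.php sits in your site’s root directory, the entry looks like this:

```
User-agent: *
Disallow: /bad-bots.php
```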
Since bad bots tend to ignore robots.txt, they will happily disregard this bit of information.
Finally, put a link to the bad-bots.php file somewhere in your site, such as the Site Head section under the Site tab of your AOM control panel:
You can put a link almost anywhere – the Disclaimer or Powered By areas, in a custom box, etc. You can also insert it into a custom header or footer file, if you use those. A good example of a link that the bots can find, but customers won’t, would be:
<a href="bad-bots.php" style="display: none;">jangle-tacks</a>
You should now have:
- The bad-bots.php file installed on your site
- A robots.txt entry, telling bots not to index the bad-bots.php file
- A customer-hidden link in your AOM site to the bad-bots.php file
At this point, the trap should be ready to go. Bad bots will ignore the warning, find the link and index it. The bad-bots file will add the IP address of the spider to the .htaccess file, with instructions to deny it access in the future. So the next time the bot comes to your site, it will not be able to enter. Over time, you may build up a substantial list, as many bots use more than one IP address. Eventually you should notice a decrease in bot traffic, as more and more of them wind up beating their heads against your .htaccess. If you can access your website error logs, you’ll start to see entries like “client denied by server configuration”. This means a possibly malicious spider has been blocked.
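For reference, the ban list that builds up in your .htaccess typically looks something like this (Apache 2.2 syntax; the IP addresses shown here are placeholders):

```
Order Allow,Deny
Allow from all
Deny from 203.0.113.45
Deny from 198.51.100.7
```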
Congratulations – your bot trap is operational.
A few final notes:
If you attempt to view the bad-bots.php file in your browser, you’ll probably wind up getting your own IP blocked. If so, you’ll need to go into your .htaccess file and remove your IP.
There are bots that do obey the robots.txt file and still manage to trash your server resources. You will need to find them (on sites using cPanel, AWStats can help identify the IPs of any bots viewing an excessive number of pages), look up each IP to determine where the bot came from, and then either contact the operator to ask that your sites be excluded, or block the bot yourself. Also check your Error Log for large numbers of ‘File does not exist’ entries from the same IP over and over again – this could be a bot probing for a way into your system. Some hosts offer an ‘IP Deny’ option in cPanel, or you can manually add the IP to the ban list in your .htaccess file.
Legitimate bots such as those from Google, Yahoo and MSN may gobble up a lot of pages. You can ask them to scan your site less often by setting a slower crawl rate. With Google this can be done via their Webmaster Tools; Yahoo has something similar in their Site Explorer.