As you know, there are many spiders that will not respect robots.txt directives, BoardReader and BoardTracker being the more obvious examples. Earlier this summer I noticed that a huge amount of my bandwidth was being taken by one particular bot, Magpie. After some research I found that many people were having issues with it: although they claim not to crawl your site if you exclude them in robots.txt, they will still crawl it like mad hyenas.
Code:
http://www.brandwatch.com/magpie-crawler/
If you check your server logs, you will probably notice their bots. Their main identifiers are Magpie and magpie-crawler/1.1, and they also identify with brandwatch.com and brandwatch.net. They're so bad that at one point they caused a mini-DDoS for about 30 minutes, and after detailed checks I was shocked to see they had consumed up to 32% of my total monthly bandwidth. I'm interested to know how you deal with these types of bots. Since they ignore robots.txt, I block them at the HTTP header level.
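For anyone who wants to do the same, here's a minimal sketch of that kind of header-level block in .htaccess, assuming Apache with mod_rewrite enabled; the patterns are just the identifiers mentioned above, so adjust to whatever shows up in your own logs.
Code:
# Return 403 to anything identifying as Magpie / Brandwatch in the User-Agent header
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (magpie-crawler|brandwatch) [NC]
    RewriteRule .* - [F,L]
</IfModule>
Any request whose User-Agent matches gets a 403 instead of the page, so they burn almost no bandwidth.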
@Code Monkey, do you mean blocking them at the root level based on their IPs? Can you expand a little on that, please?
Not really much to expand on. You use the Linux firewall and drop those IP ranges. After that they will start crawling you from other ranges, so you add those too. Then they will try a series of IPs they have in Holland, and unfortunately you have to drop those as well.
Code:
# e.g. with iptables, drop each offending range:
iptables -A INPUT -s 110.89.11.0/24 -j DROP
iptables -A INPUT -s 120.43.10.0/24 -j DROP
iptables -A INPUT -s 110.89.9.0/24 -j DROP
iptables -A INPUT -s 5.10.83.0/25 -j DROP
iptables -A INPUT -s 202.46.0.0/16 -j DROP
iptables -A INPUT -s 220.181.0.0/16 -j DROP
iptables -A INPUT -s 180.76.5.0/24 -j DROP
iptables -A INPUT -s 185.10.104.0/24 -j DROP
iptables -A INPUT -s 123.125.71.0/24 -j DROP
iptables -A INPUT -s 180.76.0.0/16 -j DROP
http://moz.com/ugc/blocking-bots-based-on-useragent
Also a viable option. It's very easy to block them via user agent, and it's a good way to block programs that are looking for exploits as well. There is also a list somewhere of all the programs, like forumrunner and such, that use modified user agents. But yeah, Code Monkey's idea is good too.
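For reference, here's a rough sketch of that user-agent approach, assuming Apache 2.4 and using the bot names already mentioned in this thread as placeholders; the mod_rewrite version shown earlier works just as well.
Code:
# Flag requests whose User-Agent matches a known bad bot, then refuse them
SetEnvIfNoCase User-Agent "magpie-crawler" bad_bot
SetEnvIfNoCase User-Agent "BoardReader" bad_bot
SetEnvIfNoCase User-Agent "BoardTracker" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
The same pattern extends to exploit scanners: add one SetEnvIfNoCase line per user-agent string you want to refuse.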
The idea of blocking IPs at the root level is fine, but that goes out of the window when such spiders change their IPs every 3-4 days. I found that, regardless of their IP or origin, when you block them based on their user-agent string you stand a better chance of 403-ing them. This month, despite all their attempts to crawl and the tons of 403s served to them, I noticed they barely got above 1% of overall bandwidth.
I have a tag-along question. Does the size of .htaccess (file size, number of lines, etc.) affect site/server performance? If I use around 100 lines to block various IPs, is that going to slow pages down?
That doesn't work on Baidu, as they will just start scraping you without a user agent, using IPs from China Telecom and such. Frankly, I'm just tempted to block China, period, since I don't serve that market anyway and I don't want to help them in their quest to conquer the world.