As you know, there are many spiders that will not respect robots.txt directives, BoardReader and BoardTracker being the more obvious examples. Earlier this summer I noticed that a huge amount of my bandwidth was being taken by one particular bot, Magpie. After some research I found that many people were having issues with it: although they claim not to crawl your site if you exclude them in robots.txt, they will still crawl it like mad hyenas.
Code:
http://www.brandwatch.com/magpie-crawler/
If you check your server logs, you will probably notice their bots. Their main identifiers are Magpie and magpie-crawler/1.1, and they also identify with brandwatch.com and brandwatch.net. They're so bad that at one point they caused a mini-DDoS for about 30 minutes, and after detailed checks I was shocked to see they had consumed up to 32% of my total monthly bandwidth. I'm interested to know how you deal with these types of bots. Since they ignore robots.txt, I block them at the HTTP header level.
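For anyone who wants to do the same, here's a minimal sketch of that kind of header-level block in .htaccess, assuming Apache with mod_rewrite enabled; the patterns are just the identifiers mentioned above, so adjust to whatever shows up in your own logs.
Code:
# Return 403 to anything identifying as Magpie / Brandwatch in the User-Agent header
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (magpie-crawler|brandwatch) [NC]
    RewriteRule .* - [F,L]
</IfModule>
Any request whose User-Agent matches gets a 403 instead of the page, so they burn almost no bandwidth.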
@Code Monkey, do you mean blocking them at the root level based on their IPs? Can you expand a little on that, please?
Not really much to expand on. You use the Linux firewall and drop those IP ranges. After that they will start crawling you from other ranges, so you add those too. Then they will try a series of IPs they have in Holland, and unfortunately you have to drop those as well.
Code:
# e.g. with iptables, drop each offending range:
iptables -A INPUT -s 110.89.11.0/24 -j DROP
iptables -A INPUT -s 120.43.10.0/24 -j DROP
iptables -A INPUT -s 110.89.9.0/24 -j DROP
iptables -A INPUT -s 5.10.83.0/25 -j DROP
iptables -A INPUT -s 202.46.0.0/16 -j DROP
iptables -A INPUT -s 220.181.0.0/16 -j DROP
iptables -A INPUT -s 180.76.5.0/24 -j DROP
iptables -A INPUT -s 185.10.104.0/24 -j DROP
iptables -A INPUT -s 123.125.71.0/24 -j DROP
iptables -A INPUT -s 180.76.0.0/16 -j DROP
http://moz.com/ugc/blocking-bots-based-on-useragent
Also a viable option. It's very easy to block them via user agent, and it's a good way to block programs that are looking for exploits as well. There is also a list somewhere of all the programs, like forumrunner and such, that use modified user agents. But yeah, Code Monkey's idea is good too.
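For reference, here's a rough sketch of that user-agent approach, assuming Apache 2.4 and using the bot names already mentioned in this thread as placeholders; the mod_rewrite version shown earlier works just as well.
Code:
# Flag requests whose User-Agent matches a known bad bot, then refuse them
SetEnvIfNoCase User-Agent "magpie-crawler" bad_bot
SetEnvIfNoCase User-Agent "BoardReader" bad_bot
SetEnvIfNoCase User-Agent "BoardTracker" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
The same pattern extends to exploit scanners: add one SetEnvIfNoCase line per user-agent string you want to refuse.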
The idea of blocking IPs at the root level is fine, but that goes out of the window when such spiders change their IPs every 3-4 days. I found that, regardless of their IP or origin, when you block them based on their user-agent string you stand a better chance of 403-ing them. This month, despite all their attempts to crawl and the tons of 403s served to them, I noticed they barely got above 1% of overall bandwidth.
I have a tag-along question. Does the size of .htaccess (file size, number of lines, etc.) affect site/server performance? If I use around 100 lines to block various IPs, is that going to slow pages down?
That doesn't work on Baidu, as they will just start scraping you without a user agent, using IPs from China Telecom and such. Frankly, I'm just tempted to block China, period, since I don't serve that market anyway and I don't want to help them in their quest to conquer the world.