Spiders that bypass robots.txt

Discussion in 'Managing Your Online Community' started by valdet, Oct 23, 2013.

  1. valdet

    valdet Regular Member

    As you know, there are many spiders that "will not respect" robots.txt directives, BoardReader and BoardTracker being the more obvious examples.

    Earlier this summer I noticed that a huge amount of my bandwidth was being taken by a certain bot, Magpie. After some research I found that many people were having issues with it: although they claim not to crawl your site if you block them in robots.txt, they will still crawl it like mad hyenas.
    Code:
    http://www.brandwatch.com/magpie-crawler/
    If you check your server logs, you will probably notice their bots.
    Their main identifiers are Magpie and magpie-crawler/1.1.
    They also identify with brandwatch.com and brandwatch.net.

    They're so bad that at one point they caused a mini DDoS for about 30 minutes, and after detailed checks I was shocked to see they had consumed up to 32% of my total monthly bandwidth.

    I'm interested to know how you deal with these types of bots.
    Since they ignore robots.txt, I block them at the HTTP header level.
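
    In practice that means matching their user-agent string and serving them a 403. Roughly along these lines in .htaccess (a minimal sketch, assuming Apache 2.2 with mod_setenvif; the patterns are just the identifiers mentioned above):
    Code:
        # Sketch: 403 anything identifying itself as Magpie / Brandwatch
        SetEnvIfNoCase User-Agent "magpie" bad_bot
        SetEnvIfNoCase User-Agent "brandwatch" bad_bot
        Order Allow,Deny
        Allow from all
        Deny from env=bad_bot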
     
  2. Code Monkey

    Code Monkey Regular Member

    Linux Firewall. Drop
     
  3. valdet

    valdet Regular Member

    @Code Monkey, do you mean blocking them at root level based on their IPs?
    Can you expand a little bit on that, please?
     
  4. Code Monkey

    Code Monkey Regular Member

    Not really much to expand. You use the Linux firewall and drop those IP ranges. After that they will start crawling you from other ranges, so then you add those. Then they will try a series of IPs they have in Holland, and unfortunately you have to drop those as well.

    Code:
        Drop    If source is 110.89.11.0/24       
        Drop    If source is 120.43.10.0/24       
        Drop    If source is 110.89.9.0/24       
        Drop    If source is 5.10.83.0/25   
        Drop    If source is 202.46.0.0/16       
        Drop    If source is 220.181.0.0/16       
        Drop    If source is 180.76.5.0/24       
        Drop    If source is 185.10.104.0/24       
        Drop    If source is 123.125.71.0/24       
        Drop    If source is 180.76.0.0/16        
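
    If you are doing it by hand with plain iptables rather than a firewall panel, the equivalent is just one DROP rule per range, roughly like this (a sketch only, using the same ranges as above):
    Code:
        # Drop the ranges above at the firewall (run as root)
        for net in 110.89.11.0/24 120.43.10.0/24 110.89.9.0/24 5.10.83.0/25 \
                   202.46.0.0/16 220.181.0.0/16 180.76.5.0/24 185.10.104.0/24 \
                   123.125.71.0/24 180.76.0.0/16; do
            iptables -A INPUT -s "$net" -j DROP
        done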
     
  5. Cerberus

    Cerberus Admin Talk Staff

    http://moz.com/ugc/blocking-bots-based-on-useragent

    Also a viable option. It's very easy to block them via user agent, and this is also a good way to block programs that are looking for exploits. There is a list somewhere of all the programs, like Forum Runner and such, that use modified user agents. But yeah, Code Monkey's idea is good too :)
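
    The mod_rewrite flavour of that looks roughly like this (a sketch only - the agent list here is just illustrative, adjust it to whatever shows up in your logs):
    Code:
        # Sketch: return 403 to matching user agents via mod_rewrite
        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} (magpie|brandwatch|boardreader|boardtracker) [NC]
        RewriteRule .* - [F,L]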
     
  6. valdet

    valdet Regular Member

    The idea of blocking IPs at root level is fine, but it goes out of the window when such spiders change their IPs every 3-4 days.

    I found that regardless of their IP or origin, when you block them based on their user-agent string you stand a better chance of 403-ing them.

    This month, with all their attempts to crawl and the tons of 403s served to them, I noticed they barely went above 1% of overall bandwidth.
     
  7. Caddyman

    Caddyman engiwebmastechanic

    I have a tag-along question.

    Does the size of .htaccess (file size, number of lines, etc.) affect site/server performance?

    If I use, say, 100 lines to block various IPs, is that going to slow pages down?
     
  8. Code Monkey

    Code Monkey Regular Member

    That doesn't work on Baidu, as they will just start scraping you without a user agent, using IPs from China Telecom and such. Frankly, I am just tempted to block China, period, since I don't serve them anyway and I don't want to help them in their quest to conquer the world.
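
    If anyone does go that far, one way to avoid thousands of individual firewall rules is an ipset. A rough sketch (the two CIDRs are just examples lifted from the list above - a real setup would load a full country list):
    Code:
        # Sketch: block a large set of ranges efficiently with ipset
        ipset create blocked_ranges hash:net
        ipset add blocked_ranges 202.46.0.0/16
        ipset add blocked_ranges 220.181.0.0/16
        iptables -I INPUT -m set --match-set blocked_ranges src -j DROP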
     
  9. Code Monkey

    Code Monkey Regular Member

    If you have access to Apache, then it's best to put them in your site's conf file.
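
    The reason is that .htaccess gets re-read on every request, while the vhost config is parsed once when Apache starts. Something along these lines (a sketch - the domain, paths and agent patterns are placeholders):
    Code:
        # Sketch: same user-agent block, but inside the vhost config
        <VirtualHost *:80>
            ServerName example.com
            DocumentRoot /var/www/example.com

            SetEnvIfNoCase User-Agent "magpie|brandwatch" bad_bot
            <Directory /var/www/example.com>
                Order Allow,Deny
                Allow from all
                Deny from env=bad_bot
            </Directory>
        </VirtualHost>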
     
