
16 Apr 2009

MSNbot 2.0b is ignoring robots.txt and No Index meta tags

By Chewie. Posted in SEO/SEM | 24 Comments »

Back in December 2008, Microsoft Live Search announced that it would be releasing its new spider/crawler into the wild to crawl all those lovely websites out there. Now, four months on, it seems that the MSNbot is being very naughty and completely disregarding robots.txt and noindex meta tags. Even worse, it could be crawling your site based on the robots.txt of a completely different domain!


So what exactly is going on? It seems that the problem first started in February 2009, when some users on WebmasterWorld noticed that the new MSNbot had been fetching their robots.txt files but not obeying the rules, grabbing pages which had been excluded. Discussion ensued, with people wondering if this was just some crawler spoofing MSNbot, but it turns out that it was the real MSNbot. So why would it be completely disregarding the robots.txt?

Well, another discussion over at Webmaster Talk confirmed that MSNbot was definitely disregarding the robots.txt instructions. In fact, one member posted the following information…

65.55.106.115 - [01.11] "GET /robots.txt "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.115 - [01.11] "GET /about.php "msnbot/2.0b"
65.55.106.172 - [01.16] "GET /forbidden/ "msnbot/2.0b"

Now, for the non-technical out there, the above is basically three lines from a log file. They show that MSNbot came to the site from the IP 65.55.106.115 and read the robots.txt file, then requested the about.php page and left. However, shortly after, MSNbot came back from a different IP address (this time 65.55.106.172) and tried to crawl the /forbidden directory. What’s weird here is that the /forbidden directory is apparently not linked to from anywhere, so the only way the bot would know it existed is by reading the robots.txt file and then disregarding it. It might cross your mind that this is all a coincidence and that someone masquerading as MSNbot came along shortly after and tried to access /forbidden. However, both IP addresses belong to Microsoft.
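The same reading of those three log lines can be sketched in Python. This is only a rough illustration, not the forum member's own tooling: the robots.txt content is an assumption (the post only tells us that /forbidden was excluded), and the regular expression is written purely for the abbreviated log format quoted above.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the post only says that /forbidden was excluded.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /forbidden/
"""

# The three abbreviated log lines quoted in the post.
LOG_LINES = [
    '65.55.106.115 - [01.11] "GET /robots.txt "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '65.55.106.115 - [01.11] "GET /about.php "msnbot/2.0b"',
    '65.55.106.172 - [01.16] "GET /forbidden/ "msnbot/2.0b"',
]

# Pattern for this abbreviated format only: ip - [time] "GET path "user-agent"
LINE_RE = re.compile(r'^(\S+) - \[([^\]]+)\] "GET (\S+) "([^"]+)"$')


def find_violations(log_lines, robots_txt):
    """Return (ip, path) pairs for requests that the robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ip, _time, path, user_agent = match.groups()
        if not parser.can_fetch(user_agent, path):
            violations.append((ip, path))
    return violations


violations = find_violations(LOG_LINES, ASSUMED_ROBOTS_TXT)
print(violations)
```

Under that assumed robots.txt, only the request for /forbidden/ from 65.55.106.172 is flagged, which is the request the forum member was worried about.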

As I said earlier, it seems a bit strange that Microsoft would start to ignore robots.txt files. After digging deeper, it seems there is a bug in the new MSNbot which means that it is actually reading the robots.txt on a completely different domain and then trying to spider your site. Here is an example request from the spider…

GET /robots.txt HTTP/1.1
Accept: */*
Host: www.lumigan.com
User-Agent: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
Connection: Keep-Alive
Cache-Control: no-cache
Pragma: no-cache

In this instance, the spider thinks that it is crawling www.lumigan.com but is in fact crawling a completely different website, thus disregarding its robots.txt and indexing pages that shouldn’t be indexed. It’s at this point that Microsoft seemed to get wind of it and stated that they were looking into the problem.


The final piece of the puzzle comes from a post on one of Microsoft’s own social boards, where a user basically confirms what everyone else has been speculating…

For some reason, msnbot/2.0b is visiting the wrong IP addresses to retrieve robots.txt. In other words, it THINKS it is getting robots.txt for www.yoursite.com, but it is really reading the robots.txt file that is served for the default host at the IP address for www.mysite.com (not necessarily www.mysite.com’s robots.txt). Clearly, msnbot/2.0b is using the wrong DNS lookup for its requests.

So, we get confirmation that MSNbot is using the wrong DNS lookup for its requests and, as such, is definitely crawling sites based on the wrong robots.txt information. This is very concerning, since areas of your website that you specifically do not want crawled are being crawled and could end up in the Live SERPs.
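To make the failure mode a bit more concrete, here is a small simulation in Python of what a wrong DNS answer does on a server using name-based virtual hosts. Everything in it is made up for illustration (the IP addresses, the robots.txt bodies and the fetch_robots helper), and it is only a guess at the mechanism described in the forum post, not a description of how MSNbot actually works internally.

```python
# Hypothetical servers, each with a default host and its name-based virtual hosts.
# All IPs and robots.txt bodies are invented for this illustration.
SERVERS = {
    "192.0.2.10": {
        "default": "www.mysite.com",
        "vhosts": {"www.mysite.com": "User-agent: *\nDisallow:\n"},
    },
    "192.0.2.20": {
        "default": "www.yoursite.com",
        "vhosts": {"www.yoursite.com": "User-agent: *\nDisallow: /private/\n"},
    },
}

# A correct lookup table, and a broken one that sends www.yoursite.com to the wrong IP.
CORRECT_DNS = {"www.mysite.com": "192.0.2.10", "www.yoursite.com": "192.0.2.20"}
BROKEN_DNS = {"www.mysite.com": "192.0.2.10", "www.yoursite.com": "192.0.2.10"}


def fetch_robots(hostname, dns):
    """Simulate GET /robots.txt: resolve the name, then let the server pick a vhost.

    If the Host header names a site the server does not know, the server
    falls back to its default host, as name-based virtual hosting does.
    """
    server = SERVERS[dns[hostname]]
    vhosts = server["vhosts"]
    served_host = hostname if hostname in vhosts else server["default"]
    return served_host, vhosts[served_host]


good_host, good_rules = fetch_robots("www.yoursite.com", CORRECT_DNS)
bad_host, bad_rules = fetch_robots("www.yoursite.com", BROKEN_DNS)
print(good_host, repr(good_rules))
print(bad_host, repr(bad_rules))
```

With the broken lookup, the crawler asks for www.yoursite.com but gets the default host's robots.txt, which in this made-up case allows everything, so /private/ would look fair game.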

Thankfully, Brett from MSN yet again confirms that they are aware of the problem and are trying to fix it. The problem is, no one seems to know when the fix will be complete, or whether the data gathered in the past four months has already been used in the SERPs.

If you want to check whether your site has been affected, then I offer you the following advice from the above forum post…

Search your web log for requests from msnbot/2.0b. Do you see requests for links that don’t exist on your site? That’s because they exist on a different site, the one msnbot/2.0b THINKS it’s crawling. If you log the requested server name, do you see unfamiliar hosts? Those are the ones msnbot/2.0b THINKS it’s visiting.
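That advice could be turned into a quick script along these lines. Treat it as a sketch under assumptions: the log layout (requested host name first, followed by a combined-style entry), the KNOWN_HOSTS set and the sample entries are all invented here, so the regular expression will need adjusting to whatever your server actually writes.

```python
import re

# Host names this server is really meant to answer for (hypothetical).
KNOWN_HOSTS = {"www.example.org", "example.org"}

# Assumed layout: requested host, then a combined-style log entry:
# host ip - - [time] "METHOD path PROTO" status bytes "referer" "user-agent"
ENTRY_RE = re.compile(
    r'^(?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def suspicious_msnbot_requests(lines, known_hosts=KNOWN_HOSTS):
    """Return (host, path, status) for msnbot/2.0b hits on unknown hosts or missing pages."""
    hits = []
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if m is None or "msnbot/2.0b" not in m["agent"].lower():
            continue
        unfamiliar_host = m["host"].lower() not in known_hosts
        missing_page = m["status"] == "404"
        if unfamiliar_host or missing_page:
            hits.append((m["host"], m["path"], m["status"]))
    return hits


# Invented sample entries, only to show the function working.
UA = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
SAMPLE = [
    f'www.example.org 65.55.106.115 - - [16/Apr/2009:10:00:00 +0000] "GET /about.php HTTP/1.1" 200 512 "-" "{UA}"',
    f'www.lumigan.com 65.55.106.172 - - [16/Apr/2009:10:05:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "{UA}"',
    f'www.example.org 65.55.106.172 - - [16/Apr/2009:10:06:00 +0000] "GET /no-such-page.html HTTP/1.1" 404 210 "-" "{UA}"',
    'www.example.org 203.0.113.7 - - [16/Apr/2009:10:07:00 +0000] "GET /missing HTTP/1.1" 404 210 "-" "Mozilla/5.0"',
]

hits = suspicious_msnbot_requests(SAMPLE)
for hit in hits:
    print(hit)
```

To run it over a real file you would pass the open file object instead of SAMPLE, for example suspicious_msnbot_requests(open("access.log")), where access.log stands in for your own log path.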

You could also just outright ban the MSNbot using .htaccess lines similar to the following…

RewriteEngine On
# Match on the User-Agent header; the bot name never appears in the referer
RewriteCond %{HTTP_USER_AGENT} ^msnbot/2\.0b [NC]
# Return a 403 Forbidden response and no content
RewriteRule .* - [F,L]
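
If you want to see which visitors a pattern like ^msnbot/2\.0b would catch before putting it live, a quick check in Python might look like this. It assumes the pattern is tested against the User-Agent request header, uses Python's re module as a stand-in for Apache's regex engine (the two agree for a pattern this simple), and the would_be_blocked helper is just a name made up for the example.

```python
import re

# Python stand-in for the RewriteCond pattern; re.IGNORECASE plays the role of [NC].
MSNBOT_2_0B = re.compile(r"^msnbot/2\.0b", re.IGNORECASE)


def would_be_blocked(user_agent):
    """True if the User-Agent string starts with msnbot/2.0b (case-insensitive)."""
    return MSNBOT_2_0B.search(user_agent) is not None


candidates = [
    "msnbot/2.0b (+http://search.msn.com/msnbot.htm)",
    "MSNBot/2.0b",
    "msnbot/1.1 (+http://search.msn.com/msnbot.htm)",
    "Mozilla/5.0 (compatible; msnbot/2.0b)",
]

for agent in candidates:
    print(would_be_blocked(agent), agent)
```

Because of the leading ^, only agents that begin with msnbot/2.0b are caught, so other msnbot versions and browsers that merely mention the name are left alone.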

Hopefully Microsoft can get this issue resolved soon.

About the Author:

Chewie is from a mysterious part of the United Kingdom called Up Norf. He has been working in web development and SEO for over ten years, beginning as a developer and moving to SEO in search of the perfect rank. As well as SEO he likes football, beer, girls and gravy - often at the same time.

24 Responses to “MSNbot 2.0b is ignoring robots.txt and No Index meta tags”

  1. 1
    phaithful Says:

    “Clearly, msnbot/2.0b is using the wrong DNS lookup for its requests.”

    What kind of DNS server completely mis-maps the hostname and the IP address? msnbot/2.0b might be hitting a development DNS server, but that’s just a strange practice.

  2. 2
    MSNbot 2.0b – ignoriert Robots.txt und noindex tag Says:

    […] MSNbot 2.0b is ignoring robots.txt and No Index meta tags | Chewie.co.uk – Now with 100% less Wookie… Tags: Live Search, Meta Tags, MSNbot, Robots.txt Microsoft, Windows Live RSS-Feed Trackback […]

  3. 3
    Chewie Says:

    phaithful: Yeah, agreed, it is very strange. The thing is, no one will know what’s really going on until MS tell us. I don’t expect that to happen for some time, if ever.

  4. 4
    richardbaxterseo Says:

    Nice post dean, I’m tempted to go look at my server logs to see what the score is over at ‘gadget. Speak soon!

  5. 5
    Yura Says:

    I was actually going to ban MSN with robots.txt for spoiling my keyword traffic reports (by doing cloaking checks), but guess I’ll have to use .htaccess. Thanks for the tip.

  6. 6
    Chewie Says:

    Richard: Hey man, thanks for the kind words. Let me know what you find in your logs.

    Yura: Well, however you want to do it is fine. I showed the .htaccess example since it is a quick and easy way to do it.

  7. 7
    MSNbot using up large amounts of bandwidth Says:

    […] and having read an article last week about the same robot apparently ignoring robots.txt files (MSNbot 2.0b is ignoring robots.txt and No Index meta tags) I went looking in the raw […]

  8. 8
    JohnQPublic Says:

    Better yet. I’ve noticed MSNbot tries to spider domains without a WWW entry in DNS or an HTTP server online. Thus I see msnbot trying to spider the company’s SMTP server.

    I will give MSNBot this… it’s a persistent little pest.

  9. 9
    Chewie Says:

    John: I had no idea that it was also spidering non-www hosts. Jeez, they have more problems than I thought.

  10. 10
    Graphic Designer Says:

    Naughty MSN Bot, bad boy!!!

  11. 11
    thorvaldaagaard Says:

    Have you noticed a problem with the Microsoft bot following a link like http://www.yoursite.con/default.php#someplacein?
    A browser will strip off the name tag and request the document as http://www.yoursite.con/default.php, but the bot requests the entire URL, resulting in a 404.

  12. 12
    H. Van Droogenbroeck Says:

    I have the same problem. I will try blocking the bot with your lines in .htaccess, and later I will tell you the results.

    thanks for sharing.
    Nacho

  13. 13
    H. Van Droogenbroeck Says:

    I have the same problem. I will try your code in .htaccess. Tomorrow I will tell you about the results.

    Sorry for my bad English. Thanks for sharing.
    Nacho

  14. 14
    Msnbot is bonkers! Says:

    […] not want to be crawled are being crawled! http://social.microsoft.com/Forums/e…f-a7dda16d8a15 http://www.chewie.co.uk/seosem/msnbo…dex-meta-tags/ Happy IDNet broadband and phone customer Ubuntu and Quirky Linux * Opera browser Reply […]

  15. 15
    Facebook Applications Says:

    Interesting to read this because before this i didn’t know this thanks

  16. 16
    Facebook Game Company Says:

    Interesting to read this because before this i didn’t know this post.

  17. 17
    iPhone Application Developer Says:

    Bad search engine for SEO point of view

  18. 18
    Facebook Developer Says:

    Why does MSN not respect robots.txt and meta tags? Is there any restriction in MSN or Bing? The robots.txt file is so important for any search engine.

  19. 19
    SEO Dublin Says:

    Discovered MSN webcrawler is ignoring our robots.txt and has been for nearly a year.
    The time has gone for MSN, it's the retirement time for them.

  20. 20
    pressure cookers for canning Says:

    Yeaaaahhhh congratulations, what a great moment for you, so glad to have been a part of this.Keep writing continue. 

  21. 21
    fagor pressure cookers Says:

    Wow!hmm. . .Nice site, and a nice blog design. Oh, and very useful information as well.

  22. 22
    fagor pressure cookers Says:

    Really funny post, I think it’s safe to say that the same goes on with Facebooks status updates as well! People now think because you can post your every move that sharing TMI is ok…well it’s not!!!

  23. 23
    Software & Web App Development Says:

     Of,course! As well know, your article is very attractive,so i got a good knowledge as wells as nice entertainment from your blog. Thanks for share it.

  24. 24
    maid service Chicago Says:

    Ya!
    I liked it and enjoyed reading it. Keep sharing such important posts. I really appreciated this post.
