MSNbot 2.0b is ignoring robots.txt and No Index meta tags

Back in December 2008 Microsoft Live Search announced that it would be releasing its new spider/crawler into the wild to crawl all those lovely websites out there. Now, 4 months on, it seems that MSNbot is being very naughty and completely disregarding robots.txt and noindex meta tags. Even worse, it could be crawling your site based on the robots.txt of a completely different domain!

So what exactly is going on? It seems that the problem first started in February 2009, when some users on WebmasterWorld noticed that the new MSNbot had been hitting their robots.txt files but not obeying the rules, and was grabbing pages which had been excluded. Discussion ensued, with people wondering if this was just some crawler spoofing MSNbot, but it turns out that it was the real MSNbot. So why would it be completely disregarding robots.txt?

Well, another discussion over at Webmaster Talk confirmed that MSNbot was definitely disregarding the robots.txt instructions. In fact, one member posted the following information…

65.55.106.115 - [01.11] "GET /robots.txt "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.115 - [01.11] "GET /about.php "msnbot/2.0b"
65.55.106.172 - [01.16] "GET /forbidden/ "msnbot/2.0b"

Now for the non-technical out there, the above is basically three lines from a log file. They show that MSNbot came to the site from the IP 65.55.106.115 and read the robots.txt file; the bot then requested the about.php page and left. However, shortly after, MSNbot came back from a different IP address (this time 65.55.106.172) and tried to crawl the /forbidden directory. What’s weird here is that the /forbidden directory is apparently not linked to from anywhere, so the only way the bot would know it existed is by reading the robots.txt file and then disregarding it. It might cross your mind that this is all a coincidence and that someone masquerading as MSNbot came along shortly after and tried to access /forbidden; however, both IP addresses belong to Microsoft.
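To make the log excerpt a bit more concrete, here is a minimal Python sketch that replays those three requests against a robots.txt and flags any request that the rules forbid. The robots.txt content is an assumption (the forum post didn’t include the actual file), and the log lines are reduced to simple tuples rather than parsed from a real log; the rule checking is done by Python’s standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a hypothetical robots.txt for the site in the log excerpt.
# The original forum post did not show the real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forbidden/
"""

# (ip, path, user agent) tuples mirroring the three log lines quoted above.
REQUESTS = [
    ("65.55.106.115", "/robots.txt", "msnbot/2.0b"),
    ("65.55.106.115", "/about.php", "msnbot/2.0b"),
    ("65.55.106.172", "/forbidden/", "msnbot/2.0b"),
]


def disallowed_requests(robots_txt, requests):
    """Return the (ip, path) pairs that robots.txt forbids for that agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (ip, path)
        for ip, path, agent in requests
        if not parser.can_fetch(agent, path)
    ]


violations = disallowed_requests(ROBOTS_TXT, REQUESTS)
for ip, path in violations:
    print(f"{ip} requested {path}, which robots.txt disallows")
```

Under those assumptions only the third request is flagged, which is the same pattern the forum member spotted by eye.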

As I said earlier, it seems a bit strange that Microsoft would start to ignore robots.txt files. After digging deeper, it looks like there is a bug in the new MSNbot which means that it is actually reading the robots.txt on a completely different domain and then trying to spider your site. Here is an example request from the spider…

GET /robots.txt HTTP/1.1
Accept: */*
Host: www.lumigan.com
User-Agent: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
Connection: Keep-Alive
Cache-Control: no-cache
Pragma: no-cache

In this instance, the spider thinks that it is crawling www.lumigan.com but is in fact crawling a completely different website, so it disregards that site’s robots.txt and indexes pages that shouldn’t be indexed. It’s at this point that Microsoft seemed to get wind of it and stated that they were looking into the problem.

The final piece of the puzzle comes from a post on one of Microsoft’s own social boards, where a user basically confirms what everyone else has been speculating…

For some reason, msnbot/2.0b is visiting the wrong IP addresses to retrieve robots.txt. In other words, it THINKS it is getting robots.txt for www.yoursite.com, but it is really reading the robots.txt file that is served for the default host at the IP address for www.mysite.com (not necessarily www.mysite.com’s robots.txt). Clearly, msnbot/2.0b is using the wrong DNS lookup for its requests.

So, we get confirmation that MSNbot is using the wrong DNS lookup for its requests and as such is definitely crawling sites based on the wrong robots.txt information. This is very concerning, since areas of your website that you specifically do not want to be crawled are being crawled and could end up being placed into the Live SERPs.

Thankfully Brett from MSN yet again confirms that they are aware of the problem and are trying to fix it. The problem is, no one seems to know when the fix will be complete, or whether the data they have gathered in the past 4 months has already been used in the SERPs.

If you want to check to see if your site has been affected then I offer you the following advice from the above forum post…

Search your web log for requests from msnbot/2.0b. Do you see requests for links that don’t exist on your site? That’s because they exist on a different site, the one msnbot/2.0b THINKS it’s crawling. If you log the requested server name, do you see unfamiliar hosts? Those are the ones msnbot/2.0b THINKS it’s visiting.
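If you do log the requested server name, that check is easy enough to script. The following is only a rough Python sketch: the log layout (requested host, client IP, request line, user agent), the MY_HOSTS set and the sample lines are all made up for illustration, so you would need to adjust the regular expression to match whatever your own server actually writes.

```python
import re

# ASSUMPTION: a hypothetical log layout of
#   <requested host> <client ip> "<request line>" "<user agent>"
# Real log formats differ, so adapt this pattern to your own logs.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ip>\S+) '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'"(?P<agent>[^"]*)"$'
)

# ASSUMPTION: the host names your server legitimately answers for.
MY_HOSTS = {"www.example.com", "example.com"}


def unfamiliar_msnbot_hosts(lines, my_hosts):
    """Return hosts that msnbot/2.0b asked for but which are not ours."""
    unfamiliar = set()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that don't fit the assumed layout
        if not match.group("agent").lower().startswith("msnbot/2.0b"):
            continue  # only interested in the new MSNbot
        host = match.group("host").lower()
        if host not in my_hosts:
            unfamiliar.add(host)
    return unfamiliar


# Made-up sample lines, purely to show the shape of the input.
SAMPLE = [
    'www.example.com 65.55.106.115 "GET /robots.txt HTTP/1.1" '
    '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    'www.lumigan.com 65.55.106.172 "GET /robots.txt HTTP/1.1" '
    '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    'www.lumigan.com 203.0.113.9 "GET / HTTP/1.1" "Mozilla/5.0"',
]

suspect_hosts = unfamiliar_msnbot_hosts(SAMPLE, MY_HOSTS)
print(sorted(suspect_hosts))
```

Any host name that turns up here is one that msnbot/2.0b believed it was visiting when it actually hit your server.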

You could also just outright ban MSNbot using .htaccess lines similar to the following…

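# Returns a 403 Forbidden response and no content to msnbot/2.0b.
# The bot identifies itself in the User-Agent header, not the Referer.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^msnbot/2\.0b [NC]
RewriteRule .* - [F,L]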

Hopefully Microsoft can get this issue resolved soon.

Using quote pages to get (sort of) contextual links

Look ma, Hookers

I was being all geeky and looking up some movie quotes to send to a friend, ones that would help him sort out his messed up life (Hey Danny *waves*), and I thought of a little idea which might be helpful in trying to score some links back to your site.

You can get moderately contextual links from pages which list quotes from favourite shows, lyrics or movies.

Let me explain. Let’s say you run a small cell phone wallpapers/ringtones website and you are struggling to get people to link to you without paying through the nose for advertising. You could negotiate for a while to try and get the price down, or you could contact the many millions of sites out there which have a page on the movie ‘Phone Booth’. Crap movie, but somewhat related to ‘telephones’, no? Now, the site owners probably have site-wide ads on their quote site but are probably not too bothered about the actual content of their quote pages.

You could just send them a quick email asking for a link to your site in the quote that you like, e.g.

The Caller: Think about it. Why would a guy with a cell phone call a woman every day from a phone booth?
Pamela McFadden: He said it was quiet.
The Caller: Pam, that’s just stupid.

So then you could ask the site owner if (s)he could link to you with the anchor text of “Cell phone”. Now even though the movie quotes site isn’t about phones, there are a lot of contextual references to phone/phones in that one page.

Since the site owner doesn’t have to write unique content or mess about trying to slip your link into the page, you’re more than likely going to be able to get this link quite cheap. Unfortunately most quote pages have a low PageRank, such as 1, but they can certainly aid a new site in a niche market.

So next time you’re thinking of where to buy links from, make a list of all the shows, movies and songs that you think are in any way related to your search terms and hunt around quote pages. You’ll be surprised what you find 🙂

Yahoo gearing up to launch Google Analytics rival?

Earlier this year (the 9th of April to be exact) Yahoo issued a press release which explained that they had acquired the web analytics and tracking company IndexTools. Now, 6 months later, it seems that they might be looking to roll out that acquisition to the general public with Yahoo Web Analytics.

Whilst the Yahoo Web Analytics site is still plastered with ‘Coming Soon’ under almost every category, it does show some promise of being a genuine rival to Google’s own analytics reporting. According to Yahoo…

From the acquisition of IndexTools, Yahoo! Web Analytics is born!
Yahoo! Web Analytics is an enterprise site analytics tool that provides real-time insight into visitor behavior on your website. With powerful and flexible tools and dashboards, Yahoo! Web Analytics helps online marketers and website designers enhance the visitor experience, increase sales and reduce marketing costs.

Obviously there is tons of other marketing spiel on the website, but the part that caught my eye in particular is the ability to make custom designed reports to drill down on information that you specifically want to monitor. I have used GA for many a presentation to many a graph-hungry CEO, but I have always found that it can be lacking in some areas, such as fully custom reports. Hopefully Yahoo can fill this void.

At the moment there isn’t a great deal to go on because only previous IndexTools users can access the system, but could we be seeing a serious rival to Google Analytics? I guess time will tell (probably 2009 to be honest), and when it becomes open to the general public I’ll do a full review.

Antispore.com – Real Christian Crusade or Clever Marketing Ploy?

Unless you have been living under a rock (or simply have zero interest in video games whatsoever) you have probably heard about a little PC game called Spore. Amongst the hype and controversy over things such as the draconian DRM or the less than stellar reviews, there is a website which has popped up called Antispore.

At first glance it seems that Antispore is exactly what it says in the URL, a website that is against Spore. If you bother to read the about us page you will find the following info…

I created this blog to find support for and follow my progress in letting Electronic Arts know that their biggest attack on Christian values to date will not be tolerated.
We can not allow the gaming industry to invade our homes and poison the minds of our children.
After all, their billions in revenue and all the advertising in the world are no match for the power of God.

Along with the above info on the about us page there are also posts such as “Proof EA is converting children to believe in evolution” and “The evil man behind it all – Will Wright”, which make one think that this is some crazy Christian evangelist trying to get their voice heard regarding a game about evolution. However, I am not convinced that this is all there is to it.

If you read through all the posts, the writer has obviously got quite a good grip on the Bible, but the writing style is also somewhat jovial, as we can see in this particular post.

“21. The LORD smelled the pleasing aroma and said in his heart: “Never again will I curse the ground because of man, even though every inclination of his heart is evil from childhood. And never gonna give you up. 22. “Never gonna let you down.” 23.”Never gonna run around and desert you.” 24. “Never gonna make you cry.” 25. “Never gonna say goodbye.” 26. “Never gonna tell a lie and hurt you.” 27.”Never truly believe anything you read on the Internet. There will always be cases of Poe’s Law.”

As you can see there is a nice Rickroll (obvious Rickroll is obvious) hidden in the scriptures of the Lord, which I am pretty sure was not in the original Old Testament. This obviously gives the game away that the site isn’t entirely genuine about being against Spore. It seems to me that it has been set up by some marketer who is generating a huge amount of publicity, buzz, and backlinks to a site which just happens to have Google ads on it.

It seems the first post was put on the site on September 8th, just 3 days ago. If we take a look at Yahoo Site Explorer we can see that the site already has 321 backlinks pointing to it, including backlinks from websites such as Forbes, Joystiq, and G4TV, which are surely going to pass a ton of authority when Google catches up. On top of that you have people visiting the site from referrals in blog posts, imageboards, forums, chat rooms etc., which are generating around 300-odd comments on some posts.

I can’t be sure if the person behind the site is just having a joke to rile people up or is someone with an SEO/marketing background who realises that once they have enough links they can simply do the old bait and switch, start selling the game through an affiliate scheme and no doubt make an absolute ton of cash. They could even just 301 the entire site to another domain which is already set up to sell Spore.

Either way I think it is an exceptional idea: generate a lot of controversy, get a load of links to your site regarding a specific topic, and then switch the site to sell the very thing you were against.

I would love to hear other people’s opinions on this, so leave me a comment.

Orange want you to search for ‘I am’

… And then don’t naturally rank for it…. durr!

You have probably seen the adverts on TV: some guy decided to ride a bike around the world to break the world record, and tells us how he is the woman that knocked him off his bike, and he is his mother who gave him all the courage to do it, etc. etc. At the end of the commercial Orange tell you to search for ‘I am’ online, in the hope that you will then click through to their site… except you’re probably part of the 60.5% who won’t. Why, you ask? Well, because of this…

As you can see, Orange take up err NONE of the natural search results, and have a little tiny PPC ad at the top, just under the Google search bar. It seems quite strange to launch an entire campaign and then not even naturally rank for the term anywhere at all.

Chewie.co.uk – Now with 100% less Wookiee!