Support for x-robots-tag and robots HTML meta tag
As part of our research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn't seem to exist anywhere else on the Internet, so below are our findings, starting with the big guys, and moving towards more obscure or foreign search engines.
Google, Bing
Google (known as Googlebot) and Bing (known as Bingbot) support the x-robots-tag header and the robots HTML tag. Here's Google's page on the topic. And here's Bing's. The msnbot is retired.
Yahoo, AOL
Yahoo!'s search engine is provided by Bing. AOL's is provided by Google. These are easy ones.
Ask, Yandex, Nutch
Ask (known as teoma), and Yandex (Russia's search engine, known as yandex), support the robots meta tag, but do not appear to support the x-robots-tag. Ask's page on the topic is here, and Yandex's is here. The popular open source crawler, Nutch, also supports the robots HTML tag, but not the x-robots-tag header. Update: Newer versions of Nutch now support x-robots-tag!
The Internet Archive, Alexa
The Internet Archive uses Alexa's crawler, which is known as ia_archiver. This crawler does not seem to support either the HTML robots meta tag nor the x-robots-tag HTTP header. Their page on the subject is here. I have requested more information from them, and will update this page if I hear back.
Duckduckgo, Blekko, Baidu
Duckduckgo and Blekko do not support either the robots meta tag nor the x-robots-tag header, per emails I've had with each of them. I also requested information from Baidu, but their response totally ignored my question and was in Chinese. They do have some information here, but it does not seem to provide any information on the noindex value for the robots tag. In any case, the only way to block these crawlers seems to be via a robots.txt file.