10 common mistakes using robots.txt on your website

March 9th, 2009 // 7:44 am // 18 Comments

Robots.txt is a plain text file located in the root of a website. It allows the administrator of the website to define which content is allowed and which is disallowed for the bots that visit the site.

All major search engines, such as Google, Yahoo and MSN, follow the Robots Exclusion Protocol. There are several things every website owner needs to understand to make crawling of their website easier. The following are the top 10 common mistakes to avoid when creating a robots.txt file.

1. Placing robots.txt outside the root directory - This is one of the most common mistakes webmasters make. They upload the robots.txt file to the wrong place. It must reside in the root of the domain and must be named “robots.txt”. A robots.txt file uploaded to a subdirectory is not valid, since bots check for the robots.txt file only in the root of the domain name. A minimal file, which allows everything, looks like this:

User-agent: *
Disallow:
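As a rough sketch of the root-only rule, the snippet below derives the one location a bot would check for any page on a site. The function name robots_url and the example.com address are illustrative assumptions, not part of any tool mentioned here.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host of page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page deep inside the site still maps to the file in the root.
print(robots_url("http://www.example.com/blog/2009/03/post.html"))
# prints http://www.example.com/robots.txt
```

A file uploaded to /blog/robots.txt would never be requested under this rule.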

2. Wrong syntax in robots.txt - Another common mistake is using the wrong syntax when creating the robots.txt file. Always double check the file using a tool such as Robots.txt Checker.
Here is an example of a wrong entry:

User-agent: *
Disallow: private.html

The path above is missing its leading slash. Always start a file or directory name with a leading slash character (example: /private.html).
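The effect of the missing slash can be checked locally. The sketch below uses Python's standard urllib.robotparser as a stand-in for a real bot (an assumption: individual crawlers may parse differently), and the helper name is_allowed is made up for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, path):
    """Parse robots.txt lines and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, path)


wrong = ["User-agent: *", "Disallow: private.html"]   # no leading slash
right = ["User-agent: *", "Disallow: /private.html"]  # leading slash

print(is_allowed(wrong, "ExampleBot", "/private.html"))  # True: the rule never matches
print(is_allowed(right, "ExampleBot", "/private.html"))  # False: the page is blocked
```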

3. Adding comments at the end of a line instead of on their own line - If you wish to include comments in your robots.txt file, put each one on its own line and precede it with a # sign, like this:

# Here are my comments about this entry.
User-agent: *
Disallow:

4. An empty robots.txt file is almost like not having one - If you have created a robots.txt file in your root directory and there is nothing in it, it is similar to not having one. Because nothing is disallowed and no User-agent is given, everything is allowed for every bot.
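A quick way to see this is to feed an empty file to a parser. This sketch assumes Python's standard urllib.robotparser behaves like a typical bot here; the bot name and paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# An empty robots.txt: no User-agent lines and no Disallow lines.
parser = RobotFileParser()
parser.parse([])

# With no rules at all, every path is allowed for every user agent.
results = [
    parser.can_fetch("ExampleBot", "/"),
    parser.can_fetch("ExampleBot", "/private/report.html"),
]
print(results)  # [True, True]
```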

5. Blocking pages which you need to get indexed - If you block bots and pages using robots.txt, you should have a thorough understanding of the syntax. Any mistake can cause you huge problems, such as search engine spiders being kept away from pages you want indexed.
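One way to catch this before uploading is to test a list of pages you care about against the draft file. The following is a minimal sketch using Python's standard urllib.robotparser; the helper blocked_paths, the rules and the page list are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_lines, user_agent, important_paths):
    """Return the important paths that the rules would block for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [p for p in important_paths if not parser.can_fetch(user_agent, p)]


draft = [
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /products",  # too broad: also matches /products/widget.html
]
important = ["/", "/products/widget.html", "/about.html"]

print(blocked_paths(draft, "ExampleBot", important))
# prints ['/products/widget.html']
```

An empty list means none of the listed pages is blocked by the draft, at least as this parser reads it.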

6. URL paths are case sensitive - URL paths are often case sensitive, so be consistent with the site capitalization. WARNING! Many robots and web servers are case-sensitive, so a rule written with one capitalization (for example /Private/) will not match root-level folders named private or PRIVATE.
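The case-sensitivity can be demonstrated with a small check. This sketch assumes a rule for a hypothetical /Private/ folder and uses Python's standard urllib.robotparser, which compares paths case-sensitively; other bots may differ.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /Private/"])

# Only the exact capitalization used in the rule is blocked.
for path in ["/Private/report.html", "/private/report.html", "/PRIVATE/report.html"]:
    print(path, parser.can_fetch("ExampleBot", path))
# /Private/report.html False
# /private/report.html True
# /PRIVATE/report.html True
```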

7. Misspelled robot/user agent names - Bots will ignore misspelled User-agent names. Check your raw server logs to find the User-agent names you need to block. See UserAgentString.com for a list of user agent names.
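To illustrate, the sketch below blocks the whole site for a deliberately misspelled name ("Gogglebot") and for the intended one, then asks whether Googlebot may fetch a page. It relies on Python's standard urllib.robotparser as an approximation of how a bot matches its own name; the helper name is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, path):
    """Parse robots.txt lines and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, path)


misspelled = ["User-agent: Gogglebot", "Disallow: /"]
correct = ["User-agent: Googlebot", "Disallow: /"]

print(is_allowed(misspelled, "Googlebot", "/page.html"))  # True: the rule is ignored
print(is_allowed(correct, "Googlebot", "/page.html"))     # False: the bot is blocked
```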

8. Don’t add all the files on one single line - A common mistake is listing all the files under one Disallow line.
For example:

User-agent: *
Disallow: /private/ /images/ /javascript/

This syntax is wrong and robots will not understand it. The correct syntax is given below.

User-agent: *
Disallow: /private/
Disallow: /images/
Disallow: /javascript/
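The difference between the two forms can be tested directly. In this sketch Python's standard urllib.robotparser stands in for a bot (an assumption, since real crawlers may treat the malformed line differently), and the file names under each folder are made up.

```python
from urllib.robotparser import RobotFileParser


def allowed_map(robots_lines, paths):
    """Map each path to whether a generic bot may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return {p: parser.can_fetch("ExampleBot", p) for p in paths}


one_line = ["User-agent: *", "Disallow: /private/ /images/ /javascript/"]
per_line = [
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /images/",
    "Disallow: /javascript/",
]
paths = ["/private/a.html", "/images/logo.png", "/javascript/app.js"]

# The single line is read as one odd path, so none of the folders is blocked.
print(allowed_map(one_line, paths))
# One Disallow per line blocks all three folders.
print(allowed_map(per_line, paths))
```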

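9. Relying on an Allow command in robots.txt - The original robots.txt standard defines only one command, Disallow:, and has no command called Allow:. Google and some other major search engines support Allow: as an extension, but not every bot understands it. So if you want to allow all bots to visit a page, the safest approach is simply not to list it.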
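Whether a particular parser honors an Allow line can be checked rather than assumed. As a sketch, Python's standard urllib.robotparser does recognize Allow lines and applies the first rule that matches, so the Allow line is placed before the broader Disallow here. Other crawlers may resolve overlapping rules differently, and the paths and bot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Allow: /index.html",  # listed first: this parser uses the first matching rule
    "Disallow: /",
])

print(parser.can_fetch("ExampleBot", "/index.html"))  # True
print(parser.can_fetch("ExampleBot", "/about.html"))  # False
```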

10. Missing the colon - The colon after Disallow or User-agent is left out. Here is an example of an entry with a missing colon.

#This is a wrong entry
User-agent: googlebot
Disallow /

#The correct entry
User-agent: googlebot
Disallow: /
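The two entries above can be compared with a short check. This is a sketch using Python's standard urllib.robotparser, which skips lines without a colon; treating it as representative of real bots is an assumption, and the helper name is invented.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, path):
    """Parse robots.txt lines and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, path)


missing_colon = ["User-agent: googlebot", "Disallow /"]
with_colon = ["User-agent: googlebot", "Disallow: /"]

print(is_allowed(missing_colon, "googlebot", "/page.html"))  # True: the line is skipped
print(is_allowed(with_colon, "googlebot", "/page.html"))     # False: the site is blocked
```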

Please leave a comment if you know of any other common mistakes to avoid when creating a robots.txt file. Below are a few useful robots.txt resources and tools.

http://www.mcanerin.com/en/search-engine/robots-txt.asp
http://webtools.live2support.com/se_robots.php
http://googlewebmastercentral.blogspot.com/2008/03/speaking-language-of-robots.html

Category: Search Engine Optimization & Website Security

18 Comments → “10 common mistakes using robots.txt on your website”


  1. Designlabel

    2 years ago

    Thanks for this list. Bookmarked


  2. jolinarodriguez

    2 years ago

    Hello Thomson very informative about seo and good advice in adding robots.txt in our website very interesting.


  3. WebDesigner

    2 years ago

    nice tips….. thanks


  4. Thomson

    2 years ago

    Jolina,
    Thanks for the comments and good luck with your website. Also let me know if you have any problems while installing the robots.txt file on your site.


  5. Tuan Anh

    2 years ago

    This post is basic, but useful. Thanks!


  6. Michael

    2 years ago

    Hey,

    If I want the bot to read index.html in my root, but nothing else on the site, how would I do this?

    Thanks ya’ll!


  7. Jerrol Krause

    2 years ago

    Another invaluable resource is Google’s Webmaster Tools (www.google.com/webmasters/tools/). They have a whole section dedicated to helping you build and test your robots.txt file, and it will also give you a list of the URLs actually blocked for Google.


  8. Kathy

    2 years ago

    great post, learned lots… and you answered questions that needed to be answered


  9. Mahesh

    2 years ago

    Thanx for the info!


  10. Phaoloo

    2 years ago

    Helpful tips, can imagine these mistakes are so common


  11. blpgirl

    2 years ago

    You have come up with a very nice list, great information.

    Just one thing: Google does support the “Allow:” command. One use is when you want to stop your pages from being crawled but keep showing AdSense ads on them: you add a Disallow for all bots and an Allow for the Google bot that crawls pages with ads.


  12. vetweb

    2 years ago

    This is awesome. Thanks for the list!


  13. Analog Designer

    2 years ago

    Hello,

    27 (out of 100) of my site links have been blocked by the robots.txt file. Would it be a problem for getting traffic to our site? If so, what should I do to unblock the links?

  14. Thanks for providing this info.

  15. Thanks for valuable information We will follow your rules to create robots txt file for our websites.


  16. Rakesh

    1 year ago

    Ideal robots.txt file is as follows

    User-agent: *
    Crawl-delay: 2
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /category
    Disallow: /tag
    Disallow: /author
    Disallow: /trackback
    Disallow: /*trackback
    Disallow: /*trackback*
    Disallow: /*/trackback
    Disallow: /*?*
    Disallow: /*.html/$
    Disallow: /*feed*

    # Google Image
    User-agent: Googlebot-Image
    Disallow:
    Allow: /*

    # Google AdSense
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*

  17. Well written article on the robots.txt file. This post is recommended reading for every webmaster.


  18. Mal

    1 year ago

    RE: #9 — google’s own robots.txt has many ‘Allow’ statements: http://www.google.com/robots.txt

    So, is this still right?

