Robots.txt? What is that?

When Googlebot visits your website, one of the first things it does is look for a robots.txt file. Google recommends in its webmaster guidelines that you make use of robots.txt. So what is it?

The robots.txt file is a plain text file that you can create in a simple text editor such as Notepad. It gives instructions to search engine robots. It is uploaded to, and resides in, the root directory of a website (put simply, the "/" directory where the main index or home page is). It tells search engines which parts of the site they may and may not crawl, and so influences what may and may not appear in the results pages returned from searches.

If, for example, you have personal information on your site, like clients' addresses, then you would ban the search engine bots from the relevant directory or files. If you want to reduce the likelihood of image theft through your images appearing in image searches, one method might be to put your images in a separate directory on your site and ban the search engine bots from indexing that directory.
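As a rough illustration of banning bots from particular directories, here is a minimal Python sketch using the standard library's urllib.robotparser. The directory names /clients/ and /images/ and the www.example.com addresses are hypothetical, chosen only for this example. Python's parser is also not Googlebot itself, so treat the result as an approximation of how a well-behaved robot reads the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /clients/ and /images/ are example directory names only.
RULES = """\
User-agent: *
Disallow: /clients/
Disallow: /images/
"""


def is_allowed(rules: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# A page inside a banned directory is refused; the home page is not.
print(is_allowed(RULES, "http://www.example.com/clients/addresses.html"))  # False
print(is_allowed(RULES, "http://www.example.com/index.html"))              # True
```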

You can look at other sites' robots.txt files by typing the website's domain name followed by /robots.txt

eg: http://www.mydepictions.co.uk/robots.txt
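If you would rather work out that address from any page on a site, a small Python sketch along these lines should do it. The function name robots_url and the page path /some/page.html are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt address for the site that `page_url` belongs to."""
    parts = urlsplit(page_url)
    # Keep the scheme and domain, replace the path, drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path here is hypothetical; only the domain matters.
print(robots_url("http://www.mydepictions.co.uk/some/page.html"))
# http://www.mydepictions.co.uk/robots.txt
```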

They do vary and are specific to each site, so do not assume a file can just be replicated. They absolutely must be accurate: there is no room for typing errors or missing white space! Get it wrong and you could risk being excluded from the search engine results pages altogether.
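To see how little it takes to get it wrong, the following Python sketch compares two files that differ by a single "/" character. It uses the standard library's urllib.robotparser as a stand-in for a well-behaved robot, and www.example.com is a placeholder address.

```python
from urllib.robotparser import RobotFileParser

# The two files differ by one character on the Disallow line.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def allows_homepage(rules: str) -> bool:
    """Return True if the rules let Googlebot fetch the (placeholder) home page."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("Googlebot", "http://www.example.com/")


print(allows_homepage(ALLOW_ALL))  # True
print(allows_homepage(BLOCK_ALL))  # False
```

An empty Disallow line bans nothing, while "Disallow: /" bans the whole site, so it is worth testing a file before uploading it.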

If you have an empty file named robots.txt, Googlebot will visit the site and index all the contents. If you don't have a robots.txt file, Googlebot will again visit the site and index all the contents. The exceptions are problems on the server where the website is hosted, and noindex meta tags.

Now, you could ban the bots from pages you want excluded from search results by using a noindex meta tag on each page. The beauty of the robots.txt file is that it can be applied to a whole directory or to file types. Equally, rather than having <meta name="robots" content="INDEX,FOLLOW"> on every individual page in a site, a couple of lines in a robots.txt file can be applied to the whole site, thus:

User-agent: *
Disallow:

The above applies to all robots, and nothing is disallowed from the search engine indexes. If you want to instruct the bots to index the entire contents of your site and wish to make use of robots.txt, as Google recommends, you can copy and paste the above two lines into Notepad, save in plain text (.txt) format, name the file robots.txt (note the file name is all lower case), then upload it to your root directory. (Do bear in mind the exceptions: a noindex meta tag on a page will be obeyed.) So simple for something that, on the face of it, appears quite technical!
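If you prefer a script to Notepad, the same two-line file can be produced with a few lines of Python. This is only a sketch: the function name write_robots_txt is made up, and a temporary folder stands in for your site's root directory. You would still need to upload the result yourself.

```python
import tempfile
from pathlib import Path

# The two-line "allow everything" file shown above.
ALLOW_ALL_RULES = "User-agent: *\nDisallow:\n"


def write_robots_txt(site_root: Path) -> Path:
    """Write the allow-all rules to robots.txt (all lower case) in `site_root`."""
    target = site_root / "robots.txt"
    target.write_text(ALLOW_ALL_RULES, encoding="ascii")
    return target


# A temporary folder is used here in place of a real site root directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_robots_txt(Path(tmp))
    print(path.name)
    print(path.read_text(encoding="ascii"))
```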

If you have your site verified by Google, the webmaster tools console provides an option for testing your robots.txt file. If you don't have a robots.txt file, or have an empty file named robots.txt, the webmaster tools console will show a 404 error. As Google states, this is correct if the file is non-existent. Now, you could head over to the robotstxt.org site for further information, but if you just want some quick, simple, basic advice, I like feedthebot.com's Robots Text Files page.

Mydepictions Free Web Design Resources Page

2 comments:

Anonymous said...

Hi, this is Patrick Sexton from Feedthebot and SEOish.
I wanted to thank you for the mention but the link you give me is broken! If you get a minute, can you correct that please?
http://www.feedthebot.com/robottxt.html

Thanks,
Pat

Naj said...

Hi Pat,
I think you may have caught an earlier version I published. It all seems to be working now. Please let me know if it's not!