robots.txt

This is a small file that tells spiders what they can index and what they should keep away from.

There may be areas of a site that you want to block off from spiders so that the information in those sections does not (or should not) end up on the web. Content management systems have areas that you don’t want found on the Net, so you use the robots.txt file to keep spiders out of them.

WordPress, for example, suggests you block about 13 folders through your robots.txt file, Joomla about 10. We have tagging on the pages in the blog portion of our site. Because of this, you could reach the same page in a number of different ways, and thus through a number of URLs. We therefore block the tag URLs so that they can’t be indexed and a page doesn’t end up on the web under two URLs, giving us a duplicate content issue.

You have to be careful with the syntax in the file. For example:

User-agent: *
Disallow: /

This blocks everything: it instructs robots not to index anything.

User-agent: *
Disallow:

Without the trailing slash, the rule now means index everything. It’s only a subtle change, and the word “disallow” would logically suggest that spiders are blocked, but they are not: an empty Disallow value blocks nothing. You can have a look at the syntax used here.
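The difference between the two rule sets is easy to confirm for yourself. The following is a minimal sketch using Python's standard urllib.robotparser module; the helper name is_allowed and the example.com URL are assumptions made up for illustration, and a real spider may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(agent, url)


# The two rule sets discussed above, differing only by the trailing slash.
BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"

url = "https://example.com/page.html"  # hypothetical URL
print(is_allowed(BLOCK_ALL, url))  # False: everything is blocked
print(is_allowed(ALLOW_ALL, url))  # True: nothing is blocked
```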

While the robots.txt file asks robots to conform to your request, compliance is voluntary, and that data may still end up indexed on the web. The best way to keep your data safe is not to have it on the web at all, if possible. If these files must be on a server, consider password-protecting them through something like .htaccess.

If you do make changes to your file, make sure that it instructs the spiders to do what you want. The best way to see what they will and will not crawl is to run your site through “Link Sleuth”. It follows your robots.txt file and will therefore show you what’s up for grabs and what’s not.
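For a quick check before running a full crawl, you can also test a handful of URLs against your rules with a short script. This is only a rough sketch using Python's standard urllib.robotparser module. The rules, the paths /admin/ and /blog/tag/, the example.com URLs and the function name split_by_access are all hypothetical, so substitute the contents of your own file and the pages you care about.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; paste in your own robots.txt.
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /blog/tag/
"""


def split_by_access(rules, urls, agent="*"):
    """Split `urls` into (allowed, blocked) lists according to `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    allowed, blocked = [], []
    for url in urls:
        if parser.can_fetch(agent, url):
            allowed.append(url)
        else:
            blocked.append(url)
    return allowed, blocked


# Hypothetical pages to check.
URLS = [
    "https://example.com/",
    "https://example.com/blog/my-post",
    "https://example.com/blog/tag/seo",
    "https://example.com/admin/login",
]

allowed, blocked = split_by_access(RULES, URLS)
print("Crawlable:", allowed)
print("Blocked:", blocked)
```

To test the file that is actually live on a site rather than a pasted copy, the same parser can load it with set_url() followed by read() instead of parse().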