Robots
From JumbaWiki
Internet robots, also known as web crawlers or web spiders, are software agents that automatically scan websites and the sites they link to, either to map out the internet or to gather updated information (data mining). The most common are search engine bots such as Googlebot.
- There are other types of "robot" software not discussed in this article, e.g. spam bots. See article: Spam
robots.txt
The robots exclusion standard, or robots.txt protocol, is used to tell robots which parts of a site they should not crawl or index. This is done by placing a file named robots.txt in the top-level (root) directory of the site.
Note: robots.txt cannot actually stop robots from indexing the site; it only advises a robot what you would like left out. Well-behaved bots from Google, Yahoo, etc. will take notice of robots.txt, but others will quite happily ignore it (and could even deliberately index exactly what you don't want them to). You should therefore not rely exclusively on robots.txt to protect sensitive parts of your site.
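Because the file always lives in the top-level directory, its address can be worked out from any page on the site. The following is a minimal Python sketch of that idea; the function name robots_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site hosting page_url.

    Hypothetical helper: keeps the scheme and host, and replaces the
    path, query and fragment with the fixed root-level file name.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Any page on the site maps to the same robots.txt location.
    print(robots_url("http://www.example.com/private/page.html?x=1"))
```

A robots.txt file placed in a subdirectory (for example /private/robots.txt) would not be found this way, which is why the file has to sit at the root.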
Changing the robots.txt file
This example keeps all robots out:
User-agent: *
Disallow: /
This example keeps all robots out of four directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
This example keeps one particular bot out of a single directory:
User-agent: Googlebot
Disallow: /private/
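To see how a well-behaved robot would read rules like these, you can test them with Python's standard urllib.robotparser module. This is only a sketch: the combined rule set, the bot name OtherBot and the example.com addresses are assumptions made for illustration, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set for illustration: one group for a named bot,
# one catch-all group for every other robot.
RULES = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""


def build_parser(rules_text: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(RULES)
    site = "http://www.example.com"
    for agent, path in [
        ("Googlebot", "/private/report.html"),
        ("Googlebot", "/tmp/cache.html"),
        ("OtherBot", "/tmp/cache.html"),
        ("OtherBot", "/index.html"),
    ]:
        allowed = parser.can_fetch(agent, site + path)
        print(agent, path, "allowed" if allowed else "disallowed")
```

Note that a robot with its own group follows only that group, so in this sketch the named bot is not bound by the catch-all rules. If you want a rule to apply to every robot, repeat it in each group.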
HTML robot restrictions
You can place this tag inside the head element of a page to ask robots not to index the page (noindex) and not to follow the links on it (nofollow):
<meta name="robots" content="noindex,nofollow" />
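As a rough illustration of how a robot might read this tag, the Python sketch below pulls the directives out of a page using the standard html.parser module. The class and function names are hypothetical, and a real crawler would handle more cases (for example bot-specific meta names).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
    found = robots_directives(page)
    print("index this page:", "noindex" not in found)
    print("follow its links:", "nofollow" not in found)
```

As with robots.txt, the tag is only a request: a robot has to download the page to see it, and a badly behaved robot can simply ignore it.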

