How to control search engines and web crawlers using the robots.txt file

You can specify which sections of your site search engines and web crawlers may crawl, and which sections they should ignore. To do this, add directives to a robots.txt file and place the file in your document root directory.
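
Crawlers request the file from the top level of the host, which is why it belongs in the document root. As a rough sketch in Python, the address a crawler would request can be derived from any page URL. The function name robots_url and the example.com addresses are illustrative assumptions, not part of any particular crawler.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    # The leading slash makes the path resolve against the host root,
    # no matter how deep the page itself sits in the site.
    return urljoin(page_url, "/robots.txt")


print(robots_url("https://example.com/blog/2020/post.html"))
```

A robots.txt file placed in a subdirectory, such as /blog/robots.txt, would not be found this way.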

The directives you specify in a robots.txt file are only requests. Although most search engines and many web crawlers respect these directives, they are not obligated to do so. Therefore, you should never rely on the robots.txt file to hide content you do not want indexed.

Table of Contents

Using robots.txt directives
More Information

Using robots.txt directives

The directives used in a robots.txt file are straightforward. The most commonly used are User-agent, Disallow, and Crawl-delay. Here are some examples:

Example 1: Allow all crawlers to access all files

User-agent: *
Disallow:

In this example, any crawler (specified by the User-agent directive and the asterisk wildcard) can access any file on the site, because the empty Disallow directive blocks nothing.

Example 2: Instruct all crawlers to ignore all files

User-agent: *
Disallow: /

In this example, all crawlers are instructed to ignore all files on the site.
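
One way to check how a rule set like this is interpreted is Python's standard urllib.robotparser module. The following is a minimal sketch, not a description of how any search engine evaluates the file. The helper parse_rules, the crawler name ExampleBot, and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Rule sets where an empty Disallow allows everything
# and "Disallow: /" blocks everything.
allow_all = parse_rules("User-agent: *\nDisallow:\n")
block_all = parse_rules("User-agent: *\nDisallow: /\n")

url = "https://example.com/docs/page.html"  # hypothetical page
print(allow_all.can_fetch("ExampleBot", url))
print(block_all.can_fetch("ExampleBot", url))
```

With these two rule sets, the first check should report that the page may be fetched and the second that it may not.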

Example 3: Instruct all crawlers to ignore a particular directory

User-agent: *
Disallow: /scripts/

In this example, all crawlers are instructed to ignore the scripts directory.

Example 4: Instruct all crawlers to ignore a particular file

User-agent: *
Disallow: /documents/index.html

In this example, all crawlers are instructed to ignore the /documents/index.html file. Other files in the documents directory are not affected.
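
The directory and file rules can be tried out together with Python's standard urllib.robotparser module. This sketch assumes both Disallow lines are placed in a single group, and it uses a made-up crawler name (ExampleBot) and made-up paths. It shows how this one parser matches paths, which other crawlers may do differently.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: one directory rule and one file rule in the same group.
RULES = """\
User-agent: *
Disallow: /scripts/
Disallow: /documents/index.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

base = "https://example.com"  # hypothetical site
paths = (
    "/scripts/run.cgi",        # inside the blocked directory
    "/documents/index.html",   # the blocked file
    "/documents/report.html",  # same directory, different file
    "/index.html",             # unrelated file
)
for path in paths:
    allowed = parser.can_fetch("ExampleBot", base + path)
    print(path, "allowed" if allowed else "blocked")
```

This parser compares each rule against the start of the requested path, so a directory rule such as /scripts/ covers everything beneath that directory, while a file rule covers only paths that begin with that exact file name.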

Example 5: Control the crawl interval

User-agent: *
Crawl-delay: 30

In this example, all crawlers are asked to wait at least 30 seconds between successive requests to the web server. Crawl-delay is a nonstandard directive, and not all crawlers honor it.
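
From the crawler's side, a well-behaved client could read this value and pause between requests. The sketch below uses the crawl_delay method of Python's standard urllib.robotparser module. The helper polite_delay, its fallback of one second, and the crawler name ExampleBot are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 30"])


def polite_delay(rules: RobotFileParser, agent: str, default: float = 1.0) -> float:
    """Return the number of seconds to wait between requests for this agent."""
    delay = rules.crawl_delay(agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


# A crawler would pass this value to time.sleep() between requests.
print(polite_delay(parser, "ExampleBot"))
```

The fallback matters because a robots.txt file without a Crawl-delay line gives the crawler no value to use.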

More Information

For more information about the robots.txt file, please visit http://www.robotstxt.org.

Article Details

Level: Beginner
