What is a Robots.txt File?
A robots.txt file is a simple text file placed in the root directory of your website. It uses the Robots Exclusion Protocol (REP), a standard established in 1994, to tell web robots (often called crawlers or spiders) which pages or files they are allowed to request from your site.
Think of it as a "Code of Conduct" or a "Do Not Enter" sign for search engine bots. When a bot like Googlebot visits your site, the very first thing it does is look for yourdomain.com/robots.txt to see if it has permission to crawl the pages it wants to visit.
The robots.txt file is not a security measure—it's a courtesy protocol that well-behaved crawlers respect. Malicious scrapers and bots will ignore it entirely. For sensitive content, always use proper authentication or password protection.
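To make the permission check described above concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The rule set and the example.com URLs are made up for illustration; a real crawler would download the file from the site before parsing it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real crawler would fetch them from /robots.txt.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a given crawler may request a given URL.
allowed_home = parser.can_fetch("Googlebot", "https://example.com/")
allowed_private = parser.can_fetch("Googlebot", "https://example.com/private/report.html")
print(allowed_home, allowed_private)
```

Nothing here enforces the answer. A well-behaved crawler simply chooses not to request the blocked URL.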
Why is Robots.txt Important for SEO?
While a robots.txt file won't magically rank your site higher on Google, it plays a vital role in crawl budget optimization, preventing duplicate content issues, and protecting server resources.
1. Maximizing Your Crawl Budget
Search engines allocate a specific "crawl budget" to your site—the number of pages a search engine bot will crawl and index within a given timeframe. For large websites (e.g., an e-commerce site with thousands of faceted navigation URLs), bots might waste their budget crawling useless pages like:
- Shopping cart and checkout URLs
- Internal search result pages
- Filtered product variations (color=red&size=large&sort=price)
- Tag and category archives with thin content
- Print-friendly versions of pages
By disallowing these non-essential paths in your robots.txt, you steer search engines toward spending their crawl budget on your most important, revenue-generating pages.
2. Preventing Duplicate Content Issues
Many CMS platforms generate multiple URLs that display the same content. For example, a single blog post might be accessible at:
- /blog/post-title/
- /blog/category/post-title/
- /blog/author/post-title/
- /blog/date/2024/10/post-title/
Without proper direction, search engines might index all these variations, splitting link equity and confusing their algorithms about which version to rank. Robots.txt can block the duplicate pathways while keeping the canonical version accessible.
3. Preventing Server Overload
Some bots are aggressive and can send hundreds of requests per second, potentially overloading your server and slowing down your site for actual human visitors. While Google handles crawling intelligently with its "crawl rate" algorithms, other bots might need explicit throttling. You can use the Crawl-delay directive to tell bots to wait a few seconds between requests.
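As a rough sketch of how a polite crawler might read this setting, Python's standard-library parser exposes the value through crawl_delay(). The rules and the bot name "SomeOtherBot" are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: only bingbot is asked to slow down.
rules = """\
User-agent: bingbot
Crawl-delay: 5

User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

bing_delay = parser.crawl_delay("bingbot")        # seconds requested for bingbot
other_delay = parser.crawl_delay("SomeOtherBot")  # None when no delay is set

# A crawler could sleep for this many seconds between requests.
wait_seconds = bing_delay or 0
```

Whether a bot actually waits is up to the bot. The directive is a request, not a rate limit enforced by the server.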
4. Keeping Private and Staging Files Out of Search Results
If you have staging areas, internal administration panels (like /wp-admin/), or directories containing sensitive PDFs, you can use robots.txt to stop bots from crawling them. Common examples include:
- /staging/ - Development environments
- /internal/ - Employee-only resources
- /temp/ - Temporary or test files
- /backup/ - Database or file backups
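Keep in mind that blocking crawling does not guarantee a URL stays out of the index. For pages that must not appear in search results, use a noindex meta tag instead of relying on robots.txt alone.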
Basic Syntax of Robots.txt
A robots.txt file consists of one or more blocks of rules. Each block begins with a User-agent line, followed by one or more Allow or Disallow directives. The file is plain text (UTF-8 encoding recommended) and must be named exactly robots.txt (case-sensitive).
The User-agent Directive
The User-agent specifies which crawler the following rules apply to. You can target specific bots or use the wildcard to target all bots.
- User-agent: * - Applies to ALL bots that respect robots.txt
- User-agent: Googlebot - Applies only to Google's web crawler
- User-agent: bingbot - Applies only to Microsoft's Bing crawler
- User-agent: Googlebot-Image - Applies only to Google's image crawler
- User-agent: GPTBot - Applies to OpenAI's AI training crawler
You can include multiple User-agent blocks for different crawlers. If multiple blocks apply to the same bot, the most specific (longest matching) User-agent takes precedence.
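The precedence rule can be sketched as a small function. This is an illustration, not any search engine's actual code: select_group is a hypothetical helper, and it assumes a group name matches when it is a case-insensitive prefix of the crawler's own token.

```python
def select_group(group_names, crawler_token):
    """Return the User-agent value whose rules the crawler should follow."""
    token = crawler_token.lower()
    # Candidate groups: named groups that are a prefix of the crawler's token.
    matches = [name for name in group_names
               if name != "*" and token.startswith(name.lower())]
    if matches:
        # The longest (most specific) matching name wins.
        return max(matches, key=len)
    # Fall back to the wildcard group, if the file has one.
    return "*" if "*" in group_names else None


groups = ["*", "Googlebot", "Googlebot-Image"]
image_group = select_group(groups, "Googlebot-Image")
web_group = select_group(groups, "Googlebot")
fallback_group = select_group(groups, "bingbot")
```

With these inputs, the image crawler follows its own block, the main crawler follows the Googlebot block, and bingbot falls back to the wildcard block.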
The Disallow Directive
The Disallow directive tells crawlers which URL paths they should not crawl. It accepts a partial path, and any URL starting with that path will be blocked.
- Disallow: / - Blocks crawling of the entire website (use with extreme caution)
- Disallow: /private/ - Blocks crawling of everything inside the /private/ directory
- Disallow: /search - Blocks crawling of /search, /search?q=test, /search.html, etc.
- Disallow: /admin - Blocks crawling of admin directories and admin-panel URLs
- Disallow: (empty value) - Allows crawling of everything (equivalent to having no disallow rules)
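The prefix behavior is easy to see in a few lines of Python. This is a simplified sketch that ignores wildcards and Allow rules; is_disallowed is a hypothetical helper name.

```python
def is_disallowed(path, disallow_rules):
    """Return True if the path starts with any non-empty Disallow value."""
    # An empty Disallow value blocks nothing, so it is skipped.
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


rules = ["/private/", "/search"]
blocked_query = is_disallowed("/search?q=test", rules)
blocked_html = is_disallowed("/search.html", rules)
blocked_about = is_disallowed("/about", rules)
```

Note how "/search" without a trailing slash also catches "/search.html", which is why a trailing slash matters when you mean a directory.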
The Allow Directive
The Allow directive is used to override a Disallow rule for a specific subpath. This is useful when you want to block an entire directory but allow a specific file or subdirectory within it.
- Allow: /public/visible-file.html - Allows crawling of a specific file even if its parent directory is disallowed
- Allow: /blog/allowed-directory/ - Allows a specific subdirectory within a blocked parent
Without the Allow directive, you would have to list every single file you wanted to block individually.
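One way to picture the override is the longest-match rule described in RFC 9309 and used by Google: the most specific matching rule wins, and a tie goes to Allow. The sketch below assumes that behavior (other crawlers may evaluate rules in file order instead) and ignores wildcards. is_allowed is a hypothetical helper.

```python
def is_allowed(path, rules):
    """rules: list of (directive, prefix) pairs, e.g. ("disallow", "/public/")."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, prefer Allow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [
    ("disallow", "/public/"),
    ("allow", "/public/visible-file.html"),
]
visible = is_allowed("/public/visible-file.html", rules)
hidden = is_allowed("/public/other.html", rules)
```

Here the Allow rule is longer than the Disallow rule, so the single file stays crawlable while the rest of the directory remains blocked.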
- Each directive must be on its own line
- There must be a colon after the directive name, conventionally followed by a space (e.g., Disallow: /path)
- Paths are case-sensitive on most servers (/About and /about are different)
- Empty lines are ignored
- Lines starting with # are comments and are ignored
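These syntax rules translate into a very small line parser. The sketch below is illustrative only (parse_lines is a hypothetical helper): it also strips trailing comments and skips lines without a colon, which goes slightly beyond the rules listed above.

```python
def parse_lines(text):
    """Return a list of (directive, value) pairs from robots.txt text."""
    pairs = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)     # split on the first colon only
        # Directive names are case-insensitive; values (paths) are kept as-is.
        pairs.append((name.strip().lower(), value.strip()))
    return pairs


sample = """\
# Rules for every crawler
User-agent: *

Disallow: /About  # path case is preserved
Sitemap: https://example.com/sitemap.xml
"""
parsed = parse_lines(sample)
```

Splitting on the first colon only is what keeps the "https:" in a Sitemap URL intact.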
Complete Guide to Robots.txt Directives
Beyond the basic Allow and Disallow directives, there are several other directives you should know about.
Sitemap Directive
Tells crawlers where to find your XML sitemap. This directive can appear anywhere in the file but is conventionally placed at the end.
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-images.xml
You can list multiple sitemaps, one per line.
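For a sense of how a crawler picks these up, Python's standard-library parser collects Sitemap lines and returns them from site_maps() (available in Python 3.8 and later). The rules below are placeholders.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /search/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
""".splitlines())

# Sitemap lines apply to the whole file, not to a particular User-agent block.
sitemaps = parser.site_maps()
```

site_maps() returns None when the file lists no sitemaps, so check for that before iterating.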
Crawl-delay Directive
Requests that a crawler wait a specified number of seconds between successive requests. Note: Googlebot ignores this directive entirely.
User-agent: bingbot
Crawl-delay: 5
User-agent: YandexBot
Crawl-delay: 1
Host Directive (Yandex-specific)
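Yandex historically supported a Host directive that specifies the preferred domain (www vs. non-www) when your site is accessible via multiple URLs. Yandex announced in 2018 that it no longer relies on this directive and recommends 301 redirects instead; other search engines have never supported it.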
Host: https://www.example.com
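Clean-param Directive (Yandex-specific)
Tells Yandex which URL parameters don't change page content, so URLs that differ only in those parameters are treated as a single page. Google does not support Clean-param. Google previously offered a URL Parameters tool in Search Console for a similar purpose, but that tool has been retired.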
Using Wildcards in Robots.txt
Most major search engines (including Google, Bing, and Yandex) support wildcard characters in robots.txt patterns, though the original specification didn't include them.
The Asterisk (*) Wildcard
The * character matches any sequence of characters (including zero characters).
# Block all URLs containing "sessionid"
Disallow: /*sessionid=
# Block all URLs with the "print" parameter
Disallow: /*print=true
# Block any URL under /temp/ that contains ".tmp"
Disallow: /temp/*.tmp
# Block all URLs containing ".pdf" (add $ to match only URLs that end in .pdf)
Disallow: /*.pdf
The Dollar Sign ($) Wildcard
The $ character matches the end of a URL, allowing for exact pattern matching.
# Block only the exact URL /search (not /search?q=test)
Disallow: /search$
# Block only URLs ending with .php (not .php?id=123)
Disallow: /*.php$
Combining Wildcards
# Block all URLs with "page=" parameter AND ending with "0"
Disallow: /*page=*0$
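One way to reason about these patterns is to translate them into regular expressions. The sketch below is a simplified illustration, not any search engine's actual matcher: pattern_to_regex and matches are hypothetical helpers, and only * and a trailing $ are treated as special.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, url_path):
    # re.match anchors at the start of the string, mirroring prefix matching.
    return pattern_to_regex(pattern).match(url_path) is not None


exact_search = matches("/search$", "/search")
search_query = matches("/search$", "/search?q=test")
php_with_query = matches("/*.php$", "/index.php?id=123")
page_ten = matches("/*page=*0$", "/list?page=10")
```

Running a handful of real URLs from your own site through a helper like this before deploying a new pattern is a cheap way to catch rules that block more than intended.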
Where Does the Sitemap Go?
Your robots.txt file is the perfect place to declare your XML sitemap. Simply add the following line at the bottom of the file (it can technically be anywhere, but convention is the end):
Sitemap: https://www.yourdomain.com/sitemap.xml
This makes it incredibly easy for search engines to find all your important pages immediately after reading your crawling rules. You can also list multiple sitemaps if you have separate ones for different content types:
Sitemap: https://www.yourdomain.com/sitemap-pages.xml
Sitemap: https://www.yourdomain.com/sitemap-posts.xml
Sitemap: https://www.yourdomain.com/sitemap-products.xml
Sitemap: https://www.yourdomain.com/sitemap-images.xml
Why this matters: While you can submit sitemaps directly through Google Search Console and Bing Webmaster Tools, listing them in robots.txt ensures that any crawler that respects robots.txt can find your sitemap automatically.
Location and Accessibility Requirements
For robots.txt to work correctly, it must be placed in the exact right location and be accessible to crawlers.
Correct Location
The robots.txt file MUST be placed in the root directory of your domain. This is the top-level directory from which your website is served.
- Correct: https://www.example.com/robots.txt
- Incorrect: https://www.example.com/assets/robots.txt
- Incorrect: https://www.example.com/subdirectory/robots.txt
Note that subdomains have their own root directories. blog.example.com/robots.txt is separate from example.com/robots.txt.
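Put differently, the robots.txt location depends only on the scheme and host of a URL, never on its path. A small sketch with Python's urllib.parse (robots_url is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_site = robots_url("https://www.example.com/assets/page.html?x=1")
blog_site = robots_url("https://blog.example.com/2024/post")
```

The two results differ because the hosts differ, which is the subdomain separation described above.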
File Naming
The filename must be exactly robots.txt (lowercase). Robots.txt, ROBOTS.TXT, or robots.TXT will not work correctly on case-sensitive servers.
HTTP Status Codes
When the file exists, it should return a 200 OK HTTP status code. Other responses are handled as follows:
- 404 Not Found: No robots.txt file exists (this is fine—crawlers assume no restrictions)
- 302 Redirect: The file is redirecting elsewhere (crawlers may not follow the redirect)
- 500 Server Error: The file cannot be accessed (crawlers may assume the worst and avoid crawling)
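The handling above can be summarized as a simple decision function. This is only a sketch of the behavior described in this section: interpret_status is a hypothetical helper, the returned labels are made up, and individual crawlers differ in the details.

```python
def interpret_status(status_code):
    """Map a robots.txt HTTP status to the crawler behavior described above."""
    if status_code == 200:
        return "parse the rules"
    if status_code == 404:
        return "no restrictions"
    if 300 <= status_code < 400:
        return "redirect (may not be followed)"
    if status_code >= 500:
        return "treat site as blocked for now"
    # Anything else is left to the individual crawler.
    return "crawler-specific"


ok = interpret_status(200)
missing = interpret_status(404)
server_error = interpret_status(500)
```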
Testing Accessibility
You can test if your robots.txt is accessible using curl:
curl -I https://example.com/robots.txt
Look for HTTP/2 200 (or HTTP/1.1 200 OK) on the first line of the response.
How to Check Your Robots.txt Status
After creating or modifying your robots.txt file, always verify it works as expected.
Google Search Console robots.txt Report
Google retired the legacy robots.txt Tester. Its replacement is the robots.txt report in Google Search Console:
- Log into Google Search Console
- Navigate to "Settings" → "Crawling" → "robots.txt"
- Review the robots.txt files Google found, when each was last fetched, and any warnings or errors
- Use the URL Inspection tool to check whether a specific URL is blocked by robots.txt
This report reflects Google's own parsing rules, so it's the most reliable check for Googlebot compatibility.
Bing Webmaster Tools
Bing offers a similar tool in their Webmaster Tools dashboard under "Configure My Site" → "Robots.txt Tester".
Manual Testing with cURL
# Fetch the robots.txt file
curl https://example.com/robots.txt
# Check HTTP headers
curl -I https://example.com/robots.txt
Online Testing Tools
- Google's Robots Testing Tool (standalone version)
- SEO Site Checkup Robots.txt Validator
- SmallSEOTools Robots.txt Generator/Validator
Real-World Robots.txt Examples
Example 1: Basic Blog Setup
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /search/
Allow: /wp-content/uploads/
Sitemap: https://example.com/sitemap.xml
This setup blocks admin and system directories while allowing uploaded images. The search results page is also blocked to prevent thin content from being indexed.
Example 2: E-commerce Site
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /search/
Disallow: /*filter=*
Disallow: /*sort=*
Disallow: /*page=*&
Allow: /products/
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-categories.xml
This blocks shopping cart, checkout, and account pages, along with filtered and sorted product URLs that create duplicate content. Parameter-based blocking with wildcards prevents endless URL variations.
Example 3: Multi-bot Configuration
# Allow all bots except specific ones
User-agent: *
Disallow:
# Slow down aggressive bots
User-agent: YandexBot
Crawl-delay: 2
User-agent: Baiduspider
Crawl-delay: 3
# Block AI scrapers entirely
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Sitemap: https://example.com/sitemap.xml
This configuration allows all bots by default, slows down aggressive international crawlers, and blocks AI training bots entirely.
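A configuration like this is worth sanity-checking before it goes live. The sketch below feeds a shortened version of the example to Python's standard-library parser. Keep in mind that this parser does not implement wildcards or Google's longest-match rule, so it is only suitable for simple prefix rules like these.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the multi-bot example above.
rules = """\
User-agent: *
Disallow:

User-agent: YandexBot
Crawl-delay: 2

User-agent: GPTBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

results = {
    "Googlebot": parser.can_fetch("Googlebot", "https://example.com/page"),
    "GPTBot": parser.can_fetch("GPTBot", "https://example.com/page"),
    "YandexBot delay": parser.crawl_delay("YandexBot"),
}
print(results)
```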
Example 4: Staging/Development Site
# Block all crawlers from staging site
User-agent: *
Disallow: /
# Explicit Googlebot group (optional; redundant with the wildcard group above)
User-agent: Googlebot
Disallow: /
For staging sites, completely blocking all crawlers is often the right choice to prevent test content from appearing in search results.
Frequently Asked Questions
Q: How long does it take for robots.txt changes to take effect?
Google generally caches robots.txt for up to 24 hours, so changes are usually picked up within a day or two. For urgent changes, you can request a recrawl of robots.txt in Google Search Console. Other search engines may take several days to notice changes.
Q: Can I use robots.txt to prevent a page from appearing in search results?
No. Robots.txt only prevents crawling, not indexing. If other sites link to a page you've disallowed, Google can still index the URL (it just won't have a description). To remove a page from search results, use the noindex meta tag or the URL removal tool. For noindex to work, the page must stay crawlable: a crawler that is blocked by robots.txt never sees the tag.
Q: Does robots.txt hide pages from users?
No. Robots.txt only affects crawlers. Users can still access blocked pages directly or via links. If you need to restrict access to users, use password protection or IP whitelisting.
Q: What happens if I don't have a robots.txt file?
Nothing bad! Crawlers assume no restrictions and will attempt to crawl all accessible URLs. Many small websites don't need a robots.txt file at all.
Q: Can I have multiple robots.txt files for different subdomains?
Yes. Each subdomain (e.g., blog.example.com, shop.example.com) has its own root directory and can have its own robots.txt file. They don't inherit rules from the main domain.
