The Major AI Crawlers: Know Your Bots
To block AI bots, you need to know their User-agent strings. Here are the most common ones currently scouring the web for training data, along with their purposes and documentation sources.
OpenAI Crawlers
GPTBot: OpenAI's primary web crawler used to collect data for training GPT-4, GPT-5, and future models. It respects robots.txt and identifies itself clearly.
- User-agent: GPTBot
- Documentation: OpenAI's official GPTBot documentation
- Default behavior: Respects robots.txt; can be blocked individually
ChatGPT-User: This bot is different from GPTBot. It's activated when a ChatGPT Plus user uses the "Browse with Bing" feature to ask ChatGPT to read and summarize a specific URL. Blocking this bot prevents real-time summarization of your content in ChatGPT responses.
- User-agent: ChatGPT-User
- Purpose: Real-time content access for ChatGPT users
- Note: Blocking this does NOT affect GPTBot
Google AI Crawlers
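Google-Extended: Announced in September 2023, this is the robots.txt token Google provides for opting out of AI training. It is a control token rather than a separate crawler: Google's existing crawlers do the fetching, and the token governs whether the fetched content may be used for AI. Critically, blocking Google-Extended does NOT affect your inclusion in Google Search results, which are handled by Googlebot.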
- User-agent: Google-Extended
- Purpose: Training Google's Vertex AI, Gemini, and other AI models
- Key distinction: Completely separate from search indexing
Anthropic Crawlers
ClaudeBot (formerly Anthropic-AI): The crawler for Anthropic (creators of Claude), used to collect training data for Claude AI models.
- User-agent: ClaudeBot (formerly Anthropic-AI)
- Documentation: Anthropic's official crawler documentation
- Status: Actively crawling the web as of 2024
Common Crawl (AI Training Dataset)
CCBot: Common Crawl is a non-profit that crawls the web and provides free datasets. These datasets are used by almost every major AI company (OpenAI, Meta, Google, Anthropic) as training data.
- User-agent: CCBot
- Purpose: Building open web datasets for research and AI training
- Impact: Even if you block GPTBot, your content might still end up in AI models if CCBot crawls and includes your site in their dataset
Meta (Facebook) AI Crawlers
Meta-ExternalAgent: Meta's crawler for AI training, including LLaMA models.
- User-agent: Meta-ExternalAgent (for AI training); facebookexternalhit (for link previews)
- Purpose: Training Meta's AI models
- Note: facebookexternalhit generates social media link previews, so blocking it also breaks previews on Facebook
Amazon AI Crawlers
Amazonbot: Amazon's crawler for general web discovery. Increasingly used for AI and product intelligence.
- User-agent: Amazonbot
- Documentation: Amazon's official bot documentation
- Respects robots.txt: Yes
How to Block All Major AI Crawlers
If you've decided to opt out of AI training data collection, here's a comprehensive robots.txt configuration to block the major crawlers.
Complete AI Blocking Configuration
# ============================================
# AI TRAINING CRAWLERS - COMPLETE BLOCKING CONFIGURATION
# Last updated: October 2024
# ============================================
# OpenAI - Primary training crawler
User-agent: GPTBot
Disallow: /
# OpenAI - Real-time ChatGPT user requests
User-agent: ChatGPT-User
Disallow: /
# Google - AI training (does NOT affect search ranking)
User-agent: Google-Extended
Disallow: /
# Anthropic (Claude AI) - Training crawler
User-agent: ClaudeBot
Disallow: /
# Anthropic (older User-agent - include for compatibility)
User-agent: Anthropic-AI
Disallow: /
# Common Crawl - Provides datasets used by multiple AI companies
User-agent: CCBot
Disallow: /
# Meta (Facebook) - AI training crawler
User-agent: Meta-ExternalAgent
Disallow: /
# Amazon - AI and product crawler
User-agent: Amazonbot
Disallow: /
# Apple - Applebot powers Siri and Spotlight search (optional; blocking also removes you from those results)
User-agent: Applebot
Disallow: /
# ByteDance (TikTok) - AI crawler
User-agent: Bytespider
Disallow: /
# ============================================
# NOTE: The following bots have legitimate uses
# Block only if you're sure
# ============================================
# Facebook link previews (blocking breaks social sharing)
# User-agent: facebookexternalhit
# Disallow: /
# Twitter/X card previews (blocking breaks social sharing)
# User-agent: Twitterbot
# Disallow: /
# LinkedIn previews (blocking breaks social sharing)
# User-agent: LinkedInBot
# Disallow: /
Explanation of Each Block
- GPTBot & ChatGPT-User: Blocks both training and real-time access
- Google-Extended: Optional but recommended if you want to opt out of Google's AI training
- ClaudeBot/Anthropic-AI: Blocks Anthropic's crawler (use both for compatibility)
- CCBot: Critical if you want to avoid your content appearing in Common Crawl datasets
- Meta-ExternalAgent: Blocks Meta's AI training (separate from social previews)
- Amazonbot & Bytespider: Blocks e-commerce and TikTok AI crawlers
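One way to sanity-check the configuration above before deploying it is to parse it with Python's standard-library urllib.robotparser and ask whether each crawler may fetch a page. The sketch below is illustrative only: the abbreviated ROBOTS_TXT string, the example.com URL, and the blocked_agents helper are assumptions made for this example, and real crawlers may match User-agent tokens differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the blocking rules (assumption: paste your real file here).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/article"):
    """Return the agents from `agents` that the rules forbid from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Googlebot is included to confirm that search crawling stays allowed.
    to_check = ["GPTBot", "ClaudeBot", "CCBot", "Googlebot"]
    print(blocked_agents(ROBOTS_TXT, to_check))
```

If Googlebot ever shows up in the returned list, a rule is broader than intended and is worth reviewing before the file goes live.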
Less Restrictive Alternative: Rate Limiting Instead of Blocking
If you don't want to completely block AI crawlers but want to reduce their server impact:
# Allow but throttle AI crawlers
User-agent: GPTBot
Crawl-delay: 10
# (Not allowing or disallowing - default is allow)
User-agent: CCBot
Crawl-delay: 5
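This tells crawlers they can access your site, but only at a slow rate; the value is conventionally read as the number of seconds to wait between requests. Note: Crawl-delay is a nonstandard directive that is not part of the Robots Exclusion Protocol (RFC 9309), and not all AI crawlers honor it. Google's crawlers, for example, ignore it.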
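To see how a well-behaved client might interpret these throttling rules, the following sketch reads them back with Python's urllib.robotparser. The THROTTLE_RULES string simply restates the example above, and "/any-page" is a placeholder path; this shows one parser's interpretation, not a guarantee of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The throttling rules from the example above, restated as a string.
THROTTLE_RULES = """\
User-agent: GPTBot
Crawl-delay: 10

User-agent: CCBot
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(THROTTLE_RULES.splitlines())

for agent in ("GPTBot", "CCBot", "Googlebot"):
    # crawl_delay() returns None when no Crawl-delay applies to the agent;
    # can_fetch() stays True because no Disallow rule is present.
    delay = parser.crawl_delay(agent)
    allowed = parser.can_fetch(agent, "/any-page")
    print(f"{agent}: delay={delay}, allowed={allowed}")
```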
Should You Block AI Crawlers? The Pros and Cons
Before you rush to block every AI bot, consider these trade-offs. The decision depends on your business model, content strategy, and values.
Pros of Blocking AI Crawlers
- Protect Intellectual Property: Prevent LLMs from reproducing your unique writing, research, analysis, or creative work without attribution or compensation.
- Maintain Competitive Advantage: If your content is proprietary (research, data analysis, industry insights), giving it to AI models for free could train your competitors' AI tools.
- Save Server Resources: AI crawlers can be aggressive. Blocking them saves bandwidth and server CPU. Some publishers report AI crawlers consuming 2-5% of their bandwidth.
- Legal and Ethical Concerns: Many content creators object to having their work used to train commercial AI models without permission or payment. Blocking is a form of opting out.
Cons of Blocking AI Crawlers
- Zero Visibility in AI Chatbots: If you block ChatGPT-User and a ChatGPT user asks "Summarize the latest article from [Your Website]," ChatGPT will respond that it cannot access the site. You lose potential brand exposure in the AI ecosystem.
- Missed Referral Traffic: Some AI chatbots (Perplexity, You.com) provide citations and links back to sources. Blocking crawlers prevents these citations and potential referral traffic.
- The Data Might Already Be Gone: If your site has been online for years, your content is likely already in Common Crawl datasets and baked into existing models (GPT-4, Claude 2, etc.). Blocking now only prevents future data collection.
- Honor System Only: Robots.txt is a voluntary standard. Many aggressive AI scrapers, especially those from less reputable companies or individuals, ignore robots.txt entirely.
Decision Matrix: Should You Block?
| Your Site Type | Recommendation | Reasoning |
|---|---|---|
| News/Journalism | Likely Block | Original reporting is valuable IP. Many news orgs are negotiating (or suing) over AI training. |
| Educational/Reference | Consider Carefully | AI chatbots could drive referral traffic from students, but may also summarize without linking. |
| E-commerce/Product | Likely Allow | Product descriptions are less unique; AI referrals could drive sales. |
| Personal Blog | Personal Choice | Depends on whether you want your writing to influence AI (allow) or keep it human-only (block). |
| Corporate/Brand Site | Likely Block | Brand voice and messaging are proprietary. You don't want AI mimicking your brand without control. |
Legal Aspects: Is Blocking Enough?
Understanding the legal landscape around AI crawling is essential for publishers.
Robots.txt as a Legal Signal
Courts have at times treated robots.txt as evidence of a website owner's access preferences. In eBay v. Bidder's Edge (2000), eBay obtained a preliminary injunction on the theory that unauthorized automated crawling of its site constituted trespass to chattels (unauthorized use of computer systems).
For AI training, several class-action lawsuits (including against OpenAI and Meta) argue that ignoring robots.txt to scrape content for AI training amounts to one or more of the following:
- Terms of Service violations
- Copyright infringement (for reproducing content in training data)
- Computer Fraud and Abuse Act violations
Terms of Service vs. Robots.txt
Robots.txt is machine-readable; Terms of Service are human-readable. For stronger legal protection:
- Include explicit prohibitions on AI training in your website's Terms of Service
- Reference your robots.txt in your terms
- Use both together for layered protection
Sample Terms of Service Clause
AI Training and Web Scraping
You may not use any automated system, including without limitation "robots," "spiders," or "offline readers," to access our website for the purpose of training artificial intelligence models or machine learning systems without our express written permission. Our robots.txt file expresses our consent preferences, and accessing our site in violation of those directives is strictly prohibited.
Legal Limitations of Robots.txt
Despite its legal recognition, robots.txt has limitations:
- Not a law: Violating robots.txt is not automatically illegal everywhere
- Requires enforcement: You would need to sue violators to enforce your preferences
- International variation: Different countries have different laws regarding web scraping
- Existing data: Once data is scraped, it's difficult to force deletion
How to Monitor AI Crawler Activity
After implementing blocking rules, you should monitor your server logs to ensure AI crawlers are respecting your robots.txt.
Checking Server Logs
Search your access logs for AI crawler User-agent strings:
# Check for GPTBot
grep "GPTBot" /var/log/nginx/access.log
# Check for CCBot
grep "CCBot" /var/log/nginx/access.log
# Check for ClaudeBot
grep "ClaudeBot" /var/log/nginx/access.log
# Count requests by User-agent (the sixth double-quote-delimited field in Nginx's combined log format)
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
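For recurring checks, the same tallying can be sketched in Python. The example below assumes Nginx's default "combined" log format, where the User-agent is the last double-quoted field on each line. The AI_TOKENS list, the count_ai_hits helper, and the SAMPLE_LINES entries (which use documentation-reserved IP addresses) are made up for illustration and are not real traffic.

```python
import re
from collections import Counter

# Crawler tokens taken from the list earlier in this article.
AI_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot", "Bytespider"]

# In the combined format, the User-agent is the final double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(lines):
    """Count requests per AI crawler token, matching case-insensitively."""
    counts = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line does not look like a combined-format entry
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts


# Hypothetical log lines for demonstration only.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Oct/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [01/Oct/2024:10:00:07 +0000] "GET /about HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '198.51.100.9 - - [01/Oct/2024:10:01:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.44 - - [01/Oct/2024:10:02:00 +0000] "GET /blog HTTP/1.1" 200 2048 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

if __name__ == "__main__":
    for token, hits in count_ai_hits(SAMPLE_LINES).most_common():
        print(f"{hits:6d}  {token}")
```

To run it against a real log, pass an open file object instead of SAMPLE_LINES, for example `count_ai_hits(open("/var/log/nginx/access.log"))`.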
Tools for Monitoring AI Crawlers
- Cloudflare Bot Management: Identifies and categorizes AI crawlers
- Dark Visitors: A database of AI crawler User-agents with blocklists
- Custom Log Analysis: Set up alerts for specific AI User-agents
What to Do If Crawlers Ignore Your Robots.txt
If you find AI crawlers are still accessing your site despite robots.txt blocking:
- Verify your robots.txt is accessible: Check that https://yourdomain.com/robots.txt returns 200 OK
- Verify the User-agent matches: Some crawlers use slightly different strings (e.g., "GPTBot/1.0")
- Contact the provider: OpenAI, Google, and Anthropic have reporting mechanisms for ignored directives
- Implement server-level blocking: Use firewall rules or CDN settings to block the IP ranges
- Consider legal action: For persistent violators, consult an attorney
Server-Level Blocking Example (Nginx)
# Place inside the relevant server { } block
if ($http_user_agent ~* "GPTBot|CCBot|ClaudeBot|Anthropic-AI") {
    return 403;
}
This blocks the request entirely at the web server level, even before the request reaches your application.
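Before deploying a pattern like this, it can help to check which User-agent strings it would catch. The sketch below reproduces the alternation in Python, using re.IGNORECASE to mirror Nginx's case-insensitive ~* operator. It approximates the matching logic only (Nginx uses PCRE, though for a plain alternation like this the behavior should agree), and the would_block helper and sample strings are hypothetical.

```python
import re

# Same alternation as the Nginx rule; IGNORECASE mirrors the ~* operator.
BLOCKED_UA = re.compile(r"GPTBot|CCBot|ClaudeBot|Anthropic-AI", re.IGNORECASE)


def would_block(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the User-agent string."""
    return BLOCKED_UA.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
        "ccbot/2.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ]
    for ua in samples:
        print(would_block(ua), ua)
```

Because the match is on the User-agent header alone, it only stops crawlers that identify themselves honestly.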
Emerging AI Crawlers to Watch
The AI crawler landscape is changing rapidly. Here are emerging players to monitor:
Perplexity AI
User-agent: PerplexityBot (also respects robots.txt)
Perplexity is an AI answer engine that provides citations. Unlike ChatGPT, Perplexity prominently links back to sources and can drive significant referral traffic. Many publishers choose NOT to block Perplexity.
Cohere AI
User-agent: CohereBot
Cohere is an enterprise AI company that trains models for business applications. Less well-known than OpenAI but growing rapidly.
Midjourney (Image AI)
User-agent: MidjourneyBot
Trains on images. If you have original photography or artwork, you may want to block this crawler.
Diffbot (Web Data Extraction)
User-agent: Diffbot
Provides structured web data for AI applications. Some publishers block, others see it as a legitimate API-like service.
How to Stay Updated
- Bookmark Dark Visitors (darkvisitors.com) for an up-to-date AI crawler database
- Monitor Reddit r/SEO and Hacker News for discussions of new crawlers
- Review your server logs weekly for unfamiliar User-agent strings
- Set up alerts for new User-agents with moderate request volumes
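The last two points can be approximated with a small script that flags User-agent strings that match none of your known tokens but appear often enough to matter. This is a minimal sketch: the unfamiliar_agents helper, the min_hits threshold of 50, and the "ExampleNewBot" string are all hypothetical, and a sensible threshold depends on your own traffic.

```python
from collections import Counter


def unfamiliar_agents(user_agents, known_tokens, min_hits=50):
    """Return {user_agent: hits} for agents seen at least `min_hits` times
    whose string contains none of the known tokens (case-insensitive)."""
    counts = Counter(user_agents)
    known = [token.lower() for token in known_tokens]
    return {
        ua: hits
        for ua, hits in counts.items()
        if hits >= min_hits and not any(token in ua.lower() for token in known)
    }


if __name__ == "__main__":
    # Hypothetical week of User-agent values extracted from access logs.
    observed = (
        ["GPTBot/1.0"] * 120
        + ["ExampleNewBot/0.1"] * 75
        + ["curl/8.0"] * 3
    )
    known = ["GPTBot", "CCBot", "ClaudeBot", "Googlebot"]
    print(unfamiliar_agents(observed, known))
```

Anything the script reports is a candidate for manual research, not an automatic block.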
Template for Regular Review
Create a monthly task to review and update your AI crawler blocklist:
- Check Dark Visitors for newly identified AI crawlers
- Search your server logs for the previous month's new User-agents
- Research any unfamiliar User-agents that appear in volume
- Update your robots.txt and server-level blocking rules
- Test the updated configuration
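To keep the monthly update mechanical, one option is to maintain the blocklist as a plain list of tokens and regenerate the robots.txt groups from it. The sketch below assumes you want a blanket "Disallow: /" for every token; the build_robots_txt helper and the BLOCKLIST contents are illustrative choices, not a standard tool.

```python
def build_robots_txt(blocklist):
    """Render one 'User-agent / Disallow: /' group per token.

    Tokens are kept in their given order; case-insensitive duplicates
    are dropped so the same crawler is not listed twice.
    """
    seen = set()
    groups = []
    for token in blocklist:
        key = token.lower()
        if key in seen:
            continue
        seen.add(key)
        groups.append(f"User-agent: {token}\nDisallow: /\n")
    # A blank line between groups keeps the file readable.
    return "\n".join(groups)


# Example blocklist (assumption: extend this as new crawlers are identified).
BLOCKLIST = ["GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot", "gptbot"]

if __name__ == "__main__":
    print(build_robots_txt(BLOCKLIST))
```

The generated text covers only the AI-crawler section, so append it to any existing rules for search engines rather than replacing the whole file.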
