AI & SEO · 11 min read · August 30, 2024

How to Block AI Scrapers (GPTBot, CCBot, Google-Extended) Using Robots.txt

Sarah Jenkins

Content Strategist

The Major AI Crawlers: Know Your Bots

To block AI bots, you need to know their User-agent strings. Here are the most common ones currently scouring the web for training data, along with their purposes and documentation sources.

OpenAI Crawlers

GPTBot: OpenAI's primary web crawler used to collect data for training GPT-4, GPT-5, and future models. It respects robots.txt and identifies itself clearly.

User-agent: GPTBot
Documentation: OpenAI's official GPTBot documentation
Default behavior: Respects robots.txt; can be blocked individually

ChatGPT-User: This bot is different from GPTBot. It's activated when a ChatGPT Plus user uses the "Browse with Bing" feature to ask ChatGPT to read and summarize a specific URL. Blocking this bot prevents real-time summarization of your content in ChatGPT responses.

User-agent: ChatGPT-User
Purpose: Real-time content access for ChatGPT users
Note: Blocking this does NOT affect GPTBot

Google AI Crawlers

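Google-Extended: Announced in September 2023, this is a robots.txt control token rather than a separate crawler. Google fetches pages with its existing user agents and uses the Google-Extended token only to decide whether that content may be used for AI training. Because it has no user-agent string of its own, it will not show up in your server logs. Critically, blocking Google-Extended does NOT affect your inclusion in Google Search results, which are governed by your Googlebot rules.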

User-agent: Google-Extended
Purpose: Training Google's Vertex AI, Gemini, and other AI models
Key distinction: Completely separate from search indexing

Anthropic Crawlers

ClaudeBot: The crawler for Anthropic (creators of Claude), used to collect training data for Claude models. Older documentation and blocklists refer to it by the earlier token Anthropic-AI.

User-agent: ClaudeBot (formerly Anthropic-AI)
Documentation: Anthropic's official crawler documentation
Status: Actively crawling the web as of 2024

Common Crawl (AI Training Dataset)

CCBot: Common Crawl is a non-profit that crawls the web and provides free datasets. These datasets are used by almost every major AI company (OpenAI, Meta, Google, Anthropic) as training data.

User-agent: CCBot
Purpose: Building open web datasets for research and AI training
Impact: Even if you block GPTBot, your content might still end up in AI models if CCBot crawls and includes your site in their dataset

Meta (Facebook) AI Crawlers

Meta-ExternalAgent: Meta's crawler for AI training, including LLaMA models.

User-agent: facebookexternalhit (for link previews) and Meta-ExternalAgent (for AI training)
Purpose: Training Meta's AI models
Note: Some Meta crawlers are used for social media link previews—blocking those also blocks previews on Facebook

Amazon AI Crawlers

Amazonbot: Amazon's crawler for general web discovery. Increasingly used for AI and product intelligence.

User-agent: Amazonbot
Documentation: Amazon's official bot documentation
Respects robots.txt: Yes

📚 Keep This List Updated: New AI crawlers emerge regularly. Monitor tech news and your server logs to identify new User-agents that may need blocking.

How to Block All Major AI Crawlers

If you've decided to opt out of AI training data collection, here's a comprehensive robots.txt configuration to block the major crawlers.

Complete AI Blocking Configuration

# ============================================
# AI TRAINING CRAWLERS - COMPLETE BLOCKING CONFIGURATION
# Last updated: October 2024
# ============================================

# OpenAI - Primary training crawler
User-agent: GPTBot
Disallow: /

# OpenAI - Real-time ChatGPT user requests
User-agent: ChatGPT-User
Disallow: /

# Google - AI training (does NOT affect search ranking)
User-agent: Google-Extended
Disallow: /

# Anthropic (Claude AI) - Training crawler
User-agent: ClaudeBot
Disallow: /

# Anthropic (older User-agent - include for compatibility)
User-agent: Anthropic-AI
Disallow: /

# Common Crawl - Provides datasets used by multiple AI companies
User-agent: CCBot
Disallow: /

# Meta (Facebook) - AI training crawler
User-agent: Meta-ExternalAgent
Disallow: /

# Amazon - AI and product crawler
User-agent: Amazonbot
Disallow: /

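# Apple - AI training opt-out token (optional)
# Plain "Applebot" powers Siri and Spotlight search; blocking it removes you from those results
User-agent: Applebot-Extended
Disallow: /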

# ByteDance (TikTok) - AI crawler
User-agent: Bytespider
Disallow: /

# ============================================
# NOTE: The following bots have legitimate uses
# Block only if you're sure
# ============================================

# Facebook link previews (blocking breaks social sharing)
# User-agent: facebookexternalhit
# Disallow: /

# Twitter/X card previews (blocking breaks social sharing)
# User-agent: Twitterbot
# Disallow: /

# LinkedIn previews (blocking breaks social sharing)
# User-agent: LinkedInBot
# Disallow: /

Explanation of Each Block

GPTBot & ChatGPT-User: Blocks both training and real-time access
Google-Extended: Optional but recommended if you want to opt out of Google's AI training
ClaudeBot/Anthropic-AI: Blocks Anthropic's crawler (use both for compatibility)
CCBot: Critical if you want to avoid your content appearing in Common Crawl datasets
Meta-ExternalAgent: Blocks Meta's AI training (separate from social previews)
Amazonbot & Bytespider: Blocks e-commerce and TikTok AI crawlers
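
One way to sanity-check rules like these before deploying them is to parse them with Python's standard-library urllib.robotparser and ask whether a given crawler may fetch a URL. The sketch below is a minimal check under stated assumptions, not a full validator: the example.com URL and the abbreviated rule set are placeholders, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the blocking rules (placeholder subset for illustration).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real deployment.
    for bot in ("GPTBot", "CCBot", "ClaudeBot", "Googlebot"):
        allowed = is_allowed(ROBOTS_TXT, bot, "https://example.com/article")
        print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Because the rule set has no wildcard group, any crawler not named in it (Googlebot here) falls through to the default of being allowed.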

Less Restrictive Alternative: Rate Limiting Instead of Blocking

If you don't want to completely block AI crawlers but want to reduce their server impact:

# Allow but throttle AI crawlers
User-agent: GPTBot
Crawl-delay: 10
# (Not allowing or disallowing - default is allow)

User-agent: CCBot
Crawl-delay: 5

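This tells crawlers they can access your site, but only at a slow rate; the value is typically interpreted as the number of seconds to wait between requests. Note: Crawl-delay is a non-standard directive, and many crawlers (including Googlebot) ignore it, so treat it as a request rather than a guarantee.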
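
To see how a compliant client might read these throttling rules, the sketch below parses them with Python's standard-library urllib.robotparser and looks up the delay for each crawler. It only shows what the file requests; whether a particular crawler honors the value is up to that crawler. The helper name crawl_delays is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The throttling rules from the example above.
THROTTLE_RULES = """\
User-agent: GPTBot
Crawl-delay: 10

User-agent: CCBot
Crawl-delay: 5
"""


def crawl_delays(robots_txt, user_agents):
    """Map each user agent to its Crawl-delay value, or None if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.crawl_delay(agent) for agent in user_agents}


if __name__ == "__main__":
    for agent, delay in crawl_delays(THROTTLE_RULES, ["GPTBot", "CCBot", "Googlebot"]).items():
        print(f"{agent}: {delay}")
```

A crawler with no matching group gets None, meaning the file expresses no rate preference for it.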

⚠️ Important Limitation: Robots.txt blocking only affects future crawling. If your content has already been crawled and included in AI training datasets, blocking the crawler now won't remove existing data. You would need to contact each AI company directly to request data removal, a process that is often difficult and not guaranteed.

Should You Block AI Crawlers? The Pros and Cons

Before you rush to block every AI bot, consider these trade-offs. The decision depends on your business model, content strategy, and values.

Pros of Blocking AI Crawlers

Protect Intellectual Property: Prevent LLMs from reproducing your unique writing, research, analysis, or creative work without attribution or compensation.
Maintain Competitive Advantage: If your content is proprietary (research, data analysis, industry insights), giving it to AI models for free could train your competitors' AI tools.
Save Server Resources: AI crawlers can be aggressive. Blocking them saves bandwidth and server CPU. Some publishers report AI crawlers consuming 2-5% of their bandwidth.
Legal and Ethical Concerns: Many content creators object to having their work used to train commercial AI models without permission or payment. Blocking is a form of opting out.

Cons of Blocking AI Crawlers

Zero Visibility in AI Chatbots: If you block ChatGPT-User and a ChatGPT user asks "Summarize the latest article from [Your Website]," ChatGPT will respond that it cannot access the site. You lose potential brand exposure in the AI ecosystem.
Missed Referral Traffic: Some AI chatbots (Perplexity, You.com) provide citations and links back to sources. Blocking crawlers prevents these citations and potential referral traffic.
The Data Might Already Be Gone: If your site has been online for years, your content is likely already in Common Crawl datasets and baked into existing models (GPT-4, Claude 2, etc.). Blocking now only prevents future data collection.
Honor System Only: Robots.txt is a voluntary standard. Many aggressive AI scrapers, especially those from less reputable companies or individuals, ignore robots.txt entirely.

Decision Matrix: Should You Block?

Your Site Type	Recommendation	Reasoning
News/Journalism	Likely Block	Original reporting is valuable IP. Many news orgs are negotiating (or suing) over AI training.
Educational/Reference	Consider Carefully	AI chatbots could drive referral traffic from students, but may also summarize without linking.
E-commerce/Product	Likely Allow	Product descriptions are less unique; AI referrals could drive sales.
Personal Blog	Personal Choice	Depends on whether you want your writing to influence AI (allow) or keep it human-only (block).
Corporate/Brand Site	Likely Block	Brand voice and messaging are proprietary. You don't want AI mimicking your brand without control.

💡 Hybrid Approach: Consider blocking training crawlers (GPTBot) but allowing real-time crawlers (ChatGPT-User). This prevents your content from being permanently absorbed into models while still allowing potential referral traffic from AI chatbot users.
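
The hybrid policy can be sketched as a small generator that emits a blocking group only for the crawlers you name and leaves everything else at the default of allowed. This is an illustrative sketch: build_robots_txt is a hypothetical helper, example.com is a placeholder, and the check uses Python's standard-library parser rather than any crawler's own implementation.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents):
    """Render one 'Disallow: /' group per blocked user-agent token."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    return "\n".join(groups)


# Assumed policy: block the training crawler, leave the real-time fetcher alone.
hybrid = build_robots_txt(["GPTBot"])

parser = RobotFileParser()
parser.parse(hybrid.splitlines())

if __name__ == "__main__":
    print(hybrid)
    for agent in ("GPTBot", "ChatGPT-User"):
        print(agent, parser.can_fetch(agent, "https://example.com/article"))
```

Adding CCBot or Google-Extended to the list extends the block to other training sources without touching ChatGPT-User.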

Legal Aspects: Is Blocking Enough?

Understanding the legal landscape around AI crawling is essential for publishers.

Robots.txt as a Legal Signal

Courts have at times treated robots.txt as evidence of a website owner's access preferences. In eBay v. Bidder's Edge (2000), a U.S. federal court granted eBay a preliminary injunction against an automated scraper on a trespass to chattels theory (unauthorized use of computer systems), and eBay's robot exclusion notices were part of the record.

Several class-action lawsuits (including against OpenAI and Meta) argue that ignoring robots.txt to scrape content for AI training amounts to:

Breach of Terms of Service
Copyright infringement (for reproducing content in training data)
Computer Fraud and Abuse Act violations

Terms of Service vs. Robots.txt

Robots.txt is machine-readable; Terms of Service are human-readable. For stronger legal protection:

Include explicit prohibitions on AI training in your website's Terms of Service
Reference your robots.txt in your terms
Use both together for layered protection

Sample Terms of Service Clause

AI Training and Web Scraping
You may not use any automated system, including without limitation "robots," "spiders," or "offline readers," to access our website for the purpose of training artificial intelligence models or machine learning systems without our express written permission. Our robots.txt file expresses our consent preferences, and accessing our site in violation of those directives is strictly prohibited.

Legal Limitations of Robots.txt

Despite its legal recognition, robots.txt has limitations:

Not a law: Violating robots.txt is not automatically illegal everywhere
Requires enforcement: You would need to sue violators to enforce your preferences
International variation: Different countries have different laws regarding web scraping
Existing data: Once data is scraped, it's difficult to force deletion

⚖️ Disclaimer: This is not legal advice. Laws regarding AI training and web scraping are rapidly evolving. Consult with an attorney for advice specific to your situation and jurisdiction.

How to Monitor AI Crawler Activity

After implementing blocking rules, you should monitor your server logs to ensure AI crawlers are respecting your robots.txt.

Checking Server Logs

Search your access logs for AI crawler User-agent strings:

# Check for GPTBot
grep "GPTBot" /var/log/nginx/access.log

# Check for CCBot
grep "CCBot" /var/log/nginx/access.log

# Check for ClaudeBot
grep "ClaudeBot" /var/log/nginx/access.log

# Count requests by User-agent
# (splitting on double quotes, field 6 is the full User-agent in the combined log format)
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
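
The same check can be scripted so that all AI crawlers are tallied in one pass. The sketch below assumes the combined log format, where the User-agent is the last double-quoted field on each line; the sample log lines are made up for illustration, and the token list is simply the crawlers named in this article.

```python
import re
from collections import Counter

# Tokens for crawlers discussed in this article; extend as needed.
AI_TOKENS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")

# Assumes the combined log format: the User-agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines):
    """Count requests per AI crawler token found in the User-agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line does not look like a combined-format entry
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


# Made-up sample entries (documentation IP ranges, simplified User-agents).
SAMPLE = [
    '203.0.113.5 - - [30/Aug/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [30/Aug/2024:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "CCBot/2.0"',
    '198.51.100.7 - - [30/Aug/2024:10:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

if __name__ == "__main__":
    for token, count in count_ai_hits(SAMPLE).most_common():
        print(f"{token}: {count}")
```

To run it against a real log, pass an open file object instead of SAMPLE, for example `count_ai_hits(open("/var/log/nginx/access.log"))`.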

Tools for Monitoring AI Crawlers

Cloudflare Bot Management: Identifies and categorizes AI crawlers
Dark Visitors: A database of AI crawler User-agents with blocklists
Custom Log Analysis: Set up alerts for specific AI User-agents

What to Do If Crawlers Ignore Your Robots.txt

If you find AI crawlers are still accessing your site despite robots.txt blocking:

Verify your robots.txt is accessible: Check https://yourdomain.com/robots.txt returns 200 OK
Verify the User-agent matches: Some crawlers use slightly different strings (e.g., "GPTBot/1.0")
Contact the provider: OpenAI, Google, and Anthropic have reporting mechanisms for ignored directives
Implement server-level blocking: Use firewall rules or CDN settings to block the IP ranges
Consider legal action: For persistent violators, consult an attorney

Server-Level Blocking Example (Nginx)

if ($http_user_agent ~* "GPTBot|CCBot|ClaudeBot|Anthropic-AI") {
    return 403;
}

This blocks the request entirely at the web server level, even before the request reaches your application.
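
Before deploying a pattern like this, it can help to check which User-agent strings it would catch. The sketch below reuses the same alternation in Python, with re.IGNORECASE standing in for Nginx's case-insensitive ~* operator. It approximates the matching only (Nginx uses PCRE, and the two engines agree on a plain alternation like this one); the sample User-agent strings and the helper name would_block are made up for illustration.

```python
import re

# Same alternation as the Nginx rule; IGNORECASE mirrors the ~* operator.
BLOCK_PATTERN = re.compile(r"GPTBot|CCBot|ClaudeBot|Anthropic-AI", re.IGNORECASE)


def would_block(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the User-agent string."""
    return BLOCK_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "ccbot/2.0",
        "Mozilla/5.0 (compatible; Googlebot/2.1)",
    ]
    for ua in samples:
        print(f"{would_block(ua)!s:5}  {ua}")
```

Because the match is a substring search, any User-agent that merely contains one of these tokens is blocked too, so keep the tokens specific.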

Emerging AI Crawlers to Watch

The AI crawler landscape is changing rapidly. Here are emerging players to monitor:

Perplexity AI

User-agent: PerplexityBot (also respects robots.txt)

Perplexity is an AI answer engine that provides citations. Unlike ChatGPT, Perplexity prominently links back to sources and can drive significant referral traffic. Many publishers choose NOT to block Perplexity.

Cohere AI

User-agent: CohereBot

Cohere is an enterprise AI company that trains models for business applications. Less well-known than OpenAI but growing rapidly.

Midjourney (Image AI)

User-agent: MidjourneyBot

Trains on images. If you have original photography or artwork, you may want to block this crawler.

Diffbot (Web Data Extraction)

User-agent: Diffbot

Provides structured web data for AI applications. Some publishers block, others see it as a legitimate API-like service.

How to Stay Updated

Bookmark Dark Visitors (darkvisitors.com) for an up-to-date AI crawler database
Monitor Reddit r/SEO and Hacker News for discussions of new crawlers
Review your server logs weekly for unfamiliar User-agent strings
Set up alerts for new User-agents with moderate request volumes

Template for Regular Review

Create a monthly task to review and update your AI crawler blocklist:

Check Dark Visitors for newly identified AI crawlers
Search your server logs for the previous month's new User-agents
Research any unfamiliar User-agents that appear in volume
Update your robots.txt and server-level blocking rules
Test the updated configuration
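
The log-comparison step of this review can be sketched as a set difference with a volume threshold: keep only the User-agents that were absent last month and now appear often enough to matter. The function name, the sample data, and the default threshold of 100 requests are all assumptions for illustration; pick a threshold that suits your traffic.

```python
from collections import Counter


def new_user_agents(previous_agents, current_counts, min_requests=100):
    """Return (agent, count) pairs not seen before, busiest first.

    previous_agents: set of User-agent strings seen in earlier months.
    current_counts: mapping of User-agent string to request count this month.
    min_requests: assumed volume threshold below which agents are ignored.
    """
    fresh = {
        agent: count
        for agent, count in current_counts.items()
        if agent not in previous_agents and count >= min_requests
    }
    return sorted(fresh.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up data: ExampleBot and RareBot are hypothetical crawlers.
    known = {"Mozilla/5.0 (X11)", "GPTBot/1.0"}
    this_month = Counter({"GPTBot/1.0": 900, "ExampleBot/0.1": 450, "RareBot/0.1": 3})
    for agent, count in new_user_agents(known, this_month):
        print(f"{count:6d}  {agent}")
```

Anything this returns is a candidate for the research step, not an automatic addition to the blocklist.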

🏆 Final Takeaway: AI crawlers represent a new frontier in web governance. Unlike traditional search engine crawlers, which benefit publishers through referral traffic, AI training crawlers often take content with little direct benefit to the publisher. A thoughtful strategy that balances protection of IP against potential future AI referral traffic is essential. Update your robots.txt today with the configuration above, then monitor and adapt as the landscape evolves.

Sarah Jenkins

Content Strategist

Sarah is a seasoned digital publisher and brand storyteller specializing in content protection, metadata standards, and intellectual rights management.
