How to Block AI Scrapers (GPTBot, CCBot, Google-Extended) Using Robots.txt
AI & SEO · 11 min read · August 30, 2024

Sarah Jenkins
Content Strategist

The Major AI Crawlers: Know Your Bots

To block AI bots, you need to know their User-agent strings. Here are the most common ones currently scouring the web for training data, along with their purposes and documentation sources.

OpenAI Crawlers

GPTBot: OpenAI's primary web crawler, used to collect content that may be used to train its generative AI models. It identifies itself clearly, and OpenAI states that it respects robots.txt.

  • User-agent: GPTBot
  • Documentation: OpenAI's official GPTBot documentation
  • Default behavior: Respects robots.txt; can be blocked individually

ChatGPT-User: This bot is different from GPTBot. It does not crawl for training data; it fetches a page when a ChatGPT user asks ChatGPT to read or summarize a specific URL (originally through the "Browse with Bing" feature for ChatGPT Plus). Blocking this bot prevents real-time summarization of your content in ChatGPT responses.

  • User-agent: ChatGPT-User
  • Purpose: Real-time content access for ChatGPT users
  • Note: Blocking this does NOT affect GPTBot

Google AI Crawlers

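Google-Extended: Announced in September 2023, this is Google's robots.txt control for AI training. It is a product token rather than a separate crawler: Google's existing crawlers fetch the page, and the Google-Extended rule tells Google whether that content may be used for AI training. Because of this, the string will not show up as a User-agent in your server logs. Critically, blocking Google-Extended does NOT affect your inclusion in Google Search results, which depend on Googlebot.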

  • User-agent: Google-Extended
  • Purpose: Training Google's Vertex AI, Gemini, and other AI models
  • Key distinction: Completely separate from search indexing

Anthropic Crawlers

ClaudeBot: The crawler for Anthropic (creators of Claude), used to collect training data for Claude models. Older robots.txt examples refer to it as Anthropic-AI.

  • User-agent: ClaudeBot (formerly Anthropic-AI)
  • Documentation: Anthropic's official crawler documentation
  • Status: Actively crawling the web as of 2024

Common Crawl (AI Training Dataset)

CCBot: Common Crawl is a non-profit that crawls the web and publishes free datasets. These datasets are widely used as training data for large language models, including models from major AI companies.

  • User-agent: CCBot
  • Purpose: Building open web datasets for research and AI training
  • Impact: Even if you block GPTBot, your content might still end up in AI models if CCBot crawls and includes your site in their dataset

Meta (Facebook) AI Crawlers

Meta-ExternalAgent: Meta's crawler for AI training, including LLaMA models.

  • User-agent: facebookexternalhit (for link previews) and Meta-ExternalAgent (for AI training)
  • Purpose: Training Meta's AI models
  • Note: Some Meta crawlers are used for social media link previews—blocking those also blocks previews on Facebook

Amazon AI Crawlers

Amazonbot: Amazon's crawler for general web discovery. Increasingly used for AI and product intelligence.

  • User-agent: Amazonbot
  • Documentation: Amazon's official bot documentation
  • Respects robots.txt: Yes
📚 Keep This List Updated: New AI crawlers emerge regularly. Monitor tech news and your server logs to identify new User-agents that may need blocking.

How to Block All Major AI Crawlers

If you've decided to opt out of AI training data collection, here's a comprehensive robots.txt configuration to block the major crawlers.

Complete AI Blocking Configuration

# ============================================
# AI TRAINING CRAWLERS - COMPLETE BLOCKING CONFIGURATION
# Last updated: October 2024
# ============================================

# OpenAI - Primary training crawler
User-agent: GPTBot
Disallow: /

# OpenAI - Real-time ChatGPT user requests
User-agent: ChatGPT-User
Disallow: /

# Google - AI training (does NOT affect search ranking)
User-agent: Google-Extended
Disallow: /

# Anthropic (Claude AI) - Training crawler
User-agent: ClaudeBot
Disallow: /

# Anthropic (older User-agent - include for compatibility)
User-agent: Anthropic-AI
Disallow: /

# Common Crawl - Provides datasets used by multiple AI companies
User-agent: CCBot
Disallow: /

# Meta (Facebook) - AI training crawler
User-agent: Meta-ExternalAgent
Disallow: /

# Amazon - AI and product crawler
User-agent: Amazonbot
Disallow: /

# Apple - AI training opt-out (optional)
# Use Applebot-Extended here: plain Applebot powers Siri and Spotlight search results
User-agent: Applebot-Extended
Disallow: /

# ByteDance (TikTok) - AI crawler
User-agent: Bytespider
Disallow: /

# ============================================
# NOTE: The following bots have legitimate uses
# Block only if you're sure
# ============================================

# Facebook link previews (blocking breaks social sharing)
# User-agent: facebookexternalhit
# Disallow: /

# Twitter/X card previews (blocking breaks social sharing)
# User-agent: Twitterbot
# Disallow: /

# LinkedIn previews (blocking breaks social sharing)
# User-agent: LinkedInBot
# Disallow: /
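
One way to check a configuration like the one above before deploying it is to parse it locally. The sketch below uses Python's standard-library urllib.robotparser. The abbreviated ROBOTS_TXT string, the is_allowed helper, and the example.com URL are illustrative assumptions, and a real crawler's parser may differ from Python's in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the rules above; paste your full robots.txt here.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

PAGE = "https://example.com/articles/some-post"  # placeholder URL


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot is included to confirm that search crawling stays unaffected.
for agent in ("GPTBot", "CCBot", "Google-Extended", "Googlebot"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, PAGE) else "blocked"
    print(f"{agent}: {verdict}")
```

Crawlers that have no matching group, such as Googlebot in this sketch, fall through to the default of being allowed.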

Explanation of Each Block

  • GPTBot & ChatGPT-User: Blocks both training and real-time access
  • Google-Extended: Optional but recommended if you want to opt out of Google's AI training
  • ClaudeBot/Anthropic-AI: Blocks Anthropic's crawler (use both for compatibility)
  • CCBot: Critical if you want to avoid your content appearing in Common Crawl datasets
  • Meta-ExternalAgent: Blocks Meta's AI training (separate from social previews)
  • Amazonbot & Bytespider: Blocks e-commerce and TikTok AI crawlers

Less Restrictive Alternative: Rate Limiting Instead of Blocking

If you don't want to completely block AI crawlers but want to reduce their server impact:

# Allow but throttle AI crawlers
User-agent: GPTBot
Crawl-delay: 10
# (Not allowing or disallowing - default is allow)

User-agent: CCBot
Crawl-delay: 5

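This tells crawlers they can access your site, but only at a slow rate; the value is the number of seconds to wait between requests. Note: Crawl-delay is a non-standard directive and not all AI crawlers support it, so treat it as a request rather than a guarantee.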
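
To confirm how a parser reads these throttling rules, a minimal sketch with Python's standard-library urllib.robotparser is shown below. The THROTTLE_RULES string mirrors the snippet above, and the crawl_delay_for helper is a hypothetical name introduced for illustration. It shows only what a compliant parser would see, not what any given crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the throttling snippet above.
THROTTLE_RULES = """\
User-agent: GPTBot
Crawl-delay: 10

User-agent: CCBot
Crawl-delay: 5
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (seconds) for user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


for agent in ("GPTBot", "CCBot", "Googlebot"):
    print(agent, crawl_delay_for(THROTTLE_RULES, agent))
```

A result of None means no delay was declared for that crawler, which is the case for any User-agent not listed in the file.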

⚠️ Important Limitation: Robots.txt blocking only affects future crawling. If your content has already been crawled and included in AI training datasets, blocking the crawler now won't remove existing data. You would need to contact each AI company directly to request data removal, a process that is often difficult and not guaranteed.

Should You Block AI Crawlers? The Pros and Cons

Before you rush to block every AI bot, consider these trade-offs. The decision depends on your business model, content strategy, and values.

Pros of Blocking AI Crawlers

  • Protect Intellectual Property: Prevent LLMs from reproducing your unique writing, research, analysis, or creative work without attribution or compensation.
  • Maintain Competitive Advantage: If your content is proprietary (research, data analysis, industry insights), giving it to AI models for free could train your competitors' AI tools.
  • Save Server Resources: AI crawlers can be aggressive. Blocking them saves bandwidth and server CPU. Some publishers report AI crawlers consuming 2-5% of their bandwidth.
  • Legal and Ethical Concerns: Many content creators object to having their work used to train commercial AI models without permission or payment. Blocking is a form of opting out.

Cons of Blocking AI Crawlers

  • Zero Visibility in AI Chatbots: If you block ChatGPT-User and a ChatGPT user asks "Summarize the latest article from [Your Website]," ChatGPT will respond that it cannot access the site. You lose potential brand exposure in the AI ecosystem.
  • Missed Referral Traffic: Some AI chatbots (Perplexity, You.com) provide citations and links back to sources. Blocking crawlers prevents these citations and potential referral traffic.
  • Your Data May Already Be Collected: If your site has been online for years, your content is likely already in Common Crawl datasets and baked into existing models (GPT-4, Claude 2, etc.). Blocking now only prevents future data collection.
  • Honor System Only: Robots.txt is a voluntary standard. Many aggressive AI scrapers, especially those from less reputable companies or individuals, ignore robots.txt entirely.

Decision Matrix: Should You Block?

Your Site Type | Recommendation | Reasoning
News/Journalism | Likely Block | Original reporting is valuable IP. Many news orgs are negotiating (or suing) over AI training.
Educational/Reference | Consider Carefully | AI chatbots could drive referral traffic from students, but may also summarize without linking.
E-commerce/Product | Likely Allow | Product descriptions are less unique; AI referrals could drive sales.
Personal Blog | Personal Choice | Depends on whether you want your writing to influence AI (allow) or keep it human-only (block).
Corporate/Brand Site | Likely Block | Brand voice and messaging are proprietary. You don't want AI mimicking your brand without control.
💡 Hybrid Approach: Consider blocking training crawlers (GPTBot) but allowing real-time crawlers (ChatGPT-User). This prevents your content from being permanently absorbed into models while still allowing potential referral traffic from AI chatbot users.
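
A hybrid policy of this kind can be generated and checked from a short list of crawler tokens. The following is a minimal sketch, assuming you want "Disallow: /" for training crawlers and an explicit "Allow: /" for real-time ones. The build_robots_txt and can_fetch helpers and the example.com URL are hypothetical names introduced for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents, allowed_agents=()):
    """Render one robots.txt group per crawler token."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allowed_agents]
    return "\n\n".join(groups) + "\n"


def can_fetch(robots_txt, user_agent, url="https://example.com/post"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Block training crawlers, keep real-time user requests working.
HYBRID_RULES = build_robots_txt(
    blocked_agents=["GPTBot", "CCBot"],
    allowed_agents=["ChatGPT-User"],
)
print(HYBRID_RULES)
```

Keeping the tokens in two lists makes it easier to move a crawler from one group to the other if your policy changes.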

How to Monitor AI Crawler Activity

After implementing blocking rules, you should monitor your server logs to ensure AI crawlers are respecting your robots.txt.

Checking Server Logs

Search your access logs for AI crawler User-agent strings:

# Check for GPTBot
grep "GPTBot" /var/log/nginx/access.log

# Check for CCBot
grep "CCBot" /var/log/nginx/access.log

# Check for ClaudeBot
grep "ClaudeBot" /var/log/nginx/access.log

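# Count requests by User-agent
# (in the combined log format the User-agent is the sixth field when splitting on double quotes)
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20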
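
The same check can be scripted so that all crawlers are counted in one pass. This is a minimal sketch assuming the nginx combined log format, where the User-agent is the last double-quoted field. The AI_TOKENS list, the count_ai_hits helper, and the sample log lines are made up for illustration. Google-Extended is left out because it does not appear as a User-agent in logs.

```python
import re
from collections import Counter

AI_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot",
             "Meta-ExternalAgent", "Amazonbot", "Bytespider"]

# In the combined log format the User-agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines, tokens=AI_TOKENS):
    """Count requests per AI crawler token found in the User-agent field."""
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in tokens:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


# Made-up sample lines; in practice pass open("/var/log/nginx/access.log").
SAMPLE_LINES = [
    '203.0.113.7 - - [30/Aug/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '203.0.113.8 - - [30/Aug/2024:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '203.0.113.9 - - [30/Aug/2024:10:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
    '203.0.113.7 - - [30/Aug/2024:10:00:03 +0000] "GET /d HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
]

for token, hits in count_ai_hits(SAMPLE_LINES).most_common():
    print(f"{hits:6d}  {token}")
```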

Tools for Monitoring AI Crawlers

  • Cloudflare Bot Management: Identifies and categorizes AI crawlers
  • Dark Visitors: A database of AI crawler User-agents with blocklists
  • Custom Log Analysis: Set up alerts for specific AI User-agents

What to Do If Crawlers Ignore Your Robots.txt

If you find AI crawlers are still accessing your site despite robots.txt blocking:

  1. Verify your robots.txt is accessible: Check https://yourdomain.com/robots.txt returns 200 OK
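  2. Verify the User-agent matches: robots.txt rules match on the crawler's product token (e.g., "GPTBot"), not the full header value (e.g., "GPTBot/1.0"), so check that the token in your rule is spelled the way the crawler's documentation gives it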
  3. Contact the provider: OpenAI, Google, and Anthropic have reporting mechanisms for ignored directives
  4. Implement server-level blocking: Use firewall rules or CDN settings to block the IP ranges
  5. Consider legal action: For persistent violators, consult an attorney
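
Because a User-agent header is easy to spoof, step 4 usually works from IP addresses. Some operators publish the ranges their crawlers use, and a request can be checked against them. The sketch below assumes you have copied such ranges into a dictionary. The PUBLISHED_RANGES values are placeholder documentation ranges, NOT real crawler addresses, and is_genuine_crawler is a hypothetical helper name.

```python
import ipaddress

# Placeholder CIDR blocks reserved for documentation, NOT real crawler ranges.
# Replace them with the ranges each operator publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_genuine_crawler(claimed_agent, remote_ip, published=PUBLISHED_RANGES):
    """Return True if remote_ip lies in a range published for claimed_agent."""
    address = ipaddress.ip_address(remote_ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in published.get(claimed_agent, [])
    )


print(is_genuine_crawler("GPTBot", "192.0.2.15"))
print(is_genuine_crawler("GPTBot", "203.0.113.9"))
```

A request that claims a crawler's name from outside its published ranges is likely an impersonator, which robots.txt will not stop and which calls for server-level blocking.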

Server-Level Blocking Example (Nginx)

# Place inside the relevant server { } block; ~* makes the match case-insensitive
if ($http_user_agent ~* "GPTBot|CCBot|ClaudeBot|Anthropic-AI") {
    return 403;
}

This blocks the request entirely at the web server level, even before the request reaches your application.
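
Before deploying a pattern like this, it can help to confirm that it catches the crawlers you intend and leaves ordinary browsers alone. The sketch below reproduces the same alternation with Python's re module. For a simple pattern like this the behavior should match nginx's case-insensitive matching, though that is an assumption worth verifying on your own server. The would_block helper and the sample User-agent strings are illustrative.

```python
import re

# Same alternation as the nginx rule; IGNORECASE mirrors the ~* operator.
BLOCK_PATTERN = re.compile(r"GPTBot|CCBot|ClaudeBot|Anthropic-AI", re.IGNORECASE)


def would_block(user_agent: str) -> bool:
    """Return True if the User-agent string matches the block pattern."""
    return BLOCK_PATTERN.search(user_agent) is not None


SAMPLES = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ccbot/2.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/126.0 Safari/537.36",
]

for ua in SAMPLES:
    print("403" if would_block(ua) else "200", ua)
```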

Emerging AI Crawlers to Watch

The AI crawler landscape is changing rapidly. Here are emerging players to monitor:

Perplexity AI

User-agent: PerplexityBot (Perplexity states that it respects robots.txt)

Perplexity is an AI answer engine that provides citations. Unlike ChatGPT, Perplexity prominently links back to sources and can drive significant referral traffic. Many publishers choose NOT to block Perplexity.

Cohere AI

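User-agent: CohereBot (public blocklists also list Cohere's crawler under the token cohere-ai, so confirm the exact string in your server logs)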

Cohere is an enterprise AI company that trains models for business applications. Less well-known than OpenAI but growing rapidly.

Midjourney (Image AI)

User-agent: MidjourneyBot

Trains on images. If you have original photography or artwork, you may want to block this crawler.

Diffbot (Web Data Extraction)

User-agent: Diffbot

Provides structured web data for AI applications. Some publishers block, others see it as a legitimate API-like service.

How to Stay Updated

  • Bookmark Dark Visitors (darkvisitors.com) for an up-to-date AI crawler database
  • Monitor Reddit r/SEO and Hacker News for discussions of new crawlers
  • Review your server logs weekly for unfamiliar User-agent strings
  • Set up alerts for new User-agents once they reach a meaningful request volume

Template for Regular Review

Create a monthly task to review and update your AI crawler blocklist:

  1. Check Dark Visitors for newly identified AI crawlers
  2. Search your server logs for the previous month's new User-agents
  3. Research any unfamiliar User-agents that appear in volume
  4. Update your robots.txt and server-level blocking rules
  5. Test the updated configuration
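
Steps 2 and 3 of this review can be partly automated. The sketch below pulls bot-like names out of User-agent strings and reports those that are not on your known list and that appear at least min_hits times. The KNOWN_TOKENS set, the BOT_HINT heuristic, the unfamiliar_bots helper, and the sample strings (including the invented "NewAIBot") are assumptions for illustration. The heuristic will miss crawlers whose names do not contain "bot", "spider", or "crawler".

```python
import re
from collections import Counter

# Tokens you have already reviewed (lowercase). Extend to match your blocklist.
KNOWN_TOKENS = {"gptbot", "ccbot", "claudebot", "googlebot", "bingbot"}

# Heuristic: a word that contains "bot", "spider", or "crawler".
BOT_HINT = re.compile(r"[A-Za-z][\w-]*(?:bot|spider|crawler)[\w-]*", re.IGNORECASE)


def unfamiliar_bots(user_agents, known=KNOWN_TOKENS, min_hits=2):
    """Return {token: hits} for bot-like names not in known, seen >= min_hits."""
    counts = Counter()
    for ua in user_agents:
        # A set avoids double-counting a name repeated in the bot's info URL.
        for token in {name.lower() for name in BOT_HINT.findall(ua)}:
            if token not in known:
                counts[token] += 1
    return {token: hits for token, hits in counts.items() if hits >= min_hits}


# Made-up User-agent strings standing in for a month of log data.
SAMPLE_AGENTS = [
    "Mozilla/5.0 (compatible; NewAIBot/0.1; +https://example.org/bot)",
    "Mozilla/5.0 (compatible; NewAIBot/0.1; +https://example.org/bot)",
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",
    "OtherSpider/2.0",
]

print(unfamiliar_bots(SAMPLE_AGENTS))
```

Anything the script reports still needs manual research before it goes on the blocklist, as step 3 describes.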
🏆 Final Takeaway: AI crawlers represent a new frontier in web governance. Unlike traditional search engine crawlers, which benefit publishers through referral traffic, AI training crawlers typically take content without direct benefit to the publisher. A thoughtful strategy that balances protection of IP against potential future AI referral traffic is essential. Update your robots.txt today with the configuration above, then monitor and adapt as the landscape evolves.


Sarah is a seasoned digital publisher and brand storyteller specializing in content protection, metadata standards, and intellectual rights management.
