
How to Block AI Bots From Your Website (Without Breaking Your SEO)


There are over 30 known AI bots crawling the web right now. GPTBot, ClaudeBot, Bytespider, Google-Extended, CCBot, Meta-ExternalAgent... the list keeps growing every quarter. Most of them are collecting your content to train large language models. Some are powering AI search engines. A few are doing both.

And most websites block exactly zero of them.

We crawled over a million domains across seven AI policy standards. The finding? 90.1% have no machine-readable AI policy at all. No rules in their robots.txt for AI bots. No ai.txt. Nothing telling these crawlers what they can and can't do.

If you haven't specifically configured your site for AI crawlers, your content is almost certainly being used to train AI models right now. This guide walks through how to fix that without accidentally tanking your search rankings or disappearing from AI-powered search results.

The important distinction: training bots vs. search bots

Before you start blocking things, you need to understand that AI bots come in two very different flavors. Getting this wrong can hurt you.

Training bots crawl your site to collect content for building AI models. When GPTBot visits your page, that content may end up as training data for the next version of GPT. You get nothing in return. No traffic, no attribution, no compensation. The same goes for Google-Extended (Gemini training), CCBot (Common Crawl datasets used by nearly every AI lab), Bytespider (ByteDance), and others.

Search and retrieval bots fetch your pages in real time when a user asks an AI assistant a question. When someone asks ChatGPT “what does [your company] do?” and ChatGPT browses the web to answer, that request comes from ChatGPT-User. Similarly, PerplexityBot fetches your page to generate a cited answer in Perplexity's search results. These bots can actually send you traffic.

The smart move for most websites is to block training bots while allowing search bots. You protect your content from being absorbed into AI models permanently, but you stay visible in AI-powered search results.

The tricky part? Some bots do both. GPTBot is the classic example. OpenAI uses it for training AND for some browsing features. If you block GPTBot, you prevent training but also limit some real-time ChatGPT access. OpenAI has separated the crawlers (GPTBot for training, ChatGPT-User for browsing, OAI-SearchBot for search), but the lines aren't always clean.
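The training-vs-search split is easy to get wrong, so it helps to test it before deploying. This sketch uses Python's standard `urllib.robotparser` to check how two different OpenAI user agents are treated by a rule set that blocks the training crawler but not the browsing one (the example URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Example rules: block the training crawler, leave the browsing bot alone.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# GPTBot (training) is blocked; ChatGPT-User (browsing) is not.
print(parser.can_fetch("GPTBot", "https://example.com/pricing"))        # False
print(parser.can_fetch("ChatGPT-User", "https://example.com/pricing"))  # True
```

Note that `robotparser` matches user agents by substring, the same way most real crawlers do, which is why `GPTBot` and `ChatGPT-User` need separate entries.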

Before you start blocking, check what's exposed

See which AI bots can currently access your site and where your gaps are.

Scan your site free

Step 1: Add AI-specific rules to your robots.txt

robots.txt is still the primary way to control AI bot access. It's a simple text file at the root of your website that tells crawlers which parts of your site they can visit. Most AI companies, including OpenAI, Anthropic, and Google, have publicly committed to respecting robots.txt directives.

Here's a robots.txt configuration that blocks training crawlers while keeping search bots active. Add these lines to the END of your existing robots.txt file. Do not replace your entire file, or you'll lose your existing rules for Googlebot and other search engines.

robots.txt
# =============================================
# AI Training Crawlers - BLOCKED
# These bots collect content for model training.
# =============================================

# OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Anthropic training crawlers
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /

# Common Crawl (datasets used by OpenAI, Meta, Google, others)
User-agent: CCBot
Disallow: /

# ByteDance / TikTok AI training
User-agent: Bytespider
Disallow: /

# Google AI training (does NOT affect Google Search)
User-agent: Google-Extended
Disallow: /

# Apple Intelligence training
User-agent: Applebot-Extended
Disallow: /

# Meta AI training
User-agent: Meta-ExternalAgent
Disallow: /

# Cohere AI training
User-agent: cohere-ai
Disallow: /

# Allen Institute for AI
User-agent: AI2Bot
Disallow: /

# Other known training crawlers
User-agent: Diffbot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: img2dataset
Disallow: /
User-agent: Timpibot
Disallow: /
User-agent: FacebookBot
Disallow: /

# DeepSeek AI
User-agent: DeepSeekBot
Disallow: /

# xAI / Grok
User-agent: Grokbot
Disallow: /

A few important notes on this setup:

  • Blocking Google-Extended does NOT affect your Google Search rankings. Google-Extended only controls whether your content is used to train Gemini. Your regular Googlebot indexing is completely separate. Never block Googlebot itself.
  • Blocking GPTBot while allowing ChatGPT-User means your content won't be used for OpenAI model training, but ChatGPT can still browse your page when a user specifically asks it to. This is the configuration most websites should use.
  • The list of AI bots is growing constantly. New crawlers appear every few months. Check resources like Dark Visitors or CrawlerCheck periodically and update your robots.txt when new bots appear.
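After editing your robots.txt, it's worth confirming that every training bot in the template is actually matched. A small sketch (again using Python's standard `urllib.robotparser`; the bot list is trimmed for brevity, and in practice you'd read your real file instead of generating one):

```python
from urllib.robotparser import RobotFileParser

TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Bytespider",
    "Google-Extended", "Applebot-Extended", "Meta-ExternalAgent",
]

# Generate a rules file that disallows every bot above; swap in the
# contents of your real robots.txt here.
robots_txt = "\n".join(
    f"User-agent: {bot}\nDisallow: /" for bot in TRAINING_BOTS
)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Any bot that can still fetch the root has slipped through the rules.
leaks = [bot for bot in TRAINING_BOTS if parser.can_fetch(bot, "/")]
print("still allowed:", leaks or "none")
```

Running a check like this after every robots.txt edit catches the most common mistake: a typo in a user-agent line that silently leaves one crawler unblocked.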

Skip the manual work

Our policy wizard generates these files for you. Pick your site type, set your preferences, download.

Build your AI policy

Step 2: Check for the “ToS gap”

Here's something most guides won't tell you: updating your robots.txt is necessary, but it might not be sufficient, depending on what the rest of your site says.

We found 7,575 websites where the Terms of Service explicitly prohibit scraping or automated data collection, but the robots.txt file has zero AI-specific rules. That means a bot checking robots.txt sees “nothing blocked, come on in” while the legal page says “absolutely not.”

This creates a gap. Your legal team wrote rules that your technical setup doesn't enforce. And under current legal frameworks, particularly the Van Buren Supreme Court ruling on the Computer Fraud and Abuse Act, courts focus on whether there was a technical barrier, not just a written policy.

Check whether your Terms of Service mention anything about scraping, automated access, data mining, or AI training. If they do, your robots.txt should match. Otherwise you have a policy that exists only on paper.
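A rough way to start that check is a keyword scan of your ToS text. The function and patterns below are illustrative, not exhaustive; a real audit means a human reading the actual clauses:

```python
import re

# Patterns that suggest the ToS restricts scraping or automated access.
# Hypothetical starter list; extend it with your own legal language.
SIGNALS = [
    r"scrap(?:e|ing)",
    r"automated (?:access|means|data collection)",
    r"data mining",
    r"crawl(?:er|ing)?",
    r"ai training",
]

def tos_signals(tos_text: str) -> list[str]:
    """Return the patterns that match in a Terms of Service document."""
    text = tos_text.lower()
    return [p for p in SIGNALS if re.search(p, text)]

tos = "You may not use scraping, data mining, or automated means to access the Service."
print(tos_signals(tos))  # three patterns match this sample clause
```

If this returns anything and your robots.txt has no AI rules, you have the ToS gap described above.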

Step 3: Go beyond robots.txt (three tiers of protection)

robots.txt is the baseline, but it's not the only option. There are three tiers of AI bot protection, each with different tradeoffs. Most sites should start at Tier 1 and move up based on how much enforcement they need.

Tier 1: Policy files (robots.txt, ai.txt, agent-permissions.json)

This is what we covered in Step 1. You publish files that tell bots what your rules are. The bots choose whether to follow them.

Who respects it: OpenAI, Anthropic, Google, and most major AI companies have publicly committed to honoring robots.txt. That covers the biggest crawlers. The ai.txt and agent-permissions.json standards add more detail (distinguishing training from search, for example), and adoption is growing.

Limitations: It's voluntary. Bad actors ignore it entirely. Perplexity has been accused of bypassing robots.txt restrictions. Bytespider has been observed accessing blocked paths on test sites. User-agent strings can be spoofed, so a bot pretending to be something else will get through.

Bottom line: policy files are necessary, but on their own they aren't sufficient if you need hard enforcement. They cover the well-behaved majority, and for most sites that's enough.

You can generate all your policy files (robots.txt additions, ai.txt, agent-permissions.json, and TDMRep) in about two minutes using our free wizard.

Tier 2: Server-side blocking

If you have access to your web server configuration (nginx, Apache, Caddy, etc.), you can check the user-agent header on incoming requests and return a 403 Forbidden for known AI bots. The bot never gets your content. It's not a polite request anymore. It's actual enforcement.

Here's a quick nginx example:

nginx.conf
# Return 403 Forbidden when the User-Agent matches a known AI crawler
# (~* is a case-insensitive regex match)
if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot|Bytespider)) {
    return 403;
}

Limitation: This still relies on user-agent strings, which can be spoofed. A bot that lies about its identity will get past this. But combined with Tier 1 policy files, you're covered against the vast majority of known AI crawlers.

Best for: Sites with server access (VPS, dedicated hosting, self-managed infrastructure) that want enforcement beyond policy files. Not available on most managed hosting platforms like Shopify or Squarespace.
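If your app runs behind a server you control but you'd rather not touch nginx or Apache config, the same user-agent check can live in the application layer. A minimal WSGI middleware sketch in Python (the bot list and names are illustrative; frameworks like Django or Flask have their own middleware hooks for this):

```python
import re

# Known AI crawler user-agent fragments; extend to match your robots.txt.
AI_BOTS = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider", re.IGNORECASE)

def block_ai_bots(app):
    """WSGI middleware: return 403 for any request whose User-Agent
    matches a known AI crawler; pass everything else through."""
    def wrapper(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if AI_BOTS.search(ua):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapper

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]

application = block_ai_bots(app)
```

Like the nginx version, this only stops bots that identify themselves honestly; a spoofed user agent walks straight past it.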

Tier 3: CDN-level blocking (strongest)

Services like Cloudflare, Akamai, and DataDome sit between your server and all incoming traffic. They use ML-based fingerprinting and behavioral analysis to identify bots even when they spoof their user-agent. Cloudflare's AI Crawl Control offers one-click blocking for all AI bots, plus the option to charge bots for access.

This is the strongest enforcement currently available. The CDN sees the actual connection fingerprint, request patterns, and behavioral signals that a simple user-agent check can't detect.

Limitation: Requires your site to be behind that CDN. Cloudflare covers about 20% of the web. If you're on shared hosting, Vercel, Netlify, or bare metal without a CDN, you don't get this level of protection.

If your site is already behind Cloudflare, turning on AI Crawl Control is the quickest win.

Which tier do you need?

Regardless of which tier you choose, the first step is the same: understand your current exposure. A CDN blocks bots at the network level, but it doesn't tell you whether your Terms of Service conflicts with your robots.txt, or whether your site sends mixed signals across different AI policy standards. That's what our scanner catches. Start there, then decide how much enforcement you need.

Step 4: Check your current AI exposure

Before making any changes, it helps to see where you actually stand. Are AI bots currently accessing your site? Which ones? Are there conflicts between your different policy signals?

We built a free scanner that checks any domain against seven AI policy standards in a few seconds. It shows you which bots can access your site, whether your ToS conflicts with your robots.txt, and where the gaps are.

A quick checklist before you go

Here's the short version of everything above:

  1. Add AI training bot rules to your robots.txt (use the template above).
  2. Keep AI search bots (ChatGPT-User, PerplexityBot, OAI-SearchBot) allowed unless you have a specific reason to block them.
  3. Check your Terms of Service for scraping/AI clauses and make sure your robots.txt matches.
  4. Scan your site to see your full AI exposure across all seven standards.
  5. Revisit every quarter. New bots appear regularly and the landscape changes fast.

Your content has value. Setting clear rules for how AI interacts with it isn't paranoia. It's basic digital hygiene in 2026.

Take action

Protect your content today

Scan your site to see your AI exposure, or build your policy in 2 minutes. Free, no signup required.