State of AI Agent Policies - 2026

We crawled 1 million domains. 90% have no AI policy.

The first comprehensive study of how the web governs AI agents. We parsed 8 competing standards across the Tranco top 1M domains to map the permissions landscape that every AI agent navigates - mostly blindly.

999,316 domains analyzed · 8 standards parsed · Feb 12–22, 2026 · By Maango
90.1%
No machine-readable AI policy
4.8%
Explicitly block all AI agents
6.9%
Block GPTBot (most-blocked bot)
2.6%
Have comprehensive AI policies

01 - Key Findings

The web is not ready for AI agents.

Between February 12 and 22, 2026, we crawled 999,316 domains from the Tranco top 1M list, parsing every known AI policy standard: robots.txt AI directives, llms.txt, ai.txt, TDMRep, Cloudflare Content Signals, meta tags, and others. What we found is a governance vacuum.

9 in 10 domains have no machine-readable AI policy at all. Of the 10% that do express a stance, most rely solely on robots.txt - a 30-year-old protocol that was never designed for AI and has no concept of “training” vs. “search” vs. “summarization.”

Only 2.6% of all domains have what we'd consider a comprehensive policy - signals across multiple standards with clear rules for different AI use cases. The rest of the web is either silent or using blunt instruments.

Meanwhile, the stakes are rising. Anthropic has agreed to a $1.5B copyright settlement. The EU Copyright Directive makes TDM opt-out signals legally binding. OpenAI stopped honoring robots.txt for ChatGPT-User in December 2025. The gap between what AI agents are doing and what websites have said about it is enormous - and growing.

The core problem: There are 8 competing standards, near-zero interoperability, and almost no adoption of anything beyond robots.txt. AI agents are navigating a web that hasn't decided how to talk to them.

02 - Methodology

How we collected this data

Transparency matters. Every number in this report is derived from a direct crawl of the Tranco top 1M list - a research-grade domain ranking based on aggregated DNS usage data from Cloudflare Radar, Farsight DNSDB, and other sources. Unlike Alexa (discontinued 2022), Tranco is designed to resist manipulation and is used widely in academic security research.

Source List
Tranco Top 1M (Feb 2026)
Domains Crawled
999,316 of 1,000,000
Crawl Period
Feb 12–22, 2026
Failure Rate
0.07% (697 domains)
Standards Parsed
8 (robots.txt, llms.txt, ai.txt, TDMRep, Content Signals, meta tags, agents.json, UCP)

For each domain, we fetched and parsed every known AI policy signal, extracted per-bot directives from robots.txt, checked for llms.txt and ai.txt files, inspected HTTP headers and meta tags for Content Signals and TDM headers, and computed an openness score (0–100) and stance classification. All raw data is stored in Supabase and queryable through the Maango API.

What we did not do: We did not attempt to interpret Terms of Service documents (this is planned for a future crawl using LLM analysis). We did not crawl subdomains or subpages - each entry represents the root domain only. We did not contact domain owners or make editorial judgments about intent.

03 - Stance Distribution

What the web says about AI agents

We classified every domain into one of six categories based on the signals we found. The result is stark: the vast majority of the web says nothing at all.

AI Stance Distribution - All 999,316 Domains

No Policy (87.9%) + Wildcard Block (2.2%) = 90.1% of domains with no intentional AI-specific policy. Source: Maango crawl of Tranco Top 1M, Feb 2026

| Stance | Definition | Domains | Share |
| --- | --- | --- | --- |
| No Policy | No machine-readable AI-specific signal found | 878,376 | 87.9% |
| Blocks All AI | Explicitly targets known AI bots by user-agent name | 47,697 | 4.8% |
| Wildcard Block | Broad Disallow: * rules that block AI bots as a side effect, not by name | 22,360 | 2.2% |
| Selective | Allows some AI use cases, blocks others | 35,016 | 3.5% |
| Allows All | Explicitly permits all AI access | 8,643 | 0.9% |
| Blocks Training Only | Allows search/browsing but blocks training | 7,224 | 0.7% |

Defining “No Policy”: 58% of domains have a robots.txt file - so how can 88% have “no policy”? Because most robots.txt files contain only generic crawl rules (like Disallow: /admin/) that predate AI entirely. We classify a domain as having an AI policy only if it contains AI-specific signals: rules targeting known AI bots (GPTBot, ClaudeBot, etc.), AI-related meta tags, llms.txt, TDMRep headers, or Content Signals. A further 2.2% use broad wildcard rules that block all bots - including AI crawlers - as a side effect, bringing the combined total to 90%.
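The classification rule above can be sketched roughly as follows. The field names and bot list are illustrative, not the crawler's actual schema, and the "blocks training only" case (which needs a per-bot purpose mapping) is elided for brevity.

```python
# Known AI crawler user-agents (a small subset, for illustration).
AI_BOTS = {"gptbot", "claudebot", "google-extended", "ccbot", "applebot-extended"}

def classify_stance(signals: dict) -> str:
    """Map parsed per-domain signals to a stance category.

    `signals` is assumed to look like:
      {"ai_bot_rules": {"gptbot": "disallow", ...},  # robots.txt rules naming AI bots
       "wildcard_disallow_all": bool,                # Disallow under User-agent: *
       "has_llms_txt": bool, "has_tdmrep": bool, "has_content_signals": bool}
    """
    rules = signals.get("ai_bot_rules", {})
    has_ai_signal = bool(rules) or any(
        signals.get(k) for k in ("has_llms_txt", "has_tdmrep", "has_content_signals")
    )
    if not has_ai_signal:
        # Generic rules that happen to block everything are a side effect,
        # not an AI policy: Wildcard Block, not Blocks All AI.
        return "wildcard_block" if signals.get("wildcard_disallow_all") else "no_policy"
    verdicts = set(rules.values())
    if verdicts == {"disallow"} and rules.keys() >= AI_BOTS:
        return "blocks_all_ai"
    if verdicts == {"allow"}:
        return "allows_all"
    return "selective"
```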

The “No Policy” category deserves scrutiny. Having no AI-specific signals doesn't mean a domain wants to be crawled - it means they haven't expressed a preference in any machine-readable format an AI agent can interpret. In many cases, especially for smaller sites, the owner may not even be aware these standards exist.

The nuance gap: Only 4.4% of all domains have what you might call a “considered” AI policy - one that differentiates between use cases (selective) or explicitly opts in (allows all). The other 95.6% either say nothing or use a blanket block. The web is mostly binary: silent or locked.

04 - Signal Adoption

Which standards are actually being used?

There are 8 competing standards for expressing AI permissions. In practice, one dominates and the rest barely register.

Signal Adoption Across 999,316 Domains

Note: A domain may have multiple signals. "AI-specific robots.txt" means robots.txt with rules targeting known AI bots (GPTBot, ClaudeBot, etc.), distinct from generic robots.txt rules.

robots.txt remains the backbone. 58% of domains have a robots.txt file, but only 11.1% include rules specifically targeting AI bots. The rest have generic crawl directives that predate the AI era.
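For reference, an AI-specific robots.txt block looks like this. GPTBot and Google-Extended are user-agent tokens published by OpenAI and Google; the generic rule at the bottom is the kind that predates AI and does not count as an AI policy under our classification.

```text
# AI-specific: named AI crawlers are blocked outright.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Generic: a pre-AI crawl rule, not an AI policy.
User-agent: *
Disallow: /admin/
```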

llms.txt is the leading “new” standard at 3.24% adoption - roughly 32,400 domains. It's been adopted by major players including Adobe, Shopify, Stripe, Salesforce, Nvidia, and Dropbox. But llms.txt is an information file, not a permissions mechanism - it tells LLMs what content is available and how it's structured, not what they're allowed to do with it.

Cloudflare Content Signals reach 3.48% - but this warrants a caveat. Nearly all Content Signals domains use the identical pattern: search=yes, ai-train=no. This uniformity strongly suggests these are Cloudflare platform defaults applied automatically to customer sites, not deliberate policy choices by individual domain owners.
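Content Signals are expressed as an extra directive line inside robots.txt. The near-universal pattern we observed looks roughly like this (syntax per the Content Signals proposal; the exact surrounding rules vary by site):

```text
User-agent: *
Content-Signal: search=yes, ai-train=no
```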

TDMRep is virtually non-existent at 9 domains out of 1 million. This is significant because the EU Copyright Directive explicitly references TDM reservation as the mechanism for rights holders to opt out of AI training. The standard that EU law points to has near-zero adoption.
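For completeness, here is what a TDM reservation looks like in its simplest transport, HTTP response headers. This is a sketch based on the W3C TDM Reservation Protocol draft (which also allows a JSON file at /.well-known/tdmrep.json); the policy URL is a placeholder.

```text
HTTP/1.1 200 OK
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json
```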

Signal Adoption by Domain Popularity

| Signal | Top 1K | Top 10K | Top 100K | All 1M |
| --- | --- | --- | --- | --- |
| AI-specific robots.txt | 17.7% | 15.5% | 12.5% | 11.1% |
| llms.txt | 5.4% | 3.8% | 3.4% | 3.2% |
| Content Signals | 0.1% | 1.7% | 2.6% | 3.6% |
| ai.txt | 0.1% | 0.1% | 0.1% | 0.1% |
| TDMRep | 0.0% | 0.0% | 0.0% | 0.001% |

An interesting pattern emerges: the most popular sites are more likely to use robots.txt AI rules and llms.txt (they have the engineering resources), while Content Signals adoption increases further down the list (driven by Cloudflare's broad install base among smaller sites).

05 - The Blocking Landscape

GPTBot is the most blocked bot on the internet.

When domains do block AI agents, they rarely discriminate. But the order of blocking reveals market dynamics and publisher sentiment.

Top 10 Most-Blocked AI Bots

% of all 999,316 domains that explicitly disallow each bot in robots.txt

GPTBot leads at 6.9% - roughly 68,700 domains actively block OpenAI's crawler. ClaudeBot (Anthropic) follows at 6.1%, then Amazonbot (6.0%), Google-Extended (5.9%), and Applebot-Extended (~5.6%). Meta's bot appears under multiple user-agent strings across robots.txt files; counting Meta-ExternalAgent and its aliases together, 56,048 unique domains block Meta's crawler - placing it firmly in the top five.

The blocking is highly correlated. A precise 2×2 breakdown of GPTBot vs. ClaudeBot reveals: 58,791 domains block both, 9,888 block only GPTBot, 2,521 block only ClaudeBot, and 928,116 block neither. The asymmetry (more GPT-only than Claude-only) is partly explained by NULL handling in early robots.txt parsers that defaulted to disallowing unrecognised agents - but the dominant pattern is clear: blocking is overwhelmingly a blanket decision, not a targeted choice about specific companies.

ChatGPT-User vs. GPTBot: OpenAI uses two bots - GPTBot for training and ChatGPT-User for real-time browsing. GPTBot is blocked on 6.9% of domains; ChatGPT-User on only 2.9%. This gap of ~40,000 domains may represent sites that want to block training but not search - or, more likely, sites that haven't updated their robots.txt since ChatGPT-User was introduced.
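The protocol's bluntness is easy to demonstrate with Python's standard-library parser: each user-agent gets a bare yes/no per path, and the only way to say "browsing yes, training no" is to name the two bots separately, as in this sketch.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks OpenAI's training bot but allows its browsing bot.
robots = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("ChatGPT-User", "https://example.com/article"))  # True
```

A site that wrote only the GPTBot rule before ChatGPT-User existed blocks training but silently permits browsing - which is one plausible reading of the ~40,000-domain gap above.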

Blocking Rates Among the Top 10K

The most popular sites block AI bots at roughly the same rates, with one exception: CCBot (Common Crawl) is blocked more heavily among top domains (7.4% vs 5.4% overall), likely because well-known publishers were early to block the training dataset that powered early LLMs.

06 - The Popularity Divide

Bigger sites are more restrictive.

We segmented domains by Tranco rank to see how AI policy adoption varies with site popularity.

AI Stance by Domain Popularity Tier

| Tier | No Policy | Blocks All | Selective | Allows All | Avg Openness |
| --- | --- | --- | --- | --- | --- |
| Top 1K | 84.0% | 8.7% | 4.7% | 0.6% | 34.0 |
| Top 10K | 86.4% | 6.4% | 4.7% | 0.9% | 34.8 |
| Top 100K | 89.7% | 5.1% | 3.2% | 0.8% | 35.5 |
| All 1M | 90.1% | 4.8% | 3.5% | 0.9% | 34.8 |

* “No Policy” in this table includes Wildcard Block domains (shown separately in the stance table above).

The top 1,000 domains block AI agents at 1.8x the overall rate (8.7% vs. 4.8%). They're also more likely to have selective policies (4.7% vs. 3.5%). This makes intuitive sense - high-value content sites have more to protect and more resources to configure policies.

Conversely, the long tail (ranks 100K–1M) is overwhelmingly silent. These are the sites where AI agents have the least guidance and the most latitude.

07 - Notable Domains

Who's blocking, who's open, and who's silent.

Top 100 Domains That Block All AI

facebook.com (#5), instagram.com (#12), twitter.com (#16), amazon.com (#24), netflix.com (#37), pinterest.com (#47), x.com (#58)

Social platforms and major consumer brands dominate the “blocks all” list. These companies have the most user-generated content and the most to lose from unauthorized AI training.

Top 100 Domains With Selective Policies

linkedin.com (#19), github.com (#31), tiktok.com (#64), yandex.ru (#82)

These sites differentiate between AI use cases - allowing some bots while blocking others, or permitting search access while blocking training crawlers.

Top 100 Domains With No Policy

google.com (#1), microsoft.com (#3), apple.com (#8), youtube.com (#9), wikipedia.org (#29)

Perhaps the most striking finding: the world's biggest websites - Google, YouTube, Microsoft, Apple, Wikipedia - have no AI-specific machine-readable policy. This doesn't mean they don't have AI-related terms in their ToS, but they haven't expressed those terms in any format that an AI agent can programmatically read and respect.

Wildcard Block Domains

reddit.com (#103)

A distinct category from “Blocks All AI”: these 22,360 domains use broad Disallow: * or blanket user-agent rules in robots.txt that happen to catch AI bots as a side effect - not because the owner explicitly targeted AI. Reddit is the highest-profile example: its robots.txt blocks most automated access broadly, but this predates AI crawler policy and cannot be read as a deliberate AI stance.

Most Restrictive Major Domains

The lowest openness scores (5-10 out of 100) among domains ranked in the top 10K are dominated by news publishers:

bbc.co.uk (#212), espn.com (#387), cnbc.com (#442), latimes.com (#777), nbcnews.com (#792), newsweek.com (#1,122)

Most Open Major Domains

The highest openness scores (95 out of 100) include cybersecurity vendors and platforms that benefit from broad AI indexing:

kaspersky.com (#150), sophos.com (#517), expedia.com (#2,273), lg.com (#2,621)

08 - Infrastructure Patterns

Your CDN and CMS shape your AI policy.

One of the more unexpected findings: the infrastructure a site runs on is a strong predictor of its AI stance.

CDN and AI Blocking

AI Blocking Rate by CDN Provider

% of domains on each CDN that block all AI agents

Cloudflare-hosted sites block AI agents at 11.3% - more than double the overall rate of 4.8%. This is almost certainly a product feature effect: Cloudflare launched a one-click “Block AI Bots” toggle in July 2025, making it trivially easy for site owners to opt out. When blocking is one click away, more people click.

At the other end, Vercel (1.3%), Akamai (1.2%), and Netlify (0.7%) sites are far more permissive. These platforms are developer-oriented and don't offer default AI blocking features. We verified that the Vercel and Cloudflare detection signals have only 40 domains of overlap, confirming that the CDN categories are effectively independent - the Cloudflare blocking spike is not an artefact of cross-CDN detection.

A note on CDN detection: We identify CDNs via HTTP response headers (Server, x-served-by, via, etc.) and CNAME records. This method has known limitations - Akamai in particular often does not expose identifying headers, which likely explains its low count (2,122) relative to its true market share.

CMS and AI Blocking

AI Blocking Rate and Avg Openness by CMS

| CMS | Domains | % Block All | Avg Openness |
| --- | --- | --- | --- |
| Shopify | 22,378 | 1.2% | 63.1 |
| Gatsby | 1,719 | 1.2% | 52.4 |
| Webflow | 10,251 | 2.6% | 57.5 |
| Wix | 2,053 | 3.2% | 21.6 |
| Squarespace | 1,758 | 3.4% | 21.9 |
| Drupal | 9,967 | 4.5% | 33.4 |
| Ghost | 3,201 | 4.8% | 40.1 |
| Hugo | 2,533 | 4.9% | 33.9 |
| Next.js | 29,618 | 5.4% | 42.9 |
| WordPress | 150,626 | 6.3% | 46.9 |

Shopify sites are the most open to AI (1.2% blocking, 63.1 average openness). This makes strategic sense - e-commerce sites want maximum discoverability, and AI-powered product search is a distribution channel, not a threat.

WordPress is the most restrictive CMS at 6.3% blocking, with Ghost (4.8%) close behind. Both power independent publishers and bloggers - the exact audience most concerned about AI content extraction.

Wix and Squarespace are anomalies with average openness scores of just 21.6 and 21.9 respectively - far below any other CMS. This likely reflects platform-level default settings rather than individual site owner decisions.

09 - Geographic Patterns

Where you are shapes how you feel about AI.

TLD analysis reveals significant geographic variation in AI policy adoption.

AI Blocking Rate by Country-Code TLD

Only TLDs with 100+ domains shown. Percentage = share of domains that block all AI agents.

The UK is among the most restrictive major markets - 4.4% of .uk domains block all AI agents, among the highest rates for major country-code TLDs. This aligns with the UK's active regulatory posture and its large publishing industry.

Japan and Iran are the most permissive at 1.1% and 0.9% respectively. Japanese domains are notably open, possibly reflecting a more AI-friendly regulatory environment and technology culture.

Russia is more open than Europe - .ru domains block at 1.2% vs. 3.2% for .de (Germany) and 4.1% for .fr (France). European domains are broadly more restrictive, likely influenced by the GDPR compliance culture and the EU Copyright Directive.

10 - Use Case Analysis

Training, search, and inference are blocked at similar rates.

We analyzed how domains treat three distinct AI use cases: training (building models on the content), search (indexing for retrieval), and inference (real-time reasoning over the content).

Policy Stance by AI Use Case

| Use Case | Blocked | Allowed | No Policy | Selective |
| --- | --- | --- | --- | --- |
| Training | 6.5% | 0.4% | 91.8% | 1.3% |
| Search | 6.1% | 0.8% | 91.6% | 1.5% |
| Inference | 6.6% | 0.4% | 93.0% | 0.0% |

The differences are smaller than you might expect. Training and inference are blocked at almost identical rates (6.5% vs 6.6%). Search is slightly less blocked (6.1%) and more often selectively allowed (1.5% vs 1.3%).

The nuance gap is real: Only 0.7% of domains (those classified as “blocks training only”) differentiate between training and other use cases. The remaining 99.3% either block everything, allow everything, or say nothing. The sophisticated, purpose-aware AI policy that regulators envision is essentially absent from the web.

11 - Conflicts Between Standards

6,317 domains contradict themselves.

When multiple standards exist, they don't always agree. We detected 6,317 domains (0.63% of all, but 5.2% of domains that have any signals) where different policy sources express conflicting positions.

This is a structural problem, not an edge case. A site might block GPTBot in robots.txt but set ai-train=no, search=yes in Content Signals - one says “no access,” the other says “search is fine.” Which does the agent follow?

Today, each agent resolves this differently - or ignores it. There is no standard for conflict resolution. This is exactly the kind of ambiguity that creates legal liability for agent developers and frustration for website owners.

Conflicts will grow. As newer standards gain adoption, the probability of contradictions increases. An agent that only checks robots.txt will make different decisions than one that also reads Content Signals. A neutral aggregation layer that detects and resolves these conflicts becomes more valuable with every new standard.

12 - Implications

What this means.

For AI Agent Developers

You are almost certainly not checking permissions comprehensively. If your agent only reads robots.txt, you're seeing 1 of 8 signals and missing conflicts, purpose-specific rules, and newer standards that may carry legal weight. The 90% of domains with no policy are a gray zone - absence of a signal is not the same as permission. As regulatory enforcement tightens, “we didn't check” becomes increasingly untenable.

For Website Owners

If you haven't configured AI-specific signals, AI agents are making their own decisions about your content. The tools exist - robots.txt AI directives, llms.txt, Content Signals - but adoption is minimal. The gap between the legal right to opt out (especially under the EU Copyright Directive) and the practical expression of that right is enormous. Most opt-out mechanisms that regulators point to have near-zero adoption.

For Regulators

The EU Copyright Directive's TDM reservation mechanism is adopted by 9 domains out of 1 million. The standard that EU law references as the opt-out mechanism for AI training is functionally non-existent. Any regulatory framework that relies on machine-readable signals needs to grapple with the fact that the web has not adopted them - and may not without significant intervention or simplification.

For the Industry

Eight competing standards is not a governance framework - it's a fragmentation problem. The web needs convergence, not more proposals. Until that happens, a neutral aggregation layer that reads everything and presents a unified view is not optional infrastructure - it's the only way to make the current patchwork work.

Look up any domain's AI policy.

Maango crawls 8 standards, detects conflicts, and serves unified AI permissions through a single API call.

Get API access to this data

Launching Q1 2026 - join the waitlist for early access and free credits.