Is AI Training on Your Website? How to Find Out in 30 Seconds
You published something on your website this morning. A blog post, a product page, a case study, whatever it was. By the time you hit publish, there's a good chance one or more AI bots had already scheduled a visit.
There are over 30 known AI crawlers active on the web right now. They work around the clock, visiting millions of pages per day, collecting content to train the next generation of AI models or to power real-time AI search engines. Most of them never asked for your permission. Most of them never send any traffic back to your site.
And unless you've specifically set up rules to control them, your website is wide open.
What AI bots actually do with your content
Not all AI bots are doing the same thing. It helps to understand the three main categories.
Training crawlers collect your content and feed it into datasets used to build or improve AI models. When GPTBot visits your blog, that post might become part of the training data for the next version of ChatGPT. Your words get absorbed into the model permanently. You can't undo it. You don't get credited. You don't get paid. The content is just... in there.
Search crawlers fetch your pages in real time when someone asks an AI assistant a question about your topic. These bots are more like traditional search engines. They find your content, summarize or cite it, and sometimes link back to you. This is closer to the old Google model, where crawling actually benefits you through visibility.
Bulk scrapers copy your content at scale for datasets, research, or competitive intelligence. These operate more aggressively and are less likely to respect any rules you set.
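The distinction matters in practice, because robots.txt is where most well-behaved crawlers look for instructions. Here's a minimal sketch, assuming the user-agent strings the vendors publish (GPTBot is OpenAI's training crawler, CCBot is Common Crawl's bulk collector, OAI-SearchBot is OpenAI's search crawler) and assuming the bot honors the file at all, which aggressive scrapers often don't:

# Block the training crawler
User-agent: GPTBot
Disallow: /

# Block the bulk collector
User-agent: CCBot
Disallow: /

# Allow the search crawler, which can cite and link back to you
User-agent: OAI-SearchBot
Allow: /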
The real-world impact of unchecked AI crawling is already showing. Tailwind Labs, the company behind Tailwind CSS, laid off 75% of its engineering team. The framework is more popular than ever, but documentation traffic dropped 40% because AI coding assistants now generate Tailwind code directly, so developers get their answers without ever visiting the docs. Revenue fell 80%.
That's an extreme case, but the pattern is the same everywhere. AI bots take your content, users get answers from AI instead of visiting your site, and your traffic drops.
The 90% problem
We scanned over a million websites against seven different AI policy standards to understand how the web is handling this. The results were stark.
90.1% of websites have no machine-readable AI policy at all. No AI-specific rules in robots.txt. No ai.txt file. No TDMRep headers. Nothing that tells an AI bot what it can and can't do.
That means if you haven't specifically configured your site for AI crawlers, you're in the overwhelming majority. And as far as AI training is concerned, your content is effectively fair game.
The gap your lawyer doesn't know about
Here's the part that surprises most people.
Your website probably has a Terms of Service page. And that page probably includes language about unauthorized access, automated scraping, data mining, or something similar. Your legal team put it there for a reason.
But AI bots don't read your Terms of Service page. They check your robots.txt file. And if your robots.txt has no AI-specific rules, the bot sees a green light.
We found 7,575 websites where this exact disconnect exists. The Terms of Service explicitly prohibit scraping, but the robots.txt file allows it. The legal rules are there. The technical enforcement is not.
Your lawyer wrote the policy. Nobody implemented it.
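Here's what that gap typically looks like in the file itself. A hypothetical robots.txt with no AI-specific rules (the path and sitemap URL are placeholders) says nothing about AI bots, so to GPTBot or CCBot it reads as permission:

User-agent: *
Disallow: /wp-admin/
Sitemap: https://example.com/sitemap.xml

The Terms of Service page two links away can say whatever it wants. This file is the only part a crawler actually parses.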
How to check (it takes 30 seconds)
This is the easy part.
Go to maango.io. Type in your domain. Hit scan. In a few seconds, you'll see a breakdown of your site's AI exposure across seven standards. Specifically, you'll see whether AI bots can train on your content, whether they can scrape your pages, whether they can use your site for AI search results, and whether there are any conflicts between your different policy signals.
The scan checks: robots.txt AI directives, ai.txt, llms.txt, TDMRep, HTML meta tags, HTTP headers, and Terms of Service. It's free, it doesn't require a signup, and the results are immediate.
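Several of these signals are a single line each. A sketch of two of them, as defined in the W3C TDM Reservation Protocol (TDMRep) draft, with a placeholder policy URL:

# TDMRep as HTTP response headers: reserve text-and-data-mining rights
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json

# The same reservation expressed as an HTML meta tag
<meta name="tdm-reservation" content="1">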
If you see green across the board, your site has protection in place. If you see red, you have work to do. And if you see a conflict flag, it means your site is sending mixed signals to AI bots, which is arguably worse than having no policy at all.
What to do about it
If your scan shows you're unprotected, you have a few options.
The quick fix: Add AI-specific rules to your robots.txt file. We wrote a complete guide with copy-paste templates: How to Block AI Bots From Your Website Without Breaking Your SEO. This takes about 10 minutes and covers the most common training crawlers.
The comprehensive fix: Use our policy creation wizard to generate a complete AI policy for your site. You pick what kind of site you run, set your preferences for training, search, inference, and scraping, and we generate all the policy files you need. Takes about two minutes.
The “I'll deal with it later” option: At minimum, just scan your site so you know where you stand. Awareness is the first step. You might discover your site is more exposed than you expected, or you might discover you have protections you didn't know about.
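If you want a do-it-yourself spot check on the robots.txt piece specifically, Python's standard-library robot parser can tell you what your current file says to a given bot. A minimal sketch, with example.com standing in for your domain:

from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # placeholder: use your own domain

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetch and parse the live robots.txt

# Crawler and control tokens published by OpenAI, Anthropic, Common Crawl, and Google
for bot in ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]:
    verdict = "allowed" if rp.can_fetch(bot, f"{SITE}/") else "blocked"
    print(f"{bot}: {verdict}")

If every line prints "allowed" and that wasn't a deliberate choice, the quick fix above is the place to start. Keep in mind this only covers robots.txt, not the other six standards.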
The first step is knowing
AI is not going to stop crawling the web. The bots will get more numerous and more sophisticated. The question isn't whether AI interacts with your content. It's whether you have any say in how.
Scan your site. See what AI can do on it. Then decide what you want.