Do AI Crawlers Respect robots.txt?

Most major AI crawlers honor robots.txt — but the picture is nuanced. Learn which bots to know, how to allow or block them, and the visibility trade-offs.

By Team @ LLM Metrix · 7 min read · 5 sections

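Mostly, yes: the major, reputable AI operators document how they honor robots.txt. OpenAI’s GPTBot, Anthropic’s ClaudeBot, and PerplexityBot are named crawlers that follow the directives addressed to them. Google’s Google-Extended is a robots.txt control token rather than a separate crawler: it governs whether content fetched by Google’s existing crawlers may be used for its AI models. But the picture has nuance: different bots serve different purposes, compliance is voluntary, and blocking a crawler has real visibility consequences. Here’s what you need to know.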

How robots.txt works for AI crawlers

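robots.txt is a voluntary standard, formalized as the Robots Exclusion Protocol in RFC 9309: it tells well-behaved bots which paths they may access. Rules are grouped under User-agent lines, and a crawler follows the group that names it. Reputable AI companies publish named user agents and respect Disallow rules for them. For example, you can allow or block each bot independently: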

# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Allow Perplexity's crawler
User-agent: PerplexityBot
Allow: /

The key nuance: “voluntary” means compliance depends on the operator. Major, named crawlers honor it; obscure or bad-actor scrapers may not. See the AI crawlers guide for the current list of user agents.
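To see how a compliant crawler would read the sample rules above, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `is_allowed` helper, the `SomeOtherBot` name, and the example.com URL are illustrative assumptions, and the standard-library parser may differ in edge cases from how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from this section, as they would appear in robots.txt.
ROBOTS_TXT = """\
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Allow Perplexity's crawler
User-agent: PerplexityBot
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`.

    Hypothetical helper for illustration; it only evaluates the rules and
    says nothing about whether a real crawler chooses to obey them.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder; SomeOtherBot is a made-up agent with no rules.
for bot in ("GPTBot", "PerplexityBot", "SomeOtherBot"):
    print(bot, is_allowed(ROBOTS_TXT, bot, "https://example.com/pricing"))
```

Run as written, this prints `GPTBot False`, `PerplexityBot True`, and `SomeOtherBot True`: a bot that no group names, in a file with no `User-agent: *` group, is allowed by default.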

Training vs. retrieval crawlers

Not all AI bots do the same thing, and the distinction matters for your decision:

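  • Training crawlers (e.g., GPTBot and ClaudeBot) gather data that may inform model training. Google-Extended plays the same role as a control token: it does not crawl on its own, but disallowing it tells Google not to use your content for its AI models. Blocking these limits whether your content shapes what models learn.
  • Retrieval/answer crawlers (e.g., PerplexityBot, along with the live fetches that search-grounded assistants make) pull pages at query time to answer with citations. Blocking these removes you from those engines’ cited answers.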

You can make different choices per bot — for instance, allow retrieval crawlers (for citations) while restricting training crawlers, if that fits your strategy.
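One way to keep per-bot decisions explicit is to generate the robots.txt groups from a small policy table. The sketch below assumes a hypothetical `POLICY` mapping and `build_robots_txt` helper (neither comes from any crawler's documentation), and it handles only whole-site allow or block decisions.

```python
# Hypothetical policy: True = allow the whole site, False = block it.
# The training/retrieval labels follow the grouping described above.
POLICY = {
    "GPTBot": False,           # training
    "Google-Extended": False,  # training control token
    "ClaudeBot": False,        # training
    "PerplexityBot": True,     # retrieval
}


def build_robots_txt(policy: dict[str, bool]) -> str:
    """Render one User-agent group per bot, separated by blank lines."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


print(build_robots_txt(POLICY))
```

Keeping the policy in one place makes the trade-off reviewable: changing a single boolean flips a bot between allowed and blocked, and the generated file can be checked into version control alongside the site.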

The visibility trade-off

Blocking AI crawlers is a real lever, but it cuts both ways:

  • Block, and you keep your content out of AI systems, but you forfeit the AI visibility, citations, and referral traffic that come with being included.
  • Allow, and you gain visibility: your content can be retrieved, cited, and represented, at the cost of AI systems using it.

Most brands seeking AI visibility should allow the major crawlers and focus on being represented accurately, rather than blocking and becoming invisible. For background, see the article “Does AI use my website?”

Practical checklist

  • [ ] Review robots.txt for unintended blocks of AI user agents
  • [ ] Decide per-bot: allow retrieval crawlers if you want citations
  • [ ] Confirm key content isn’t hidden behind scripts retrieval can’t read
  • [ ] Consider an llms.txt to guide AI to your best content
  • [ ] Re-check after site migrations, which often reset crawler rules
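The first and last checklist items can be partly automated. The following is a rough sketch, not a complete audit: it compares two versions of a robots.txt (for example, before and after a migration) and reports which AI user agents changed access. The `AI_AGENTS` tuple, both helper functions, and the sample files are assumptions for illustration, and it uses Python's standard-library parser, which may not match every crawler's interpretation.

```python
from urllib.robotparser import RobotFileParser

# Agents named in this article; extend the tuple to match your own list.
AI_AGENTS = ("GPTBot", "Google-Extended", "ClaudeBot", "PerplexityBot")


def access_map(robots_txt: str, url: str) -> dict[str, bool]:
    """Map each AI agent to whether the rules allow it to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_AGENTS}


def changed_access(before: str, after: str, url: str) -> dict[str, tuple[bool, bool]]:
    """Return {agent: (allowed_before, allowed_after)} for agents that changed."""
    old, new = access_map(before, url), access_map(after, url)
    return {a: (old[a], new[a]) for a in AI_AGENTS if old[a] != new[a]}


# Hypothetical files: the old one blocks only GPTBot; the new one is a
# blanket block, such as a staging file shipped to production by mistake.
BEFORE = "User-agent: GPTBot\nDisallow: /\n"
AFTER = "User-agent: *\nDisallow: /\n"

# example.com is a placeholder URL.
print(changed_access(BEFORE, AFTER, "https://example.com/"))
```

In this sample, GPTBot is blocked in both versions and so is not reported, while the other three agents go from allowed to blocked because the wildcard group now applies to them.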

Frequently Asked Questions

Do AI crawlers respect robots.txt?

The major reputable ones do. GPTBot, ClaudeBot, and PerplexityBot are documented to honor robots.txt directives, and Google honors rules addressed to its Google-Extended token. Compliance is voluntary, however, so obscure or bad-actor scrapers may ignore it.

How do I block AI crawlers?

Add Disallow rules for the specific user agents in your robots.txt (for example, a User-agent: GPTBot line followed on the next line by Disallow: /). You can block or allow each bot independently depending on your strategy.

Should I block AI crawlers?

Usually not, if you want AI visibility. Blocking retrieval crawlers removes you from AI answers, citations, and referral traffic, and blocking training crawlers limits your influence on what models learn. Most brands benefit from allowing the major crawlers and focusing on being represented accurately instead.

What’s the difference between training and retrieval crawlers?

Training crawlers gather data that may inform model training, while retrieval crawlers fetch live pages to answer queries with citations. Blocking training crawlers limits your influence on what models learn; blocking retrieval crawlers removes you from cited AI answers.
