Every major AI engine runs a web crawler to build and refresh the training data and retrieval indices that power its answers. Understanding which crawlers hit your site, how to identify them, and how to configure access correctly is foundational technical work for AI visibility.
The major AI crawlers
| Engine | Crawler name | User-agent string | Verified by IP |
|---|---|---|---|
| OpenAI / ChatGPT | GPTBot | GPTBot | Yes (documented IP ranges) |
| Anthropic / Claude | ClaudeBot | ClaudeBot | Yes |
| Perplexity | PerplexityBot | PerplexityBot | Yes |
| Google Gemini / AI Overviews | Google-Extended | Google-Extended | Yes (via Google IP ranges) |
| Meta AI | meta-externalagent | meta-externalagent | Partial |
| Common Crawl | CCBot | CCBot | Yes |
| You.com | YouBot | YouBot | Partial |
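Each engine also typically draws on its standard search crawler (such as Googlebot or Bingbot), which feeds retrieval data to AI products. The crawlers listed above are dedicated AI training and retrieval agents, separate from the search crawlers. One exception is Google-Extended: it is a robots.txt control token rather than a separate crawler, so Google's existing crawlers do the fetching and the token does not appear as a user-agent string in your access logs.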
How to control AI crawler access in robots.txt
The Robots Exclusion Protocol (robots.txt) applies to AI crawlers just as it does to search crawlers. You can allow or disallow specific crawlers using the User-agent directive.
Allowing all AI crawlers (recommended default)
If your robots.txt has a wildcard Disallow: / with specific exceptions, you must explicitly allow AI crawlers or they will be blocked:
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: meta-externalagent
Allow: /
This is the most common source of accidental AI crawler blocking. A wildcard block with only Googlebot and Bingbot exceptions — a pattern common in older robots.txt configurations — silently blocks all AI crawlers.
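One way to catch this pattern is to run the file through a robots.txt parser once per AI crawler token. The sketch below uses Python's standard-library `urllib.robotparser`; the sample file, the `example.com` URL, and the helper name `blocked_agents` are illustrative assumptions, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot",
             "Google-Extended", "meta-externalagent"]

# Hypothetical legacy configuration: only two search crawlers are excepted.
LEGACY_ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that this robots.txt text disallows for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Every AI token falls through to the wildcard group and is blocked.
    print(blocked_agents(LEGACY_ROBOTS, AI_AGENTS))
```

Running the same check after adding the explicit Allow groups should return an empty list for the tokens you added.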
Blocking a specific AI crawler
If you want to prevent a specific engine from crawling (for example, because you don't want to contribute to a particular AI product's training data), add a Disallow rule for its user agent:
User-agent: GPTBot
Disallow: /
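Note: blocking a crawler can affect more than training data. Some vendors use one token for both training and retrieval, while others split them: OpenAI, for example, documents OAI-SearchBot (search indexing) and ChatGPT-User (user-initiated fetches) separately from GPTBot. Before you block a token, check the vendor's crawler documentation to see which tokens govern live retrieval, or you may degrade how the product reads your pages in real time.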
Blocking AI crawlers from specific sections
You can limit AI crawler access to specific directories — for example, blocking a private documentation section while allowing public pages:
User-agent: GPTBot
Disallow: /docs/internal/
Disallow: /admin/
Allow: /
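Section-level rules are easy to get subtly wrong, so it can help to check a few representative paths. This is a minimal sketch using Python's `urllib.robotparser` against the rules above; the test paths and the helper name `allowed_paths` are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

SECTION_RULES = """\
User-agent: GPTBot
Disallow: /docs/internal/
Disallow: /admin/
Allow: /
"""


def allowed_paths(robots_txt, agent, paths):
    """Map each path to True (allowed) or False (blocked) for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


if __name__ == "__main__":
    sample = ["/pricing", "/docs/internal/roadmap", "/admin/"]
    for path, ok in allowed_paths(SECTION_RULES, "GPTBot", sample).items():
        print(path, "allowed" if ok else "blocked")
```

Python's parser applies the first matching rule in file order, whereas RFC 9309 says the longest matching rule wins. Listing the Disallow lines before `Allow: /`, as above, gives the same outcome under both interpretations.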
llms.txt: the AI-native robots.txt
llms.txt is a proposed convention (loosely analogous to robots.txt) that gives AI agents and crawlers a structured, Markdown-formatted summary of your site: who you are, what your site contains, and which pages are most important for different types of queries.
Publish the file at https://yourdomain.com/llms.txt. A minimal example:
# LLM Context for [Brand Name]
## About
[Brand] is a [category] platform that [core function] for [audience].
## Important pages
- Homepage: https://yourdomain.com
- Product: https://yourdomain.com/product
- Pricing: https://yourdomain.com/pricing
- Documentation: https://yourdomain.com/docs
## Key facts
- Founded: [year]
- Pricing: starts at $X/month
- Supported platforms: [list]
llms.txt is not yet universally supported across AI crawlers, and support varies by vendor, so check each crawler's current documentation rather than assuming the file is read. Publishing it now costs little and leaves your site ready as adoption grows.
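If the facts in this file already live in a CMS or config file, generating it avoids drift between the file and the site. The following is a rough sketch that renders the template above from plain Python data; the function name `render_llm_context` and the sample values are hypothetical.

```python
def render_llm_context(brand, about, pages, facts):
    """Render the template's three sections as one Markdown string."""
    lines = [f"# LLM Context for {brand}", "## About", about,
             "## Important pages"]
    lines += [f"- {label}: {url}" for label, url in pages.items()]
    lines.append("## Key facts")
    lines += [f"- {key}: {value}" for key, value in facts.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_llm_context(
        brand="Acme",
        about="Acme is an example platform.",
        pages={"Homepage": "https://example.com",
               "Pricing": "https://example.com/pricing"},
        facts={"Supported platforms": "web"},
    ))
```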
How AI crawlers differ from search crawlers
Crawl purpose. Search crawlers build an index for ranking and retrieval. AI crawlers serve two separate functions: building pre-training datasets (less frequent, large-scale crawls) and building retrieval indices for RAG-powered live responses (more frequent, targeted crawls). Most brands are primarily optimizing for the retrieval index.
Freshness sensitivity. AI retrieval indices need fresh content to provide accurate citations. Crawl frequency for retrieval is typically higher than for training data — but still measured in days, not minutes. dateModified schema and a clean sitemap help retrieval crawlers prioritize your updated pages.
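As an illustration of the dateModified signal, the sketch below builds a schema.org `Article` object with ISO 8601 timestamps, ready to embed in a `<script type="application/ld+json">` tag. The helper name `article_jsonld` and the sample values are assumptions, and whether a given crawler weighs this field is up to that crawler.

```python
import json
from datetime import datetime, timezone


def article_jsonld(url, headline, published, modified):
    """Build a schema.org Article dict with ISO 8601 date fields."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "url": url,
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }


if __name__ == "__main__":
    data = article_jsonld(
        "https://example.com/docs/setup",  # placeholder URL
        "Setup guide",
        datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc),
        datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
    )
    print(json.dumps(data, indent=2))
```

Update `dateModified` only when the page content actually changes, and keep it consistent with the `lastmod` value in your sitemap.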
Content extraction. AI crawlers extract natural language content more aggressively than search crawlers. JavaScript-rendered content, content behind paywalls, and content locked behind login walls may not be accessible. Ensure your most important public-facing content is server-rendered and accessible without authentication.
Diagnosing AI crawler access issues
Check your server logs. Filter access logs for the user-agent strings listed above. Zero hits from GPTBot or ClaudeBot over a 30-day period on a public-facing site almost always indicates a blocking issue.
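The log check can be scripted. This sketch assumes the common "combined" access log format, where the user agent is the last double-quoted field; adjust the pattern if your server logs differently. The sample log lines and the function name `count_ai_hits` are made up for illustration.

```python
import re
from collections import Counter

# Google-Extended is omitted: it is a robots.txt token, not a request user agent.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot",
             "meta-externalagent", "CCBot", "YouBot"]

# In combined log format the user agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines):
    """Count requests per AI crawler token found in the user-agent field."""
    counts = Counter({agent: 0 for agent in AI_AGENTS})
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                counts[agent] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [10/Mar/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
        '203.0.113.9 - - [10/Mar/2025:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    ]
    for agent, hits in count_ai_hits(sample).items():
        print(f"{agent}: {hits}")
```

To run it against a real file, pass the open file handle instead of the sample list. A user-agent match alone does not prove the request came from the named crawler.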
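Test your robots.txt. Run each AI crawler user-agent token through a robots.txt tester or parser that accepts an arbitrary user agent, and check the result against your current configuration. Google Search Console's robots.txt report only shows how Google itself fetched and parsed the file, so it will not tell you how other tokens are treated.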
Check your CDN and WAF rules. Cloudflare, Fastly, and similar platforms often have bot blocking rules that catch AI crawlers before they reach your origin. Check your WAF configuration for rules that block non-search-engine bots — these frequently capture legitimate AI crawlers.
Verify IP ownership. Malicious bots sometimes spoof legitimate user-agent strings. OpenAI, Anthropic, and Google publish their crawler IP ranges — verify that crawl traffic claiming to be GPTBot or ClaudeBot is actually coming from those documented ranges.
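Once you have a vendor's published ranges, the check itself is a CIDR membership test. The sketch below uses Python's standard `ipaddress` module; the CIDR blocks shown are reserved documentation placeholders (RFC 5737), not real crawler ranges, and the name `is_verified` is hypothetical.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges (RFC 5737).
# Replace them with the ranges each vendor actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_verified(agent, remote_ip, ranges=PUBLISHED_RANGES):
    """Return True if remote_ip falls inside a range listed for agent."""
    address = ipaddress.ip_address(remote_ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in ranges.get(agent, [])
    )


if __name__ == "__main__":
    print(is_verified("GPTBot", "192.0.2.44"))   # inside the placeholder range
    print(is_verified("GPTBot", "203.0.113.9"))  # outside it: treat as spoofed
```

Published ranges change over time, so refresh the list on a schedule rather than hard-coding it.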
After fixing crawler access
Once you’ve confirmed AI crawlers can access your site, use LLM Metrix to verify that the fix translates into improved citation rates. Track whether:
- Your pages start appearing more consistently in AI-generated answers for target queries
- Citation intelligence shows your URLs being retrieved rather than competitors’ URLs
- The accuracy of AI descriptions of your brand improves (a sign that crawlers are reading your updated content)
Expect 2–4 weeks for retrieval index updates to propagate after a crawl access fix, depending on how frequently each engine re-indexes your domain.