Blog · April 28, 2026 · 13 min read
The Local Business Owner's Guide to llms.txt, robots.txt, and AI Crawler Access
About a third of small-business sites we audit are accidentally blocking AI bots — through a security plugin, a Cloudflare toggle, or a robots.txt that someone added five years ago and forgot.
If your site is one of them, ChatGPT, Claude, and Perplexity will never recommend you, no matter how good your content is. They can't recommend what they can't read.
This guide covers every AI crawler you should know, the robots.txt and llms.txt templates that work, and how to test what AI actually sees when it visits your site. It's the most technical post on this blog, but every step is something a non-technical owner can run themselves with a browser and a terminal.
The two files that decide if AI can read your site
Two small text files, both at the root of your domain, control almost everything about how AI sees you:
- robots.txt — the decades-old standard for telling crawlers which parts of your site are off-limits. Required reading for every AI bot.
- llms.txt — a new (2024) standard for telling AI which pages on your site are most worth reading. Optional today, increasingly important.
They live at yourdomain.com/robots.txt and yourdomain.com/llms.txt. Anyone can fetch them. Open them in a browser right now and see what you have.
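Prefer the terminal? The same check is two fetches (swap in your own domain):

```
curl https://yourdomain.com/robots.txt
curl https://yourdomain.com/llms.txt
```

A 404 on llms.txt just means you haven't created one yet. A 404 on robots.txt is harmless too: with no file present, crawlers default to "everything allowed."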
The 12 AI crawlers you should know
Every major AI company runs at least two crawlers: a training crawler that builds the model's background knowledge, and a live-user crawler that fetches your site in real time when someone asks a question. You almost always want both allowed.
| Crawler | Company / product | What it does |
| --- | --- | --- |
| GPTBot | OpenAI / ChatGPT | Builds ChatGPT's knowledge base. Blocking it removes you from training data, meaning ChatGPT will never remember you. |
| ChatGPT-User | OpenAI / ChatGPT | Fetches your site when a real user asks ChatGPT a question that needs current info. The most important crawler for live recommendations. |
| OAI-SearchBot | OpenAI / SearchGPT | Powers SearchGPT's real-time search index. Allow it if you want to appear in OpenAI's search-style results. |
| ClaudeBot | Anthropic / Claude | Anthropic's primary training crawler. Strict about robots.txt; even a typo can lock it out. |
| Claude-User | Anthropic / Claude | Fetches your site when a Claude user asks something that needs current data. Allow this even if you block ClaudeBot. |
| PerplexityBot | Perplexity | Indexes the web for Perplexity's answer engine, which displays source links prominently to users. |
| Perplexity-User | Perplexity | Real-time fetch when a user asks Perplexity a question. The companion to PerplexityBot. |
| Google-Extended | Google / Gemini | Controls whether your site is used to train Gemini. Independent of regular Googlebot: you can allow regular search and block training, or vice versa. |
| Applebot-Extended | Apple / Siri, Apple Intelligence | Apple's AI training opt-out. Allow it to be considered for Siri and Apple Intelligence answers. |
| CCBot | Common Crawl | Open web archive that feeds many AI training datasets, including those behind Mistral, Llama, and dozens of smaller models. |
| MistralAI-User | Mistral / Le Chat | Fetches when a Le Chat user needs live web data. Smaller volume than the major three, but growing in Europe. |
| Meta-ExternalAgent | Meta / Meta AI | Meta's AI crawler for Llama models and Meta AI search. Independent of the Facebook and Instagram crawlers. |
The robots.txt template that works
For most local businesses, the right move is to allow everything. Here's a copy-paste-ready robots.txt that explicitly opts in to all major AI crawlers:
```
# Allow major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

# Default for any other bot
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Save this as robots.txt at the root of your site. Replace the sitemap URL with yours. That's it.
When you might want to block (rare for local businesses)
Some legitimate reasons to block specific AI bots:
- You publish proprietary research or paid content you don't want absorbed into AI training
- Your site is bandwidth-constrained and AI bots are causing real load problems (verify in your server logs first; see the grep example below)
- You want to be visible in live AI answers but not absorbed into training data — block the training crawler (e.g. GPTBot) and allow the live crawler (ChatGPT-User), as in the snippet below
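For that third case, the robots.txt pattern is short. A minimal sketch using OpenAI's pair; the same shape works for Anthropic's ClaudeBot / Claude-User:

```
# Opt out of training, stay visible in live answers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
```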
For 99% of local businesses, none of these apply. The default should be allow-everything.
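If you do suspect the bandwidth case from the list above, the server-log check is one command. A sketch assuming an nginx-style access log; your log path will vary by host:

```bash
# Count requests from the big AI crawlers in your access log
grep -icE "gptbot|claudebot|ccbot|perplexitybot" /var/log/nginx/access.log
```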
llms.txt: the new standard
llms.txt is a proposal from Jeremy Howard (Answer.AI), published in September 2024. The idea is simple: give AI a Markdown file that tells it which pages on your site are most important and what they cover, in a format an LLM can read directly.
As of mid-2026, llms.txt is not yet required by any major AI. But Anthropic and OpenAI have both signaled interest, and a growing list of agentic AI tools (Cursor, Cline, browser agents) explicitly read llms.txt as a hint for what to fetch first. Adopting it now puts you ahead of the curve.
The format
A valid llms.txt has four sections:
- H1 title — the name of your business or project (required)
- Blockquote summary — one or two sentences (required)
- Optional details — extra context paragraphs (optional)
- H2 sections with link lists — your most important pages, grouped by topic (required)
The template for a local business
```markdown
# Smith Plumbing

> Family-owned plumbing service in Austin, TX. Specializing in
> emergency repair, water heater installation, and drain cleaning
> for residential and small commercial customers.

## Key pages

- [Homepage](https://smithplumbing.com/): Overview of services and service area
- [Services](https://smithplumbing.com/services/): Full service list with starting prices
- [Service area](https://smithplumbing.com/areas/): Cities and zip codes we serve
- [Pricing](https://smithplumbing.com/pricing/): Transparent pricing for common jobs
- [FAQ](https://smithplumbing.com/faq/): Top customer questions answered

## Trust signals

- [About](https://smithplumbing.com/about/): Owner bio, license, founding date
- [Reviews](https://smithplumbing.com/reviews/): 247 reviews, 4.9 average

## Contact

- Phone: (512) 555-0100
- Email: hello@smithplumbing.com
- Hours: Mon–Fri 8am–5pm; 24/7 emergency line
```
Save this as llms.txt at the root of your site. The whole file should be under 5 KB — this is a hint, not your full content. If you want to publish a longer machine-readable version, use llms-full.txt for that instead and reference it from your llms.txt.
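The llms.txt proposal includes an `## Optional` section for links an AI can skip when context is tight, and that's a natural place for the pointer. A sketch, reusing the example domain from the template above:

```markdown
## Optional

- [llms-full.txt](https://smithplumbing.com/llms-full.txt): Expanded machine-readable version with full page content
```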
The Cloudflare gotcha
In July 2024, Cloudflare added a one-click “Block AI Scrapers and Crawlers” feature to every account. It was on by default for some plan tiers, and many other small-business owners enabled it without realizing what it did.
If your site is on Cloudflare and your AI visibility scan shows zero crawler access, this is the first place to check:
- Log in to dash.cloudflare.com
- Select your domain
- Go to Security → Bots
- Find AI Scrapers and Crawlers and switch it OFF
- Also check Bot Fight Mode (and Super Bot Fight Mode on paid plans): its most aggressive settings block legitimate AI bots, so turn it off or use the least restrictive option while you test
Changes propagate within minutes. Re-run an AI visibility scan afterward to confirm.
The security plugin gotcha
WordPress security plugins are the second most common accidental AI block. Three to check first:
- Wordfence: Dashboard → Firewall → Blocking. Look for any custom rule blocking unknown user agents or specific IPs in OpenAI/Anthropic ranges.
- Sucuri: Settings → Bot Protection. Disable aggressive bot blocking, or add AI crawler user agents to the allowlist.
- iThemes Security (Solid Security): Settings → Advanced → Hide Backend & Banned Users. Check that the “HackRepair” blacklist isn't enabled, and review any custom user-agent rules.
How to test what AI sees
The most reliable test is to spoof an AI user-agent and see what your server returns. If you have access to a terminal:
```bash
# Pretend to be GPTBot and fetch your homepage
curl -A "GPTBot/1.0" -I https://yourdomain.com

# Pretend to be ClaudeBot
curl -A "ClaudeBot/1.0" -I https://yourdomain.com

# Pretend to be PerplexityBot
curl -A "PerplexityBot/1.0" -I https://yourdomain.com

# Look for "200 OK" — anything else (403, 503, 429) means you're being blocked.
```
Run those commands. You want HTTP/2 200 or HTTP/1.1 200 OK. Any of the following means you're blocked:
- 403 Forbidden — your server is explicitly blocking the bot
- 503 Service Unavailable — Cloudflare or another CDN is challenging the request
- 429 Too Many Requests — you're rate-limiting the bot
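To test all twelve crawlers in one pass, a short shell loop works. A sketch: the `/1.0` version suffixes are simplified (real bots send longer user-agent strings), but user-agent blocks almost always match on the bot name:

```bash
#!/usr/bin/env bash
# Usage: ./check-bots.sh https://yourdomain.com
site="${1:?usage: $0 https://yourdomain.com}"

bots=(GPTBot ChatGPT-User OAI-SearchBot ClaudeBot Claude-User
      PerplexityBot Perplexity-User Google-Extended Applebot-Extended
      CCBot MistralAI-User Meta-ExternalAgent)

for bot in "${bots[@]}"; do
  # -s silent, -o discard the body, -w print only the HTTP status code
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$bot/1.0" "$site")
  printf "%-20s %s\n" "$bot" "$code"   # anything other than 200 means blocked
done
```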
Don't have a terminal? The free AI visibility scan does this automatically — it sends requests pretending to be each major AI bot and reports back what they see.
How often AI bots re-fetch your site
After you fix crawler access, AI doesn't see the changes immediately. Each AI has its own re-crawl rhythm:
- ChatGPT-User / Claude-User / Perplexity-User: instant — these fetch live whenever a user asks a question that needs your data
- GPTBot / ClaudeBot / Google-Extended: weeks to months — training crawlers re-index on slower schedules; new content typically appears in 4–8 weeks
- CCBot: roughly monthly full crawl
The practical implication: live citation lift can happen the same day. Training-data lift takes a couple of months. Both compound over time.
The bottom line
Crawler access is the foundation. Schema markup, content quality, E-E-A-T signals — none of them matter if AI can't fetch your site in the first place.
Five minutes with your robots.txt, ten minutes adding an llms.txt, and a check on your Cloudflare and security plugin settings. That's the entire access layer. Once it's working, every other AI visibility improvement stacks on top.
Find out which AI crawlers can read your site
Our free scan sends requests pretending to be GPTBot, ClaudeBot, PerplexityBot, and 9 other AI crawlers — and shows you exactly who's being blocked. Plus it checks your llms.txt, sitemap, schema, and 30 other AI visibility signals. Under 10 seconds. No signup.
Run my free scan →

Written by the team at Kesem Marketing, a digital agency helping small businesses get found in the AI-first era.