Blog · April 28, 2026 · 13 min read
The Local Business Owner's Guide to llms.txt, robots.txt, and AI Crawler Access
About a third of small-business sites we audit are accidentally blocking AI bots — through a security plugin, a Cloudflare toggle, or a robots.txt that someone added five years ago and forgot.
If your site is one of them, ChatGPT, Claude, and Perplexity will never recommend you, no matter how good your content is. They can't recommend what they can't read.
This guide covers every AI crawler you should know, the robots.txt and llms.txt templates that work, and how to test what AI actually sees when it visits your site. It's the most technical post on this blog, but every step is something a non-technical owner can run themselves with a browser and a terminal.
The two files that decide if AI can read your site
Two small text files, both at the root of your domain, control almost everything about how AI sees you:
- robots.txt — the decades-old standard for telling crawlers which parts of your site are off-limits. Required reading for every AI bot.
- llms.txt — a new (2024) standard for telling AI which pages on your site are most worth reading. Optional today, increasingly important.
They live at yourdomain.com/robots.txt and yourdomain.com/llms.txt. Anyone can fetch them. Open them in a browser right now and see what you have.
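Prefer the terminal? The same check is two fetches (swap in your own domain):

```
curl https://yourdomain.com/robots.txt
curl https://yourdomain.com/llms.txt
```

A 404 on llms.txt just means you haven't created one yet. A 404 on robots.txt is harmless too: with no file present, crawlers default to "everything allowed."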
The 12 AI crawlers you should know
Every major AI company runs at least two crawlers: a training crawler that builds the model's background knowledge, and a live-user crawler that fetches your site in real time when someone asks a question. You almost always want both allowed.
| Crawler | Company / product | What it does |
| --- | --- | --- |
| GPTBot | OpenAI / ChatGPT | Builds ChatGPT's knowledge base. Blocking it removes you from training data, meaning ChatGPT will never remember you. |
| ChatGPT-User | OpenAI / ChatGPT | Fetches your site when a real user asks ChatGPT a question that needs current info. The most important crawler for live recommendations. |
| OAI-SearchBot | OpenAI / SearchGPT | Powers SearchGPT's real-time search index. Allow it if you want to appear in OpenAI's search-style results. |
| ClaudeBot | Anthropic / Claude | Anthropic's primary training crawler. Strict about robots.txt; even a typo can lock it out. |
| Claude-User | Anthropic / Claude | Fetches your site when a Claude user asks something that needs current data. Allow this even if you block ClaudeBot. |
| PerplexityBot | Perplexity | Indexes the web for Perplexity's answer engine, which displays source links prominently to users. |
| Perplexity-User | Perplexity | Real-time fetch when a user asks Perplexity a question. The companion to PerplexityBot. |
| Google-Extended | Google / Gemini | Controls whether your site is used to train Gemini. Independent of regular Googlebot: you can allow regular search and block training, or vice versa. |
| Applebot-Extended | Apple / Siri, Apple Intelligence | Apple's AI training opt-out. Allow it to be considered for Siri and Apple Intelligence answers. |
| CCBot | Common Crawl | Open web archive that feeds many AI training datasets, including those behind Mistral, Llama, and dozens of smaller models. |
| MistralAI-User | Mistral / Le Chat | Fetches when a Le Chat user needs live web data. Smaller volume than the major three, but growing in Europe. |
| Meta-ExternalAgent | Meta / Meta AI | Meta's AI crawler for Llama models and Meta AI search. Independent of the Facebook and Instagram crawlers. |
The robots.txt template that works
For most local businesses, the right move is to allow everything. Here's a copy-paste-ready robots.txt that explicitly opts in to all major AI crawlers:
```
# Allow major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

# Default for any other bot
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Save this as robots.txt at the root of your site. Replace the sitemap URL with yours. That's it.
When you might want to block (rare for local businesses)
Some legitimate reasons to block specific AI bots:
- You publish proprietary research or paid content you don't want absorbed into AI training
- Your site is bandwidth-constrained and AI bots are causing real load problems (verify in your server logs first; see the grep example below)
- You want to be visible in live AI answers but not absorbed into training data — block the training crawler (e.g. GPTBot) and allow the live crawler (ChatGPT-User), as in the snippet below
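For that third case, the robots.txt pattern is short. A minimal sketch using OpenAI's pair; the same shape works for Anthropic's ClaudeBot / Claude-User:

```
# Opt out of training, stay visible in live answers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
```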
For 99% of local businesses, none of these apply. The default should be allow-everything.
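If you do suspect the bandwidth case from the list above, the server-log check is one command. A sketch assuming an nginx-style access log; your log path will vary by host:

```bash
# Count requests from the big AI crawlers in your access log
grep -icE "gptbot|claudebot|ccbot|perplexitybot" /var/log/nginx/access.log
```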
llms.txt: the new standard
llms.txt is a proposal from Jeremy Howard (Answer.AI), published in September 2024. The idea is simple: give AI a Markdown file that tells it which pages on your site are most important and what they cover, in a format an LLM can read directly.
As of mid-2026, llms.txt is not yet required by any major AI. But Anthropic and OpenAI have both signaled interest, and a growing list of agentic AI tools (Cursor, Cline, browser agents) explicitly read llms.txt as a hint for what to fetch first. Adopting it now puts you ahead of the curve.
The format
A valid llms.txt has four sections:
- H1 title — the name of your business or project (required)
- Blockquote summary — one or two sentences (required)
- Optional details — extra context paragraphs (optional)
- H2 sections with link lists — your most important pages, grouped by topic (required)
The template for a local business
```markdown
# Smith Plumbing

> Family-owned plumbing service in Austin, TX. Specializing in
> emergency repair, water heater installation, and drain cleaning
> for residential and small commercial customers.

## Key pages

- [Homepage](https://smithplumbing.com/): Overview of services and service area
- [Services](https://smithplumbing.com/services/): Full service list with starting prices
- [Service area](https://smithplumbing.com/areas/): Cities and zip codes we serve
- [Pricing](https://smithplumbing.com/pricing/): Transparent pricing for common jobs
- [FAQ](https://smithplumbing.com/faq/): Top customer questions answered

## Trust signals

- [About](https://smithplumbing.com/about/): Owner bio, license, founding date
- [Reviews](https://smithplumbing.com/reviews/): 247 reviews, 4.9 average

## Contact

- Phone: (512) 555-0100
- Email: hello@smithplumbing.com
- Hours: Mon–Fri 8am–5pm; 24/7 emergency line
```
Save this as llms.txt at the root of your site. The whole file should be under 5 KB — this is a hint, not your full content. If you want to publish a longer machine-readable version, use llms-full.txt for that instead and reference it from your llms.txt.
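The llms.txt proposal includes an `## Optional` section for links an AI can skip when context is tight, and that's a natural place for the pointer. A sketch, reusing the example domain from the template above:

```markdown
## Optional

- [llms-full.txt](https://smithplumbing.com/llms-full.txt): Expanded machine-readable version with full page content
```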
The Cloudflare gotcha
In July 2024, Cloudflare added a one-click “Block AI Scrapers and Crawlers” feature to every account. It was on by default for some plan tiers, and many other small-business owners enabled it without realizing what it did.
If your site is on Cloudflare and your AI visibility scan shows zero crawler access, this is the first place to check:
- Log in to dash.cloudflare.com
- Select your domain
- Go to Security → Bots
- Find AI Scrapers and Crawlers and switch it OFF
- Also check Bot Fight Mode (and Super Bot Fight Mode on paid plans): its most aggressive settings block legitimate AI bots, so turn it off or use the least restrictive option while you test
Changes propagate within minutes. Re-run an AI visibility scan afterward to confirm.
The security plugin gotcha
WordPress security plugins are the second most common accidental AI block. Three to check first:
- Wordfence: Dashboard → Firewall → Blocking. Look for any custom rule blocking unknown user agents or specific IPs in OpenAI/Anthropic ranges.
- Sucuri: Settings → Bot Protection. Disable aggressive bot blocking, or add AI crawler user agents to the allowlist.
- iThemes Security (Solid Security): Settings → Advanced → Hide Backend & Banned Users. Check that the “HackRepair” blacklist isn't enabled, and review any custom user-agent rules.
How to test what AI sees
The most reliable test is to spoof an AI user-agent and see what your server returns. If you have access to a terminal:
```bash
# Pretend to be GPTBot and fetch your homepage
curl -A "GPTBot/1.0" -I https://yourdomain.com

# Pretend to be ClaudeBot
curl -A "ClaudeBot/1.0" -I https://yourdomain.com

# Pretend to be PerplexityBot
curl -A "PerplexityBot/1.0" -I https://yourdomain.com

# Look for "200 OK" — anything else (403, 503, 429) means you're being blocked.
```
Run those commands. You want HTTP/2 200 or HTTP/1.1 200 OK. Any of the following means you're blocked:
- 403 Forbidden — your server is explicitly blocking the bot
- 503 Service Unavailable — Cloudflare or another CDN is challenging the request
- 429 Too Many Requests — you're rate-limiting the bot
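To test all twelve crawlers in one pass, a short shell loop works. A sketch: the `/1.0` version suffixes are simplified (real bots send longer user-agent strings), but user-agent blocks almost always match on the bot name:

```bash
#!/usr/bin/env bash
# Usage: ./check-bots.sh https://yourdomain.com
site="${1:?usage: $0 https://yourdomain.com}"

bots=(GPTBot ChatGPT-User OAI-SearchBot ClaudeBot Claude-User
      PerplexityBot Perplexity-User Google-Extended Applebot-Extended
      CCBot MistralAI-User Meta-ExternalAgent)

for bot in "${bots[@]}"; do
  # -s silent, -o discard the body, -w print only the HTTP status code
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$bot/1.0" "$site")
  printf "%-20s %s\n" "$bot" "$code"   # anything other than 200 means blocked
done
```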
Don't have a terminal? The free AI visibility scan does this automatically — it sends requests pretending to be each major AI bot and reports back what they see.
How often AI bots re-fetch your site
After you fix crawler access, AI doesn't see the changes immediately. Each AI has its own re-crawl rhythm:
- ChatGPT-User / Claude-User / Perplexity-User: instant — these fetch live whenever a user asks a question that needs your data
- GPTBot / ClaudeBot / Google-Extended: weeks to months — training crawlers re-index on slower schedules; new content typically appears in 4–8 weeks
- CCBot: roughly monthly full crawl
The practical implication: live citation lift can happen the same day. Training-data lift takes a couple of months. Both compound over time.
The bottom line
Crawler access is the foundation. Schema markup, content quality, E-E-A-T signals — none of them matter if AI can't fetch your site in the first place.
Five minutes with your robots.txt, ten minutes adding an llms.txt, and a check on your Cloudflare and security plugin settings. That's the entire access layer. Once it's working, every other AI visibility improvement stacks on top.
Find out which AI crawlers can read your site
Our free scan sends requests pretending to be GPTBot, ClaudeBot, PerplexityBot, and 9 other AI crawlers — and shows you exactly who's being blocked. Plus it checks your llms.txt, sitemap, schema, and 30 other AI visibility signals. Under 10 seconds. No signup.
Run my free scan →

Written by the team at Kesem Marketing, a digital agency helping small businesses get found in the AI-first era.