Last month I wanted a straight answer to a question most AI visibility write-ups dodge. When someone asks ChatGPT, Claude, Perplexity, Gemini, or Google AI Mode about a site I own, does that AI product actually fetch the page, or does it answer from an index it built earlier?
The way to get a straight answer was the unfashionable one. Read the nginx access log.
This post walks through what the logs captured across the five AI products I tested, what they did not capture, and what that difference lets a product safely track. Every claim in the sections that follow is either something the server logged or a structural fact documented by the vendor.
Two signals, not one
A marketer saying “my site got traffic from AI” could mean two different things, and the logs prove they are different things.
- Provider-side fetch. The AI provider itself hits my origin. The request usually arrives with a dedicated user-agent token, usually with no referrer, and usually inside a short burst while the model is deciding which page to cite.
- Real clickthrough visit. A human reads the AI answer, clicks a citation link, and arrives as a normal browser. Chrome-shaped user-agent, normal cookies, the AI product as the referrer.
Collapsing these into one AI-traffic number papers over the most useful distinction in the data. One is the model reaching out to read you. The other is a human reading you because the model pointed. Different lever, different measurement, different copy.
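The split is mechanical enough to sketch in code. The user-agent tokens and referrer hosts below come straight from the log excerpts later in this post; the function name and the dictionary layout are mine, not any vendor's API.

```python
# Minimal sketch of the two-signal split: provider-side fetch vs. real
# clickthrough visit. Tokens and hosts are the ones observed in this post.
RETRIEVAL_TOKENS = ("ChatGPT-User", "Claude-User", "Perplexity-User")

# google.com is deliberately absent: it is a broader bucket, and AI Mode
# cannot be isolated from classic Search over HTTP alone.
AI_REFERRER_HOSTS = ("chatgpt.com", "claude.ai", "perplexity.ai",
                     "gemini.google.com")

def classify(user_agent: str, referrer: str) -> str:
    if any(token in user_agent for token in RETRIEVAL_TOKENS):
        return "provider_fetch"   # the model reaching out to read you
    if any(host in referrer for host in AI_REFERRER_HOSTS):
        return "real_visit"       # a human the model pointed at you
    return "other"
```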
How I instrumented the experiment
Nothing exotic. A custom nginx log format that captures the bits the default combined format compresses out, plus a `tail -F` running next to a browser tab in each AI product.
```nginx
log_format ai_probe '$time_iso8601 $remote_addr "$request" $status '
                    '"$http_user_agent" "$http_referer"';
```
I prompted each AI product with questions engineered to make a citation or a live page fetch of a domain I control likely. I reran the same prompts across sessions and IPs so that a transient cache hit would not hide the retrieval path.
What ChatGPT did
Captured, reproducibly, across multiple runs:
- User-agent contained `ChatGPT-User/1.0`.
- No referrer on the captured requests.
- Multiple candidate pages fetched in tight bursts while ChatGPT was composing an answer.
- More than one source IP observed inside the same burst.
```
2026-03-18T14:23:41+00:00 203.0.113.42 "GET /a HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot" "-"
2026-03-18T14:23:41+00:00 203.0.113.58 "GET /b HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot" "-"
2026-03-18T14:23:42+00:00 203.0.113.42 "GET /c HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot" "-"
```
This is enough to state the finding plainly. ChatGPT performs provider-side origin retrieval through ChatGPT-User. The burst pattern across multiple IPs matches OpenAI’s own description of the agent in its bots documentation.
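The burst pattern is easy to make precise: treat consecutive fetches as one burst whenever the gap between them is small. A hedged sketch; the 2-second threshold is my arbitrary choice, not anything OpenAI documents:

```python
from datetime import datetime, timedelta

def bursts(timestamps, gap=timedelta(seconds=2)):
    """Group sorted fetch timestamps into bursts separated by more than `gap`."""
    groups, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            groups.append(current)   # gap too large: close the burst
            current = []
        current.append(ts)
    if current:
        groups.append(current)
    return groups
```

Run over the three log lines above, this yields a single three-request burst; a fetch minutes later would open a second one.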
What Claude did
Captured:
- User-agent contained `Claude-User/1.0`.
- No referrer on the captured requests.
- `/robots.txt` requested first.
- Redirects followed normally. A `/plugins` request turning into `/plugins/` was handled as expected.
```
2026-03-19T09:14:08+00:00 198.51.100.11 "GET /robots.txt HTTP/1.1" 200 "Mozilla/5.0 (compatible; Claude-User/1.0; [email protected])" "-"
2026-03-19T09:14:09+00:00 198.51.100.11 "GET /plugins HTTP/1.1" 301 "Mozilla/5.0 (compatible; Claude-User/1.0; [email protected])" "-"
2026-03-19T09:14:09+00:00 198.51.100.11 "GET /plugins/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; Claude-User/1.0; [email protected])" "-"
```
Same kind of finding. Claude performs provider-side origin retrieval through Claude-User. The robots precheck matches Anthropic’s documented behavior in its crawler docs.
What Perplexity did
Captured:
- User-agent contained `Perplexity-User/1.0`.
- A direct fetch observed on a specific product page.
- No referrer on that request.
```
2026-03-20T17:02:33+00:00 192.0.2.73 "GET /plugins/product-builder-for-woocommerce/ HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user" "-"
```
One thing I will not generalize. Perplexity fetched the origin in the runs I captured, but I only captured a few Perplexity runs, and Perplexity is architecturally capable of answering from its own index without hitting the origin. The safe wording is that Perplexity can perform direct origin retrieval; whether it always does is not something one log file can prove. See Perplexity’s bots documentation for their own framing.
What Google and Gemini did not prove
I captured real clickthrough visits from https://gemini.google.com/ and https://www.google.com/. Normal browser user-agents, tester IP, Google-product referrer. Those are real people arriving after reading an AI answer.
I did not capture a distinct provider-side fetch for Gemini or Google AI Mode. The structural reason matters more than the gap.
Google does not publish a distinct retrieval user-agent for Gemini. Per Google’s own crawler documentation, AI Overviews and AI Mode answer from the same Search index that Googlebot populates. A Gemini-User token that would show up in an access log does not exist, because Google does not emit one.
Three practical consequences worth stating out loud.
- A `Googlebot` hit on your origin cannot be attributed to Gemini versus regular Search from the request alone.
- Blocking `Google-Extended` does not stop `Googlebot`. It only controls whether `Googlebot`-crawled content may be used for Gemini training and grounding.
- “Google did not fetch my page during my test” is not structurally observable the way it is for ChatGPT or Claude. Silence from Google is not evidence of no fetch.
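The `Google-Extended` point is worth spelling out as actual directives. A hedged robots.txt sketch; `Google-Extended` and `Googlebot` are real vendor tokens, and the comments state my reading of Google's documentation, not Google's own wording:

```
# Opting out of Gemini training and grounding. This does NOT stop crawling.
User-agent: Google-Extended
Disallow: /

# Googlebot is governed separately and keeps crawling for Search,
# the same index AI Overviews and AI Mode answer from.
User-agent: Googlebot
Allow: /
```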
This asymmetry is the single most misreported finding in AI crawler write-ups. It should not be.
What a product can safely track
Putting only the proven layers together, there are two tracking classes a product can offer without overclaiming.
Provider fetch
Vendor-documented retrieval user-agents hitting your origin.
- `ChatGPT-User` (OpenAI)
- `Claude-User` (Anthropic)
- `Perplexity-User` (Perplexity)
- `Meta-ExternalFetcher` (Meta, a documented retrieval bot I did not observe in this run, but it belongs in the same class)
Real visit
Normal browser user-agent with an AI product as the referrer.
- `chatgpt.com`
- `claude.ai`
- `perplexity.ai`
- `gemini.google.com`
- `google.com` as a broader Google-origin bucket, with no way to isolate AI Mode from classic Search using HTTP alone
Search-indexing bots (`OAI-SearchBot`, `Claude-SearchBot`, `PerplexityBot`, `Googlebot`, `Bingbot`) should not be folded into the provider-fetch bucket. They are not a live retrieval signal for any specific user question. Mixing them in turns the metric into noise.
Training bots (`GPTBot`, `ClaudeBot`, `CCBot`) are a separate signal again, and they have no business inside a retrieval count.
Why careful wording matters
A product whose metric says “ChatGPT fetched your page” when the log actually shows `PerplexityBot` will be right about trends and wrong about individual rows. The first time a user looks at a row and knows it is wrong, the whole dashboard loses credibility.
Careful wording is boring and it is the only wording that survives a smart customer checking one row.
Appendix: vendor-documented bot taxonomy
The experiment proved three retrieval bots in action: `ChatGPT-User`, `Claude-User`, `Perplexity-User`. The table below is the full vendor-documented set across the major labs, classified by purpose.

| Vendor | Bot | Purpose |
| --- | --- | --- |
| OpenAI | `ChatGPT-User` | retrieval |
| OpenAI | `OAI-SearchBot` | search_indexing |
| OpenAI | `GPTBot` | training |
| Anthropic | `Claude-User` | retrieval |
| Anthropic | `Claude-SearchBot` | search_indexing |
| Anthropic | `ClaudeBot` | training |
| Perplexity | `Perplexity-User` | retrieval |
| Perplexity | `PerplexityBot` | search_indexing |
| Meta | `Meta-ExternalFetcher` | retrieval |
| Google | `Googlebot` | search_indexing |
| Microsoft | `Bingbot` | search_indexing |
| Common Crawl | `CCBot` | training |
- retrieval: user-initiated fetch, typically when a human pastes a link into the AI product or an agent follows a link on the user’s behalf.
- search_indexing: crawls pages so the AI product can cite them from its index at answer time.
- training: collects pages to train future models; not a live retrieval signal.
Five points to carry out of this table.
- There is no dedicated Gemini bot. Gemini answers use content indexed by `Googlebot` and gated by `Google-Extended`. Treating Gemini as if it had a retrieval user-agent is wrong.
- Claude has three distinct bots. `ClaudeBot` is training; `Claude-SearchBot` is Claude’s search index; only `Claude-User` is the user-facing retrieval signal. Mixing them up is the most common write-up mistake.
- Meta publishes at least two relevant bots. `Meta-ExternalFetcher` is the retrieval counterpart. Meta documents that it may bypass `robots.txt` because a human or agent followed the link.
- Copilot does not have a distinct crawler. Copilot is grounded in the Bing index populated by `Bingbot`.
- The proven-fetch classes in the experiment are the retrieval class of this table. That is a feature, not a coincidence. Training and search-indexing bots are not expected to hit the origin in response to a specific user query, so their absence in the logs is not evidence against them.
Check this on your own site
If the appendix above made you curious about your own robots.txt, our robots.txt checker reads your live file and tells you which retrieval, search, and training user-agents it allows or blocks. No account needed. That is the fastest way to turn the taxonomy in this post into one concrete answer about your domain.
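The same check can be run locally with nothing but the standard library. A hedged sketch using Python's `urllib.robotparser`; the robots.txt body below is a hypothetical example, not any real site's file:

```python
from urllib import robotparser

# Hypothetical robots.txt: block OpenAI's training bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Ask per user-agent token whether the root path is fetchable.
for agent in ("ChatGPT-User", "GPTBot", "Googlebot"):
    print(agent, rp.can_fetch(agent, "/"))
# ChatGPT-User True / GPTBot False / Googlebot True
```

Note that under these rules the retrieval agent `ChatGPT-User` stays allowed while the training bot `GPTBot` is blocked, which is exactly the retrieval-versus-training separation this post argues for.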