I prompted ChatGPT, Claude, Perplexity, and Gemini and watched my Nginx logs

Original link: https://surfacedby.com/blog/nginx-logs-ai-traffic-vs-referral-traffic



Last month I wanted a straight answer to a question most AI visibility write-ups dodge. When someone asks ChatGPT, Claude, Perplexity, Gemini, or Google AI Mode about a site I own, does that AI product actually fetch the page, or does it answer from an index it built earlier?

The way to get a straight answer was the unfashionable one. Read the nginx access log.

This post walks through what the logs captured across the five AI products I tested, what they did not capture, and what that difference lets a product safely track. Every claim in the sections that follow is either something the server logged or a structural fact documented by the vendor.

Two signals, not one

A marketer saying “my site got traffic from AI” could mean two different things, and the logs prove they are different things.

  • Provider-side fetch. The AI provider itself hits my origin. The request usually arrives with a dedicated user-agent token, usually with no referrer, and usually inside a short burst while the model is deciding which page to cite.
  • Real clickthrough visit. A human reads the AI answer, clicks a citation link, and arrives as a normal browser. Chrome-shaped user-agent, normal cookies, the AI product as the referrer.

Collapsing these into one AI-traffic number papers over the most useful distinction in the data. One is the model reaching out to read you. The other is a human reading you because the model pointed. Different lever, different measurement, different copy.
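The two signals can be told apart mechanically. Here is a minimal sketch of that split, assuming the user-agent tokens and referrer hosts described later in this post; the function and constant names are mine, not part of any vendor API.

```python
# Illustrative classifier for the two signal classes described above.
# Token and host lists are assumptions drawn from this post, not exhaustive.
PROVIDER_FETCH_TOKENS = ("ChatGPT-User", "Claude-User", "Perplexity-User")
AI_REFERRER_HOSTS = ("chatgpt.com", "claude.ai", "perplexity.ai", "gemini.google.com")

def classify(user_agent: str, referer: str) -> str:
    """Return 'provider_fetch', 'real_visit', or 'other' for one request."""
    if any(tok in user_agent for tok in PROVIDER_FETCH_TOKENS):
        return "provider_fetch"   # the model reaching out to read you
    if any(host in referer for host in AI_REFERRER_HOSTS):
        return "real_visit"       # a human arriving because the model pointed
    return "other"
```

Note the ordering: a dedicated retrieval token wins even if a referrer is somehow present, since provider-side fetches are the rarer and more specific signal.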

How I instrumented the experiment

Nothing exotic. A custom nginx log format that captures the bits the default combined format compresses out, plus a tail -F next to a browser tab in each AI product.

log_format ai_probe '$time_iso8601 $remote_addr "$request" $status '
                    '"$http_user_agent" "$http_referer"';

I prompted each AI product with questions designed to elicit a citation or force a page fetch from a domain I control. I reran the same prompts across sessions and IPs so a transient cache hit would not hide the retrieval path.

What ChatGPT did

Captured, reproducibly, across multiple runs:

  • User-agent contained ChatGPT-User/1.0.
  • No referrer on the captured requests.
  • Multiple candidate pages fetched in tight bursts while ChatGPT was composing an answer.
  • More than one source IP observed inside the same burst.
2026-03-18T14:23:41+00:00 203.0.113.42 "GET /a HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot" "-"
2026-03-18T14:23:41+00:00 203.0.113.58 "GET /b HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot" "-"
2026-03-18T14:23:42+00:00 203.0.113.42 "GET /c HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot" "-"

This is enough to state the finding plainly. ChatGPT performs provider-side origin retrieval through ChatGPT-User. The burst pattern across multiple IPs matches OpenAI’s own description of the agent in its bots documentation.

What Claude did

Captured:

  • User-agent contained Claude-User/1.0.
  • No referrer on the captured requests.
  • /robots.txt requested first.
  • Redirects followed normally. A /plugins request turning into /plugins/ was handled as expected.
2026-03-19T09:14:08+00:00 198.51.100.11 "GET /robots.txt HTTP/1.1" 200 "Mozilla/5.0 (compatible; Claude-User/1.0; [email protected])" "-"
2026-03-19T09:14:09+00:00 198.51.100.11 "GET /plugins HTTP/1.1" 301 "Mozilla/5.0 (compatible; Claude-User/1.0; [email protected])" "-"
2026-03-19T09:14:09+00:00 198.51.100.11 "GET /plugins/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; Claude-User/1.0; [email protected])" "-"

Same kind of finding. Claude performs provider-side origin retrieval through Claude-User. The robots precheck matches Anthropic’s documented behavior in its crawler docs.

What Perplexity did

Captured:

  • User-agent contained Perplexity-User/1.0.
  • A direct fetch observed on a specific product page.
  • No referrer on that request.
2026-03-20T17:02:33+00:00 192.0.2.73 "GET /plugins/product-builder-for-woocommerce/ HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user" "-"

One thing I will not generalize. Perplexity fetched the origin in the runs I captured, but I only captured a few Perplexity runs, and Perplexity is architecturally capable of answering from its own index without hitting the origin. The safe wording is that Perplexity can perform direct origin retrieval; whether it always does is not something one log file can prove. See Perplexity’s bots documentation for their own framing.

What Google and Gemini did not prove

I captured real clickthrough visits from https://gemini.google.com/ and https://www.google.com/. Normal browser user-agents, tester IP, Google-product referrer. Those are real people arriving after reading an AI answer.

I did not capture a distinct provider-side fetch for Gemini or Google AI Mode. The structural reason matters more than the gap.

Google does not publish a distinct retrieval user-agent for Gemini. Per Google’s own crawler documentation, AI Overviews and AI Mode answer from the same Search index that Googlebot populates. A Gemini-User token that would show up in an access log does not exist, because Google does not emit one.

Three practical consequences worth stating out loud.

  1. A Googlebot hit on your origin cannot be attributed to Gemini versus regular Search from the request alone.
  2. Blocking Google-Extended does not stop Googlebot. It only controls whether Googlebot-crawled content may be used for Gemini training and grounding.
  3. “Google did not fetch my page during my test” is not structurally observable the way it is for ChatGPT or Claude. Silence from Google is not evidence of no fetch.
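Point 2 can be made concrete. Per Google's documentation, Google-Extended is a robots.txt control token, not a separate crawler, so the following sketch opts content out of Gemini training and grounding while leaving Search crawling untouched:

```text
# Googlebot still crawls and indexes normally
User-agent: Googlebot
Allow: /

# Opt Googlebot-crawled content out of Gemini training and grounding
User-agent: Google-Extended
Disallow: /
```

No request disappears from your access log as a result of the second rule; only the downstream use of the crawled content changes.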

This asymmetry is the single most misreported finding in AI crawler write-ups. It should not be.

What a product can safely track

Putting only the proven layers together, there are two tracking classes a product can offer without overclaiming.

Provider fetch

Vendor-documented retrieval user-agents hitting your origin.

  • ChatGPT-User (OpenAI)
  • Claude-User (Anthropic)
  • Perplexity-User (Perplexity)
  • Meta-ExternalFetcher (Meta; a documented retrieval bot I did not observe in this run, but it belongs in the same class)

Real visit

Normal browser user-agent with an AI product as the referrer.

  • chatgpt.com
  • claude.ai
  • perplexity.ai
  • gemini.google.com
  • google.com as a broader Google-origin bucket, with no way to isolate AI Mode from classic Search using HTTP alone

Search-indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Bingbot) should not be folded into the provider-fetch bucket. They are not a live retrieval signal for any specific user question. Mixing them in turns the metric into noise.

Training bots (GPTBot, ClaudeBot, CCBot) are a separate signal again, and they have no business inside a retrieval count.
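The three-way split above can be encoded directly. A sketch, with token lists taken from this post's own taxonomy; verify them against the vendor docs before relying on this in production:

```python
# Bucket a user-agent string into the three classes described above.
# Lists are drawn from this post, not guaranteed exhaustive or current.
RETRIEVAL = ("ChatGPT-User", "Claude-User", "Perplexity-User", "Meta-ExternalFetcher")
SEARCH_INDEXING = ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
                   "Googlebot", "Bingbot")
TRAINING = ("GPTBot", "ClaudeBot", "CCBot")

def bucket(user_agent: str) -> str:
    """Return 'retrieval', 'search_indexing', 'training', or 'unclassified'."""
    if any(tok in user_agent for tok in RETRIEVAL):
        return "retrieval"
    if any(tok in user_agent for tok in SEARCH_INDEXING):
        return "search_indexing"
    if any(tok in user_agent for tok in TRAINING):
        return "training"
    return "unclassified"
```

Only the first bucket should feed a "provider fetch" metric; the other two are separate signals, per the argument above.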

Why careful wording matters

A product whose metric says "ChatGPT fetched your page" when the log actually shows PerplexityBot will be right about trends and wrong about individual rows. The first time a user looks at a row and knows it is wrong, the whole dashboard loses credibility.

Careful wording is boring and it is the only wording that survives a smart customer checking one row.

Appendix: vendor-documented bot taxonomy

The experiment proved three retrieval bots in action: ChatGPT-User, Claude-User, Perplexity-User. The table below is the full vendor-documented set across the major labs, classified by purpose.

  • retrieval: user-initiated fetch, typically when a human pastes a link into the AI product or an agent follows a link on the user’s behalf.
  • search_indexing: crawls pages so the AI product can cite them from its index at answer time.
  • training: collects pages to train future models; not a live retrieval signal.

  Vendor        Bot                    Class
  OpenAI        ChatGPT-User           retrieval
  OpenAI        OAI-SearchBot          search_indexing
  OpenAI        GPTBot                 training
  Anthropic     Claude-User            retrieval
  Anthropic     Claude-SearchBot       search_indexing
  Anthropic     ClaudeBot              training
  Perplexity    Perplexity-User        retrieval
  Perplexity    PerplexityBot          search_indexing
  Meta          Meta-ExternalFetcher   retrieval
  Google        Googlebot              search_indexing
  Microsoft     Bingbot                search_indexing
  Common Crawl  CCBot                  training

Five points to carry out of this table.

  1. There is no dedicated Gemini bot. Gemini answers use content indexed by Googlebot and gated by Google-Extended. Treating Gemini as if it had a retrieval user-agent is wrong.
  2. Claude has three distinct bots. ClaudeBot is training; Claude-SearchBot is Claude’s search index; only Claude-User is the user-facing retrieval signal. Mixing them up is the most common write-up mistake.
  3. Meta publishes at least two relevant bots; Meta-ExternalFetcher is the retrieval-class one. Meta documents that it may bypass robots.txt because a human or agent followed the link.
  4. Copilot does not have a distinct crawler. Copilot is grounded in the Bing index populated by Bingbot.
  5. The proven-fetch classes in the experiment are the retrieval class of this table. That is a feature, not a coincidence. Training and search-indexing bots are not expected to hit the origin in response to a specific user query, so their absence in the logs is not evidence against them.

Check this on your own site

If the appendix above made you curious about your own robots.txt, our robots.txt checker reads your live file and tells you which retrieval, search, and training user-agents it allows or blocks. No account needed. That is the fastest way to turn the taxonomy in this post into one concrete answer about your domain.
