I rebuilt my blog's cache. Bots are the audience now

Original link: https://hoeijmakers.net/thirty-years-of-caching-sorted-in-an-afternoon/

For decades, HTTP caching has been a frustrating and complicated topic despite its importance. The author, a web developer since the early nineties, found that existing documentation was not enough to truly *understand* and control a caching strategy. With the arrival of AI tools like Claude and ChatGPT, that changed. Much as identifying screws with AI's help made it possible to sort a messy screw collection, the author could finally grasp the complexities of caching (headers, TTLs, CDN behaviour) through interactive questions and explanations tailored to his setup. The urgency of mastering caching, however, did not come from improving performance for human users. Web traffic has shifted significantly: a growing share now comes from AI crawlers, search-engine indexers, and other machine readers, which makes effective caching essential. These systems prioritise efficient data access over rendering, turning the cache into a core piece of infrastructure. The author implemented a caching strategy with Cloudflare, not to speed up page loads for people, but to give these increasingly important machine readers reliable, cost-effective access. AI did not invent caching, but it provided the "instrument" to finally *see* and shape that system.

## Bots now dominate web traffic, raising concerns among creators

The recent Hacker News discussion of robhoeijmakers rebuilding his blog's cache highlights a growing trend: web traffic is increasingly driven by bots, particularly AI crawlers, rather than human users. The author prioritised fast responses for these bots, recognising that they are now the primary audience. The shift raises concerns, however. Many commenters expressed frustration that AI is scraping content for training purposes *without* attribution or any benefit to the original creators, which can hurt site traffic and revenue. Some are blocking AI crawlers, but sophisticated bots are increasingly hard to detect, requiring solutions like Cloudflare. The core issue is the effect on building an online reputation: traditionally, writing accrued "symbolic capital", but that benefit disappears if only bots are reading. Several users reported sharp traffic drops because AI answers questions directly instead of linking back to the source site, echoing the "dead internet" thesis. Some creators have abandoned their sites entirely, believing their work is being exploited.
Related articles

Original article

HTTP caching never quite made sense, until AI tools made it legible enough to actually implement. And the reason it finally mattered: the audience had quietly changed.

Thirty Years of Caching, Sorted in an Afternoon

I have a jar of screws on my workbench. For years, I would fish through it looking for the right size, usually not finding it.

Last week I sorted them, by type, by thread, by length. I used ChatGPT to help: photographed a handful, asked what I was looking at, got the taxonomy straight. Once I could name them, I could organise them. You can only sort what you understand.

HTTP caching was my jar of screws.

Thirty years of fog

I have been building for the web since the early nineties. Caching was always there, somewhere in the background, doing something. I knew enough to be aware of it, not enough to actually control it. Cache-Control headers, TTL values, edge behaviour, the difference between what a CDN caches and what a browser holds, what gets invalidated when and why. Every time I approached it seriously, I ran into a wall of context I did not quite have.
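With hindsight, the core of that browser-versus-CDN distinction fits in a single header. A minimal illustration, with arbitrary values rather than the ones I eventually settled on:

```ts
// Illustrative only: one Cache-Control header speaks to two different caches.
const response = new Response("<html>…</html>", {
  headers: {
    // max-age: how long a browser may reuse the response without revalidating.
    // s-maxage: how long a shared cache, such as a CDN edge, may keep serving it.
    "Cache-Control": "public, max-age=300, s-maxage=86400",
  },
});
```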

The documentation exists. The concepts are not secret. But caching is one of those domains where the gap between understanding the vocabulary and being able to apply it correctly is surprisingly wide. I would read, nod, implement something plausible, and move on with lingering doubt.

This year, working with Claude, that changed. The parallel is closer than it sounds. I had the pieces in front of me for years. What I was missing was someone to explain what I was looking at.

New instruments

We went through the whole thing together. What my Cloudflare Workers were actually doing. What headers were being sent and why. What a browser would cache versus what the edge would cache. Where the inconsistencies were. What a coherent strategy would look like for a site like mine: a moderate personal blog with a global readership, running on Ghost, served through Cloudflare.

It took an afternoon. Not because the subject got simpler, but because I had, for the first time, an instrument that could hold the full complexity with me. Ask a question, get an answer calibrated to my exact setup, follow a thread, revise, implement, check. The back-and-forth that used to require either a specialist or weeks of trial and error compressed into something manageable.

The result was a caching strategy I actually understand. Headers that mean what I intend. Edge behaviour that is consistent. Rules I can read back and explain.

The audience had already changed

The reason it finally felt urgent was not vanity metrics or pagespeed scores. It was a shift in who was actually reading.

Human visitors are still there. But a growing share of traffic to a site like mine now comes from crawlers: search indexers, AI training pipelines, retrieval systems that serve content to agents rather than browsers. These systems do not render pages. They do not wait for JavaScript. They send a request, receive a response, and move on. For them, caching is not a convenience. It is the primary mechanism that determines cost, latency, and reliability of access.

If you care about how your content moves through the world now, including through AI systems, you have to care about caching. Not as a performance optimisation for human browsers, but as infrastructure for machine readership.

That reframing changed what I was optimising for. HTML cached at the edge, globally, with consistent headers and predictable expiry. Not because I expect a person in Singapore to shave 200ms off their pageload, but because the next request for that page is more likely to come from a retrieval system than a browser, and the request after that, and the one after that.
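As a sketch of what that can look like, assuming a Cloudflare Worker sitting in front of the Ghost origin; the TTLs and options below are illustrative, not the exact rules this site runs:

```ts
// Illustrative Cloudflare Worker: cache HTML at the edge with predictable expiry.
// The specific TTL values and the decision to cache everything are assumptions.
export default {
  async fetch(request: Request): Promise<Response> {
    // Ask the Cloudflare edge to cache the origin response, including HTML,
    // which is not cached by default.
    const originResponse = await fetch(request, {
      cf: { cacheEverything: true, cacheTtl: 86400 },
    });

    // Re-issue the response so the headers say what is intended:
    // short-lived in the browser, long-lived at the edge.
    const response = new Response(originResponse.body, originResponse);
    response.headers.set("Cache-Control", "public, max-age=300, s-maxage=86400");
    return response;
  },
};
```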

Cloudflare tiered caching on top of Ghost blog.

The caching itself is not new. The concepts are decades old. What changed is that I could finally see the system clearly enough to shape it. With the right instrument, a domain that had been opaque for thirty years became workable in a single session.

That is not a small thing. There are other jars of screws on the workbench.

🗒️ The caching strategy described here was implemented using Cloudflare Workers, Cache Rules, and D1 for request logging. A public cache-stats dashboard shows the current breakdown of traffic by type: human, AI crawler, SEO crawler, and unknown.
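A sketch of the kind of Worker code that could feed such a dashboard, assuming a D1 binding named `DB`, a `requests` table, and a deliberately naive user-agent classifier; none of these names are taken from the actual implementation:

```ts
// Illustrative request logging for a cache-stats dashboard.
// Assumes a D1 binding named DB and a table created as:
//   CREATE TABLE requests (ts INTEGER, path TEXT, visitor_type TEXT);
export interface Env {
  DB: D1Database;
}

// Naive classification by user agent; real crawler detection is harder than this.
function classifyVisitor(userAgent: string): string {
  const ua = userAgent.toLowerCase();
  if (/gptbot|claudebot|ccbot|perplexitybot/.test(ua)) return "ai-crawler";
  if (/googlebot|bingbot|duckduckbot/.test(ua)) return "seo-crawler";
  if (/mozilla/.test(ua)) return "human";
  return "unknown";
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const visitorType = classifyVisitor(request.headers.get("user-agent") ?? "");
    const url = new URL(request.url);

    // Log in the background so the response is not delayed by the write.
    ctx.waitUntil(
      env.DB.prepare(
        "INSERT INTO requests (ts, path, visitor_type) VALUES (?, ?, ?)"
      )
        .bind(Date.now(), url.pathname, visitorType)
        .run()
    );

    // Pass the request through to the edge cache or origin unchanged.
    return fetch(request);
  },
};
```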

My Visitors Are Not All Human. That Is Fine.

I built a traffic dashboard for my own site. What I found wasn’t alarming, it was interesting. A publisher’s notes on bots, borrowed identities, and editorial agency.

When Bots Become Readers: Publishing in the Age of AI Crawlers

Listening to Matthew Prince on Azeem Azhar’s podcast made me reflect on who actually reads my blog. People (like you), machines, or both.

The End of Google Search (as we know it)

Google didn’t warn me. It just erased my blog. What looked like a bug turned out to be a glimpse into the future of search—and it’s not built for us anymore.
