We architected an edge caching layer to eliminate cold starts

Original link: https://www.mintlify.com/blog/page-speed-improvements

Mintlify, which serves 72 million monthly page views of developer documentation, faced a performance problem: slow cold starts affecting nearly 25% of visitors. Their existing Next.js ISR cache struggled to keep up with a frequent deployment cadence (multiple times per day), and every update invalidated the entire cache. To solve this, they built a custom edge caching layer on Cloudflare products. A Cloudflare Worker proxies all traffic, determines the deployment configuration, and uses an edge cache with unique keys and a 15-day TTL. Crucially, they decoupled deployments from cache invalidation via "revalidation" (reactive, triggered by version mismatches after a deployment) and "prewarming" (proactive, triggered by content updates). Durable Objects manage revalidation locks to prevent conflicting updates, while Cloudflare Queues handle asynchronous cache warming to avoid overwhelming the origin. Proactive prewarming through the worker's admin API further keeps the cache fresh. This architecture raised their cache hit rate to 100%, eliminated cold starts, and improved both speed and reliability. Mintlify advocates shifting the focus from optimizing origin performance to aggressive caching and static site generation, letting the edge handle requests efficiently.

## Mintlify's Edge Caching Layer: Summary

Mintlify recently built a custom edge caching layer to eliminate cold starts and improve performance for its 72 million monthly page views. The Hacker News discussion surfaced a debate over whether this complexity is necessary.

Some argued that a simple hash-based cache key would suffice, but Mintlify chose a custom key to minimize changes to their existing Next.js application, which allows site-wide cache invalidation without extensive modifications to the origin. Several commenters pointed to simpler approaches, such as static site generation (SSG) and CDN features like stale-while-revalidate, that might avoid the need for a complex caching layer entirely. Mintlify explained that their approach was a pragmatic choice, balancing ideal architecture against existing constraints and engineering effort.

The conversation highlights a familiar tension between elegant simplicity and the realities of building on a specific framework (Next.js and Vercel) and existing infrastructure; many felt the complexity stems from the chosen stack rather than from the problem itself.

Original article

Mintlify powers documentation for tens of thousands of developer sites, serving 72 million monthly page views. Every pageload matters when millions of developers and AI agents depend on your platform for technical information.

We had a problem. Nearly one in four visitors experienced slow cold starts when accessing documentation pages. Our existing Next.js ISR caching solution could not keep up with deployment velocity that kept climbing as our engineering team grew.

We ship code updates multiple times per day, and each deployment invalidated the entire cache across all customer sites. This post walks through how we architected a custom edge caching layer to decouple deployments from cache invalidation, bringing our cache hit rate from 76% to effectively 100%.

We achieved our goal of fully eliminating cold starts and used a veritable smorgasbord of Cloudflare products to get there.

Cloudflare Architecture

| Component | Purpose |
| --- | --- |
| Workers | docs-proxy handles requests; revalidation-worker consumes the queue |
| KV | Store deployment configs, version IDs, connected domains |
| Durable Objects | Global singleton coordination for revalidation locks |
| Queues | Async message processing for cache warming |
| CDN Cache | Edge caching with custom cache keys via fetch with cf options |
| Zones/DNS | Route traffic to workers |

We could have built a similar system on any hyperscaler, but leaning on Cloudflare's CDN expertise, especially for configuring tiered cache, was a huge help.

It is important to understand the difference between two key terms that I use throughout the explanation of our solution below.

  • Revalidations are a reactive process triggered when we detect a version mismatch at request time (e.g., after we deploy new code)
  • Prewarming is a proactive process triggered when customers update their documentation content, before any user requests it

Both ultimately warm the cache by fetching pages, but they differ in when and why they're triggered. More on this in sections 2 through 4 below.
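As a rough mental model (not code from the post), the two triggers can be seen as two variants of the same "warm these pages" job, differing only in what fires them and when:

```ts
// Illustrative only: the two cache-warming triggers as a discriminated union.
type WarmTrigger =
  | { kind: "revalidation"; cause: "version-mismatch"; firedAt: "request-time" } // reactive
  | { kind: "prewarm"; cause: "content-update"; firedAt: "publish-time" };       // proactive
```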

1. The Proxy Layer

We placed a Cloudflare Worker in front of all traffic to Mintlify hosted sites. It proxies every request and contains business logic for both updating and using the associated cache. When a request comes in, the worker proceeds through the following steps.

  1. Determines the deployment configuration for the requested host
  2. Builds a unique cache key based on the path, deployment ID, and request type
  3. Leverages Cloudflare's edge cache with a 15-day TTL for successful responses

Our cache key structure is shown below. The cachePrefix roughly maps to a particular customer, deploymentId identifies which Vercel deployment to proxy to, path identifies the page to fetch, and contentType lets us store both html and rsc variants of every page.

`${cachePrefix}/${deploymentId}/${path}#${kind}:${contentType}`;

For example: acme/dpl_abc123/getting-started:html and acme/dpl_abc123/getting-started:rsc.
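To make the flow concrete, here is a minimal sketch of a docs-proxy-style Worker in TypeScript. It is not Mintlify's actual code: resolveDeployment, the HOST:{hostname} KV layout, and the origin URL handling are assumptions, and the key uses the simplified #variant form that the warming pseudocode in section 4 uses.

```ts
// Minimal sketch of the proxy flow; helper and binding names are assumptions.
type Env = { KV: KVNamespace };

interface DeploymentConfig {
  cachePrefix: string;   // roughly maps to a customer
  deploymentId: string;  // which Vercel deployment to proxy to
  originUrl: string;     // base URL of that deployment
}

// Assumed: per-host config stored in KV under a hypothetical HOST:{hostname} key.
async function resolveDeployment(hostname: string, env: Env): Promise<DeploymentConfig> {
  const config = await env.KV.get<DeploymentConfig>(`HOST:${hostname}`, "json");
  if (!config) throw new Error(`no deployment config for ${hostname}`);
  return config;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // 1. Determine the deployment configuration for the requested host.
    const { cachePrefix, deploymentId, originUrl } = await resolveDeployment(url.hostname, env);

    // 2. Build a unique cache key from path, deployment ID, and request type.
    const contentType = request.headers.get("RSC") === "1" ? "rsc" : "html";
    const cacheKey = `${cachePrefix}/${deploymentId}/${url.pathname}#${contentType}`;

    // 3. Proxy to the origin and let the edge cache hold successful responses for 15 days.
    return fetch(new Request(`${originUrl}${url.pathname}${url.search}`, request), {
      cf: {
        cacheEverything: true,
        cacheKey,
        cacheTtlByStatus: { "200-299": 60 * 60 * 24 * 15, "400-599": 0 },
      },
    });
  },
};
```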

2. Automatic Version Detection and Revalidation

The most innovative aspect of our solution is automatic version mismatch detection.

When we deploy a new version of our Next.js client to production, Vercel sends a deployment.succeeded webhook. Our backend receives this and writes the new deployment ID to Cloudflare's KV.

KV.put(`DEPLOY:${projectId}:id`, deploymentId);

Then, when user requests come through the docs-proxy worker, it extracts version information from the origin response headers and compares it against the expected version in KV.

const gotVersion = originResponse.headers.get('x-version');
const projectId = originResponse.headers.get('x-vercel-project-id');

const wantVersion = await KV.get(`DEPLOY:${projectId}:id`);

const shouldRevalidate = wantVersion !== gotVersion;

When a version mismatch is detected, the worker automatically triggers revalidation in the background using ctx.waitUntil(). The user gets the previously cached stale version immediately. Meanwhile, cache warming of the new version happens asynchronously in the background.
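In the worker, that pattern looks roughly like the sketch below; triggerRevalidation, cachedResponse, and the job fields are illustrative stand-ins for the call into the coordinator described in the next section.

```ts
// Sketch only: serve the stale cached page now, warm the new version in the background.
if (shouldRevalidate) {
  ctx.waitUntil(
    // Runs after the response is returned, so the user is never blocked on warming.
    triggerRevalidation({ cachePrefix, deploymentId: wantVersion, host: url.hostname })
  );
}
return cachedResponse; // the previously cached (stale) version
```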

We do not start serving the new version of pages until we have warmed all paths in the sitemap, because once a user loads the new version of any page, every subsequent navigation must fetch that same version. If you were on v2 and then randomly saw v1 designs while navigating to a new page, it would be jarring and worse than the pages loading slowly.

3. The Revalidation Coordinator

Our first concern when triggering revalidations for sites was that we were going to create a race condition where we had multiple updates in parallel for a given customer and start serving traffic for both new and old versions at the same time.

We decided to use Cloudflare's Durable Objects (DO) as a lock around the update process to prevent this. We execute the following steps during every attempted revalidation trigger.

  1. Check the DO storage for any inflight updates, ignore the trigger if there is one
  2. Write to the DO storage to track that we are starting an update and "lock"
  3. Queue a message containing the cachePrefix, deploymentId, and host info for the revalidation worker to process
  4. Wait for the revalidation worker to report completion, then "unlock" by deleting the DO state

We also added a failsafe where we automatically delete the DO's data and unlock in step 1 if it has been held for 30 minutes. We know from our analytics that no update should take that long and it is a safe timeout.
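A rough sketch of what such a coordinator could look like as a Durable Object follows; the class, method, and binding names are assumptions (the post does not show the implementation), and the transport between the proxy worker and the DO is omitted.

```ts
// Sketch of a revalidation-lock Durable Object; names and storage shape are assumptions.
type Env = { REVALIDATION_QUEUE: Queue };
type RevalidationJob = { cachePrefix: string; deploymentId: string; host: string };

export class RevalidationCoordinator {
  constructor(private state: DurableObjectState, private env: Env) {}

  // Called on every attempted revalidation trigger (steps 1-3).
  async tryStartRevalidation(job: RevalidationJob): Promise<void> {
    const inflight = await this.state.storage.get<{ startedAt: number }>("inflight");

    // Failsafe: a lock held longer than 30 minutes is treated as stuck and overwritten.
    const THIRTY_MINUTES = 30 * 60 * 1000;
    if (inflight && Date.now() - inflight.startedAt < THIRTY_MINUTES) {
      return; // an update is already in flight; drop this trigger
    }

    // "Lock" by recording the in-flight update, then hand the job to the queue.
    await this.state.storage.put("inflight", { startedAt: Date.now(), ...job });
    await this.env.REVALIDATION_QUEUE.send(job);
  }

  // Called by the revalidation worker once cache warming completes (step 4): "unlock".
  async completeRevalidation(): Promise<void> {
    await this.state.storage.delete("inflight");
  }
}
```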

4. Revalidation Worker

Cloudflare Queues make it easy to attach a worker that can consume and process messages, so we have a dedicated revalidation worker that handles both prewarming (proactive) and version revalidation (reactive). Using a queue to control the rate of cache warming requests was mission critical since without it, we'd cause a thundering herd that takes down our own databases.

Each queue message contains the full context for a deployment: cachePrefix, deploymentId, and either a list of paths or enough info to fetch them from our sitemap API. The worker then warms all pages for that deployment before reporting completion.

// Get paths from the message or fetch them from our sitemap API
const paths = message.paths ?? await fetchSitemap(cachePrefix);

// Process in batches of 6 (Cloudflare's concurrent connection limit)
for (const batch of chunks(paths, 6)) {
  await Promise.all(
    batch.flatMap((path) =>
      // Warm both the HTML and RSC variants of each page
      ["html", "rsc"].map((variant) => {
        const cacheKey = `${cachePrefix}/${deploymentId}/${path}#${variant}`;
        const headers = { "X-Cache-Key": cacheKey };
        if (variant === "rsc") headers["RSC"] = "1";
        return fetchWithRetry(originUrl, { headers });
      })
    )
  );
}

Once all paths are warmed, the worker reads the current doc version from the coordinator's DO storage to ensure we're not overwriting a newer version with an older one. If the version is still valid, it updates the DEPLOYMENT:{domain} key in KV for all connected domains and notifies the coordinator that cache warming is complete. The coordinator only unlocks after receiving this completion signal.
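The completion step might look roughly like this, assuming an RPC-style coordinator stub and a connectedDomains list pulled from KV; the exact shape of the version check is an assumption based on the description above.

```ts
// Sketch of the completion step; coordinator stub and variable names are assumptions.
const currentVersion = await coordinator.getDocVersion(cachePrefix);
if (currentVersion === deploymentId) {
  // Point every connected domain at the freshly warmed deployment...
  for (const domain of connectedDomains) {
    await env.KV.put(`DEPLOYMENT:${domain}`, deploymentId);
  }
  // ...then tell the coordinator warming is done so it can release the lock.
  await coordinator.completeRevalidation();
}
```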

5. Proactive Prewarming on Content Updates

Beyond reactive revalidation, we also proactively prewarm caches when customers update their documentation. After processing a docs update, our backend calls the Cloudflare Worker's admin API to trigger prewarming:

POST /admin/prewarm HTTP/1.1
Host: workerUrl
Content-Type: application/json

{
  "paths": ["/docs/intro", "/docs/quickstart", "..."],
  "cachePrefix": "acme/42",
  "deploymentId": "dpl_abc123",
  "isPrewarm": true
}

The admin endpoint accepts batch prewarm requests and queues them for processing. It also updates the doc version in the coordinator's DO to prevent older versions from overwriting newer cached content.
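A sketch of such an endpoint follows, under the assumption that the worker exposes a handler that writes the doc version into the coordinator DO and then enqueues the paths; binding names, the set-version route, and authentication are made up or omitted here.

```ts
// Sketch of the /admin/prewarm handler; binding names and the DO route are assumptions.
type Env = { KV: KVNamespace; COORDINATOR: DurableObjectNamespace; REVALIDATION_QUEUE: Queue };

async function handlePrewarm(request: Request, env: Env): Promise<Response> {
  const { paths, cachePrefix, deploymentId } = await request.json<{
    paths: string[];
    cachePrefix: string;
    deploymentId: string;
  }>();

  // Record the latest doc version in the coordinator so an older job cannot overwrite it...
  const id = env.COORDINATOR.idFromName(cachePrefix);
  await env.COORDINATOR.get(id).fetch("https://coordinator/set-version", {
    method: "POST",
    body: JSON.stringify({ deploymentId }),
  });

  // ...then queue the batch for the revalidation worker to warm.
  await env.REVALIDATION_QUEUE.send({ cachePrefix, deploymentId, paths, isPrewarm: true });
  return new Response("queued", { status: 202 });
}
```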

This two-pronged approach ensures caches stay warm through both:

  • reactive revalidation system triggered when our code deployments create version mismatches
  • proactive prewarming triggered when customers update their documentation content

We have successfully moved our cache hit rate to effectively 100%, based on monitoring logs from the Cloudflare proxy worker over the past 2 weeks. Our system handles both new codebase deployments and documentation content updates in the following ways.

For code changes affecting sites (revalidation)

  1. Vercel webhook notifies our backend of the new deployment
  2. Backend writes the new deployment ID to Cloudflare KV
  3. The first user request detects the version mismatch
  4. Revalidation triggers in the background
  5. The coordinator ensures only one cache warming operation runs globally
  6. All pages get cached at the edge with the new version

For customer docs updates (prewarming)

  1. Update workflow completes processing
  2. Backend proactively triggers prewarming via admin API
  3. All pages are warmed before users even request them

Our system is also self-healing. If a revalidation fails, the next request will trigger it again. If a lock gets stuck, alarms clean it up automatically after 30 minutes. And because we cache at the edge with a 15-day TTL, even if the origin goes down, users still get fast responses from the cache. Improving reliability as well as speed!

If you're running a dynamic site and chasing P99 latency at the origin, consider whether that's actually the right battle. We spent weeks trying to optimize ours (RSCs, multiple databases, signed S3 URLs) and the system was too complicated to debug meaningfully.

The breakthrough came when we stopped trying to make dynamic requests faster and instead made them not happen at all. Push your dynamic site towards being static wherever possible. Cache aggressively, prewarm proactively, and let the edge do what it's good at.
