Wide logging: Stripe's canonical log line pattern

Original link: https://blog.alcazarsec.com/tech/posts/wide-logging

## Canonical Logging: Summary

Traditional logging often scatters key information across many lines, which slows down diagnosis. Stripe's solution, now usually called "wide events" or "canonical log lines," is to emit a *single* structured record per unit of work (typically one request) with all the important fields attached: route, method, status, duration, user ID, build/deploy ID, feature flags, and, critically, a stable `error_slug` identifying the specific failure reason. Because the data is pre-joined, you can query complete requests directly instead of reconstructing them by hand.

The benefit goes beyond debugging: the same events power release-impact analysis, customer support investigations, and product insight. Beyond the basics, the highest-value fields include the route *template* (e.g. `/users/{user_id}`), release metadata (build ID), execution cost (DB time), and decision inputs (feature flags).

High-cardinality fields like user IDs are manageable with appropriate storage (e.g. ClickHouse or BigQuery); the core principle is to log *first* and aggregate later, preserving context for deeper correlation and for answering complex questions about system behavior. A stable schema and consistent emission, even on failure, are essential.


Original Article

Most logging is too narrow.

One line has the route. Another has the user. Another has the timeout. Another has the feature flag. Another has the deploy SHA.

Then an incident happens and you end up doing joins by hand.

Stripe’s answer is canonical log lines. The modern name is usually wide events. The pattern is simple: emit one structured record per unit of work with all the important fields already attached.

For a web service, that usually means one log event at the end of every request.

The Pattern

A canonical log line is the summary row for a request.

It should include the fields you always wish you had in one place:

  • route
  • method
  • status
  • duration
  • user or account ID
  • request ID and trace ID
  • build or deploy ID
  • feature flags
  • downstream timings
  • error code

In raw form it might look like this:

ts=2026-03-16T12:03:41Z service=api env=prod route=/v1/charges method=POST status=500 request_id=req_123
account_id=acct_456 build_id=9f2c1d7 feature_flag.payments_v2=true duration_ms=843 db_ms=792
db_queries=18 cache_hit=false error_slug=charge_db_timeout

This is useful because the log line is already pre-joined. You are not reconstructing a request from fragments. You are querying complete rows.
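A pre-joined row like the one above is easy to produce: build one flat record per request and serialize it once. A minimal sketch, assuming a hypothetical `to_logfmt` helper and a subset of the fields from the example line:

```python
def to_logfmt(event: dict) -> str:
    """Render a flat dict as a single key=value logfmt line."""
    parts = []
    for key, value in event.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)

# One complete row per request, not fragments spread over many lines.
event = {
    "service": "api",
    "route": "/v1/charges",
    "method": "POST",
    "status": 500,
    "duration_ms": 843,
    "cache_hit": False,
    "error_slug": "charge_db_timeout",
}
print(to_logfmt(event))
# service=api route=/v1/charges method=POST status=500 duration_ms=843 cache_hit=false error_slug=charge_db_timeout
```

In practice you would reach for a structured-logging library rather than hand-rolling the serializer, but the shape of the output is the point: one row, all fields.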

That sounds minor, but it changes what production debugging feels like.

Why It Works

Stripe did two important things.

First, they treated the canonical line as critical infrastructure. It is emitted after the request finishes, and their implementation is hardened so the line still appears when exceptions happen.

Second, they did not stop at debugging. Stripe pushed these records into warehousing systems and used them for longer-term analysis and product surfaces like the Developer Dashboard.

That is the part many teams miss.

A canonical log line is more than a nicer log. It is a request-shaped data model.

If the schema is stable, the same event can support:

  • incident response
  • release analysis
  • customer support investigations
  • product analytics

Amazon describes a similar idea in the Builders Library: emit one structured request log entry per unit of work, then derive metrics later. Log first. Aggregate later.
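"Log first, aggregate later" means metrics are a query over raw events, not something you must decide on up front. A toy sketch of deriving one such metric (error rate per build) from stored wide events, using illustrative field values:

```python
from collections import defaultdict

def error_rate_by_build(events):
    """Derive a metric (5xx error rate per build) from raw request events."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        totals[e["build_id"]] += 1
        if e["status"] >= 500:
            errors[e["build_id"]] += 1
    return {build: errors[build] / totals[build] for build in totals}

events = [
    {"build_id": "9f2c1d7", "status": 500},
    {"build_id": "9f2c1d7", "status": 200},
    {"build_id": "8a1b0c6", "status": 200},
]
print(error_rate_by_build(events))  # {'9f2c1d7': 0.5, '8a1b0c6': 0.0}
```

Because the events still carry every field, the same records can answer a different question tomorrow (error rate per feature flag, per plan tier) without any new instrumentation.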

What To Log

Most teams stop too early.

They log route, status, and latency. That is enough for a dashboard, but not enough for diagnosis.

The highest-value fields tend to be:

  • Route template: /teams/{team_id}/members/{user_id} is better than raw paths with IDs embedded in them.
  • Identity: user_id, account_id, API key ID, auth method.
  • Release metadata: Git SHA, build ID, deploy ring, region.
  • Execution cost: duration, DB time, query count, cache hit or miss, retry count.
  • Decision inputs: feature flags, experiment variant, plan tier, client version.
  • Outcome: status code, throttled yes or no, fallback path used, error slug.

Two fields are especially underrated.

The first is build_id. Metrics tell you that latency went up. build_id tells you which deploy owns the regression.

The second is an error_slug. Not just an exception class. A stable identifier for the exact failure site or failure reason.

That is the difference between “timeouts increased” and “the timeout came from the new write path behind feature_flag.double_write.”
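Attaching the slug at the failure site is what makes it stable. A sketch, where `write_charge_row` and the slug name are hypothetical:

```python
def write_charge_row(card):
    # Hypothetical DB write; in this sketch it always times out.
    raise TimeoutError("db write timed out")

def charge(card, canonical):
    """Tag a stable error_slug at the exact failure site, then re-raise."""
    try:
        write_charge_row(card)
    except TimeoutError:
        # The slug names the code path, not just the exception class.
        canonical["error_slug"] = "charge_db_timeout"
        raise
```

A generic top-level handler would only ever see `TimeoutError`; the slug recorded here tells you *which* timeout, from *which* write path.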

The Real Benefit

The real power of wide logging is not observability. It is correlation.

Once every request carries business context and execution context in the same row, you can ask much better questions:

  • Did the new build hurt only enterprise accounts?
  • Did the regression appear only on iOS 7.4.1?
  • Did variant B increase errors only in eu-west-1?
  • Did the slow requests all miss cache and hit the same downstream service?
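Each of those questions is just a filter over complete rows. A toy sketch of the query shape, with illustrative field values (a real system would run the equivalent in its warehouse or log store):

```python
def slice_events(events, **filters):
    """Filter wide events on any combination of fields."""
    return [e for e in events
            if all(e.get(k) == v for k, v in filters.items())]

events = [
    {"build_id": "9f2c1d7", "plan": "enterprise", "status": 500},
    {"build_id": "9f2c1d7", "plan": "free", "status": 200},
    {"build_id": "8a1b0c6", "plan": "enterprise", "status": 200},
]

# "Did the new build hurt only enterprise accounts?"
bad = slice_events(events, build_id="9f2c1d7", plan="enterprise", status=500)
print(len(bad))  # 1
```

The filters can combine business context (`plan`) with execution context (`build_id`, `status`) precisely because both live in the same row.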

Metrics are bad at this because they throw context away early.

Traditional logs are bad at this because the context is scattered.

Canonical log lines keep the context intact long enough to query it.

That is why the pattern keeps coming back under different names.

High Cardinality

This is where people get nervous.

user_id, request_id, build_id, and feature flags are high-cardinality fields. In many systems that is a warning sign.

The important distinction is where the cardinality lives.

High-cardinality values are often fine inside a wide event. They become expensive when you force them into the wrong indexing model.

That is why this pattern works best with systems designed for filtering and grouping over many dimensions. Stripe used Splunk and Redshift. Modern teams might use ClickHouse-backed tools, Honeycomb, BigQuery, or their own warehouse.

The storage choice is less important than the query shape. You want to slice rich rows, not pre-aggregate away the useful parts.

Common Mistakes

Only logging the happy path

The canonical event should be emitted in finally, ensure, or equivalent teardown logic. If it disappears on exceptions, it fails when you need it most.
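The teardown requirement is small in code but easy to get wrong. A minimal sketch, where `emit` stands in for whatever sink you log to:

```python
def handle_request(handler, emit):
    """Emit the canonical event in finally so it survives exceptions."""
    event = {"status": 200}
    try:
        handler(event)
    except Exception:
        event["status"] = 500
        raise
    finally:
        emit(event)  # runs on success AND on failure
```

If the emit call lived after the handler instead of in `finally`, the event would vanish on exactly the requests you most need to see.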

Logging raw paths instead of route templates

/users/123/orders/456 is terrible for grouping. /users/{user_id}/orders/{order_id} is what you want.
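Most web frameworks expose the matched route template directly, and logging that is the right fix. If yours does not, a hypothetical regex-based normalizer can approximate it:

```python
import re

# Hypothetical fallback: substitute known ID segments with placeholders.
# Prefer the framework's own matched-route template when available.
def route_template(path: str) -> str:
    path = re.sub(r"/users/\d+", "/users/{user_id}", path)
    path = re.sub(r"/orders/\d+", "/orders/{order_id}", path)
    return path

print(route_template("/users/123/orders/456"))
# /users/{user_id}/orders/{order_id}
```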

Logging exception classes but not error reasons

TimeoutError is often too broad. An error slug gives you a stable grouping key tied to a real code path.

Dumping raw input into the event

Amazon recommends sanitizing and truncating request details before logging. That is important here. A rich event becomes dangerous fast if you start packing it with tokens, secrets, or arbitrary payloads.

Letting the schema drift

Field names become muscle memory. If one service logs user_id, another logs uid, and a third logs account_user, cross-service queries get messy fast.

Implementation

The usual implementation is middleware.

Create a request-scoped object at the start of the request. Let middleware and business logic add fields as work happens. Emit one structured line at the end.

If you use OpenTelemetry, the root span can play this role. If not, JSON or logfmt is fine.

A good starting schema is:

  • service.name
  • env
  • request_id
  • trace_id
  • route
  • method
  • status
  • duration_ms
  • user_id or account_id
  • build_id
  • error_slug
  • sample_rate

Then add fields whenever a real production question is hard to answer.
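The middleware described above can be sketched in a few lines. `CanonicalLine`, `handle`, and the field values here are illustrative, not a library API:

```python
import json
import time

class CanonicalLine:
    """Request-scoped accumulator for one wide event per request."""
    def __init__(self, **base):
        self.fields = dict(base)

    def add(self, **fields):
        # Middleware and business logic enrich the line as work happens.
        self.fields.update(fields)

    def emit(self) -> str:
        # One structured line per request; JSON here, logfmt works too.
        return json.dumps(self.fields, sort_keys=True)

def handle(request_handler, sink):
    """Create the line at request start, emit exactly once in teardown."""
    line = CanonicalLine(service="api", env="prod")
    start = time.monotonic()
    try:
        line.add(status=request_handler(line))
    except Exception:
        line.add(status=500, error_slug="unhandled_exception")
        raise
    finally:
        line.add(duration_ms=int((time.monotonic() - start) * 1000))
        sink(line.emit())
```

With OpenTelemetry, the same role falls to attributes on the root span; the accumulate-then-emit shape is identical.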

Summary

Canonical logging is a simple idea that pays for itself quickly.

Emit one rich, trustworthy event per request. Make it stable. Make it complete. Make sure it still appears on failures.

Once you do that, logs stop being breadcrumbs and start being records.
