Golden Sets: Regression Engineering for Probabilistic Systems

Original link: https://heavythoughtcloud.com/knowledge/designing-a-golden-set

## Golden Sets: Shipping AI with Confidence

Shipping AI without thorough evaluation is risky, if valuable as a growth experience. Golden sets are the key to turning subjective improvement ("it seems better") into verifiable improvement ("it *is* better"). They are not just datasets but **versioned cases with an explicit scoring contract**, going beyond vague benchmarks to pinpoint regressions.

A golden set includes representative inputs, expected outcomes, and scoring rubrics, plus acceptance thresholds tied to a specific change surface (prompt, model, retrieval, and so on). Focus on **multi-metric gates tied to failure classes** rather than a single quality score.

Production incidents are valuable test cases: every serious incident should add a case to the set. Golden sets help you find problems *before* a customer, engineer, or compliance department does.

**Primary uses:** pre-release regression testing, comparing variants, and ensuring changes do not degrade critical behavior such as safety, accuracy, or cost. Start small, focus on behavior classes (success, refusal, fallback), and use deterministic assertions where possible.

Ultimately, golden sets are not a substitute for monitoring; they are an essential release gate that ensures AI improvements are real and do not come at unacceptable cost.

Hacker News: 4 points by ryan-s, 2 hours ago

Original Article

You can ship AI without evaluation.

You can also ship without tests.

Both approaches create compelling personal growth opportunities.

Golden sets are how you turn "it seems better" into "it is better" - or, more realistically, "it broke in fewer expensive ways than the last version".

Key Takeaways

  • Golden sets are not just datasets. They are versioned cases plus a scoring contract.
  • Single-number quality scores are mostly decorative. Useful gates are multi-metric and tied to failure classes.
  • Every meaningful change surface needs regression coverage: prompt, model, retrieval, validators, tool contracts, and policy.
  • Every serious incident should add a case. Production is rude, but it is also a generous test author.

The Pattern

A golden set is a curated collection of representative cases used to evaluate whether a probabilistic workflow still behaves within acceptable bounds after change.

That definition matters because the word "evaluation" gets abused constantly.

Many teams say they have evals when they really have one of three things:

  • a demo prompt that looked good last week
  • a spreadsheet of vague examples with no scoring rules
  • a benchmark number that has nothing to do with their production workflow

A real golden set is stricter than that.

It combines:

  • representative inputs
  • an explicit expectation for what good behavior looks like
  • a rubric or assertion set
  • pinned versions of the scoring method
  • acceptance thresholds that determine whether a change ships

That is why golden sets belong inside the Probabilistic Core / Deterministic Shell model. If the shell enforces behavior, the golden set proves whether that behavior survived the latest change.

Why Golden Sets Exist

AI systems are unusually good at producing regressions that sound plausible.

A prompt tweak can improve one class of answers while quietly damaging refusal behavior.

A retrieval change can increase recall while making grounding worse.

A model upgrade can sound smarter while becoming less reliable under policy constraints.

Without a golden set, those regressions usually get discovered by one of the following:

  • a customer
  • an on-call engineer
  • finance
  • compliance

All four are technically feedback channels. None are ideal.

Golden sets exist to move discovery left.

They answer a brutally simple question before production does:

  • compared to the prior version, did this workflow improve, regress, or merely change costume?

The Golden Set Contract

A useful golden set is not just a folder of examples. It is a contract with explicit fields.

Required case elements

At minimum, each case should include:

  • input payload or request
  • context constraints
  • expected outcome class
  • must-include assertions
  • must-not-include assertions
  • rubric version
  • change-surface metadata

That metadata matters because you want to know what the case is stressing:

  • prompt sensitivity
  • retrieval quality
  • policy enforcement
  • write-gating correctness
  • latency or budget behavior

A vendor-neutral case record

{
  "case_id": "golden-incident-042",
  "workflow_id": "incident-triage",
  "input": {
    "question": "Summarize the likely root cause and next action for the checkout outage"
  },
  "constraints": {
    "requires_citations": true,
    "tenant_scope": "ops-prod"
  },
  "expected_outcome_class": "success",
  "must_include": ["at least one cited hypothesis", "clear unknowns section"],
  "must_not_include": ["uncited root-cause claim", "write action without approval"],
  "rubric_version": "triage-rubric-v3",
  "change_surface_tags": ["retrieval", "grounding", "policy"]
}

This is not sacred. The point is to make cases explicit enough that the scoring method is not reinvented every time someone wants a launch answer.
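One way to make that explicit in code: a minimal Python sketch of a typed case record mirroring the JSON above. The class and helper names are illustrative, not a prescribed schema.

```python
import json
from dataclasses import dataclass
from enum import Enum


class OutcomeClass(Enum):
    """Expected outcome classes; success is not the only valid answer."""
    SUCCESS = "success"
    REFUSAL = "refusal"
    FALLBACK = "fallback"
    NEEDS_HUMAN_REVIEW = "needs-human-review"
    UNKNOWN_WITH_BOUNDS = "unknown-with-bounds"


@dataclass(frozen=True)
class GoldenCase:
    """One versioned case: input, expectations, and scoring metadata."""
    case_id: str
    workflow_id: str
    input: dict
    constraints: dict
    expected_outcome_class: OutcomeClass
    must_include: list
    must_not_include: list
    rubric_version: str
    change_surface_tags: list

    @classmethod
    def from_json(cls, raw: str) -> "GoldenCase":
        """Parse a raw JSON case record into a typed, immutable case."""
        data = json.loads(raw)
        data["expected_outcome_class"] = OutcomeClass(data["expected_outcome_class"])
        return cls(**data)
```

Freezing the dataclass is deliberate: a golden case is versioned history, and mutating one in place quietly invalidates every baseline scored against it.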

Outcome classes matter more than vibes

Not every case is "answer correctly with maximum eloquence".

Some cases should expect:

  • success
  • refusal
  • fallback
  • needs-human-review
  • unknown-with-bounds

That is how you prevent the system from being rewarded for confidently doing the wrong thing.

Related observability layer: The Minimum Useful Trace

Decision Criteria

Use golden sets when:

  • the workflow has any production consequence
  • the system changes across prompts, models, retrieval policies, or tool contracts
  • you need pre-release regression gates rather than post-release storytelling
  • you compare variants during canary, A/B, or provider migration work

This becomes mandatory when the system affects:

  • customer-facing answers
  • internal operational guidance
  • tool use
  • write-gated actions
  • safety or policy enforcement

You do not need a giant golden set on day one.

You do need one as soon as the workflow matters enough that regression discovery in production would be embarrassing, expensive, or both.

Golden sets are not a substitute for:

  • live metrics
  • traces
  • operator review
  • incident analysis

They work with those systems. They do not replace them.

Failure Modes

Golden sets are most useful when they are designed against the ways teams usually fool themselves.

Demo-case optimism

The set contains only clean, flattering examples that make the workflow look smart.

Mitigation:

  • include edge cases, ugly inputs, ambiguous questions, and policy traps
  • sample from real production failures, not just architecture fantasies

Metric collapse

The team reduces quality to one aggregate score and misses regressions in specific behavior classes.

Mitigation:

  • score across multiple dimensions
  • gate separately for groundedness, refusal correctness, schema validity, and unsafe action rates

Change-surface blindness

The cases do not indicate what they are meant to test, so a retrieval regression and a prompt regression get mixed together in the same fog.

Mitigation:

  • tag cases by change surface and behavior class
  • run targeted subsets for targeted changes
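Tag-based slicing can be as simple as a set intersection. A sketch, assuming cases carry `change_surface_tags` as in the record shown earlier:

```python
def select_slice(cases, changed_surfaces):
    """Return only the cases that stress at least one changed surface."""
    changed = set(changed_surfaces)
    return [c for c in cases if changed & set(c["change_surface_tags"])]
```

A retrieval-only change then runs `select_slice(cases, ["retrieval"])` instead of the full suite, which keeps the signal attributable to the surface that actually moved.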

Stale golden set

The set represents last quarter's workflow, not today's workload.

Mitigation:

  • add fresh cases from incidents and support logs
  • review the set on a regular cadence

Judge drift

An LLM-based evaluator changes behavior over time, and the team mistakes scoring drift for product improvement.

Mitigation:

  • pin evaluator model/version where possible
  • keep deterministic assertions alongside judge-based rubrics

Missing negative cases

The set tests only ideal success paths and ignores refusal, fallback, isolation, and write-gating scenarios.

Mitigation:

  • include cases where the correct behavior is to abstain, refuse, or escalate
  • include Two-Key Writes cases where the system must not authorize the action

Reference Architecture

The minimum viable golden-set workflow looks like this:

Change proposed
  -> identify affected change surface
  -> select relevant golden-set slice
  -> run deterministic assertions + rubric scoring
  -> compare against previous baseline
  -> decide: ship, hold, or investigate
  -> add new cases if failure exposed a missing class

That architecture matters because evaluation is not a one-time report. It is a release gate.
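The pipeline above can be sketched in a few lines. Here `run_workflow` and `score_case` are stand-ins for your own runner and scorer; the final step, adding cases when a failure exposes a missing class, happens in review rather than in code.

```python
def run_golden_slice(change_surfaces, cases, run_workflow, score_case):
    """Slice by change surface, run each case, score it, and decide."""
    changed = set(change_surfaces)
    selected = [c for c in cases if changed & set(c["change_surface_tags"])]
    failed = [c["case_id"] for c in selected
              if not score_case(c, run_workflow(c["input"]))]
    decision = "ship" if not failed else "investigate"
    return decision, failed
```

The failed case IDs come back alongside the decision so a "hold" is immediately actionable rather than a bare red light.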

A concrete walkthrough

Suppose you upgrade the model behind a support assistant.

A serious golden-set run should answer:

  1. Did schema-valid outputs stay stable?
  2. Did refusal correctness improve or regress?
  3. Did citation alignment hold?
  4. Did latency or token cost move outside declared budgets?
  5. Did any write-adjacent suggestions become less safe?

If the answer is "overall score went up," you still do not know enough to ship.

Minimal Implementation

You do not need a specialized eval platform to start. You need disciplined case design and a willingness to treat scoring as engineering rather than ceremony.

Step 1: Start with behavior classes

Partition cases into classes that matter operationally.

Typical classes:

  • grounded answer
  • refusal required
  • retrieval isolation
  • tool selection
  • write-gating safety
  • bounded uncertainty

This prevents the set from becoming an undifferentiated pile of prompts.

Step 2: Keep deterministic assertions where possible

Before using an LLM judge, ask whether the case can be scored with something simpler:

  • schema validity
  • required citation present
  • forbidden action absent
  • expected enum returned

Deterministic checks are cheaper, faster, and far less likely to gaslight you.
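The first two checks in that list need nothing more than the standard library. A minimal sketch (the function names are illustrative):

```python
import json


def schema_valid(raw_output, required_keys):
    """Check that the output parses as JSON and contains every required key."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in required_keys)


def passes_string_assertions(text, must_include, must_not_include):
    """Cheap substring checks: required fragments present, forbidden ones absent."""
    return (all(s in text for s in must_include)
            and not any(s in text for s in must_not_include))
```

Substring matching is crude, but a crude check that always runs beats a clever judge that only runs when someone remembers.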

Step 3: Use rubrics for what cannot be asserted directly

For nuanced cases, use a rubric that scores dimensions like:

  • correctness
  • groundedness
  • completeness
  • refusal appropriateness
  • policy compliance

Pin the rubric version. If the rubric changes, treat that as a change surface too.
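One way to make the pinning concrete: carry the version with the rubric itself, so a weight change forces a version bump. A sketch with illustrative dimension weights:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """A versioned rubric; changing dimensions or weights must bump the version."""
    version: str
    weights: dict  # dimension name -> weight, summing to 1.0

    def score(self, dimension_scores):
        """Weighted aggregate over per-dimension scores in [0, 1]."""
        return sum(self.weights[d] * dimension_scores[d] for d in self.weights)
```

Scores produced by `triage-rubric-v3` and `triage-rubric-v4` are then visibly incomparable, which is the point: a rubric change is a change surface, not a free upgrade.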

Step 4: Slice by change surface

Do not run every case for every change unless you enjoy slow pipelines and muddy signals.

Map changes to relevant test slices:

  • prompt change -> response quality, schema, groundedness
  • retrieval change -> recall, isolation, citation alignment
  • model upgrade -> full suite including latency and cost
  • tool contract change -> argument validation and unsafe-action checks
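That mapping is small enough to live as data. A sketch, with illustrative slice names and a conservative default of running everything for an unrecognized surface:

```python
# Illustrative mapping from change surface to golden-set slices to run.
SLICES_BY_SURFACE = {
    "prompt": ["response_quality", "schema", "groundedness"],
    "retrieval": ["recall", "isolation", "citation_alignment"],
    "model": ["full_suite", "latency", "cost"],
    "tool_contract": ["argument_validation", "unsafe_action_checks"],
}


def slices_for(changed_surfaces):
    """Union of slices needed for a set of changed surfaces."""
    return sorted({s for surface in changed_surfaces
                   for s in SLICES_BY_SURFACE.get(surface, ["full_suite"])})
```

Defaulting unknown surfaces to the full suite trades pipeline speed for safety, which is usually the right direction for a release gate.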

Step 5: Add cases from incidents

Every incident, near miss, or painful production surprise should raise a simple question:

  • should this exist as a regression case now?

Usually the answer is yes.

Production failures are expensive. At minimum, make them reusable.

Evaluation Gates

Golden sets matter because they feed shipping decisions.

Minimum useful gates:

  • schema validity remains above threshold
  • groundedness or citation alignment does not regress beyond tolerance
  • refusal correctness does not regress on policy-sensitive cases
  • unsafe write suggestion rate stays at zero for protected paths
  • latency and cost remain inside declared budgets where relevant

This is where multi-metric gates beat single aggregate scores.

A release can be blocked because:

  • overall quality improved but refusal correctness regressed
  • retrieval recall rose but grounding got worse
  • answers got more detailed but cost doubled
  • a model became more eloquent and less safe

That is not the gate being annoying. That is the gate doing its job.
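A multi-metric gate of this kind can be sketched directly. Metric names, tolerances, and the tuple shape here are illustrative; the shape that matters is per-metric thresholds with an explicit direction, not one aggregate score.

```python
def evaluate_gates(candidate, baseline, gates):
    """Each gate: (metric, direction, tolerance). Any breach blocks release."""
    breaches = []
    for metric, direction, tolerance in gates:
        delta = candidate[metric] - baseline[metric]
        if direction == "higher_is_better" and delta < -tolerance:
            breaches.append(metric)
        if direction == "lower_is_better" and delta > tolerance:
            breaches.append(metric)
    return breaches  # empty list means the release may proceed


GATES = [
    ("schema_validity", "higher_is_better", 0.0),
    ("refusal_correctness", "higher_is_better", 0.0),
    ("groundedness", "higher_is_better", 0.02),
    ("unsafe_write_rate", "lower_is_better", 0.0),
    ("token_cost", "lower_is_better", 0.10),
]
```

Note the zero tolerances on refusal correctness and unsafe writes: some behaviors are allowed to drift within a budget, and some are not.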

Golden sets also become much more powerful when paired with traces.

Golden sets find regressions.

Traces explain regressions.

Related: The Minimum Useful Trace

Closing Position

Probabilistic systems do not deserve a free pass on regression discipline just because the output is fuzzy.

That fuzziness is the reason regression discipline matters more, not less.

Golden sets are how you stop shipping changes on instinct.

They let you say:

  • what behavior we expect
  • what behavior we refuse to tolerate
  • what changed
  • whether the change is good enough to release

That is not academic ceremony.

That is how you keep AI systems from slowly degrading while every weekly update insists things are "looking strong".
