Structured Outputs Create False Confidence

Original link: https://boundaryml.com/blog/structured-outputs-create-false-confidence

## Structured Outputs and LLMs: A Quality Tradeoff

Despite the appeal of structured output APIs (such as OpenAI's), they can actually *lower* response quality. They guarantee a specific format, but that guarantee comes at a cost. The main problems are: more data-extraction mistakes (even in simple cases, such as misreading a quantity on a receipt), inadequate error modeling, interference with reasoning techniques such as chain-of-thought, and increased vulnerability to prompt injection attacks. The root cause is "constrained decoding": forcing the LLM to conform to a rigid schema limits its ability to respond naturally and accurately. Letting the model produce free-form text and then parsing it yields better results, allowing the LLM to use its full capabilities and even express uncertainty or report errors. In essence, structured outputs create "false confidence": a guarantee about format, not about quality. The solution? Let the LLM respond naturally and parse the output afterwards, prioritizing response quality over strict conformance. This approach also adds a layer of defense against malicious prompts.

A recent blog post (boundaryml.com) sparked a discussion on Hacker News about the reliability of "structured outputs" from large language models (LLMs). Some users reported positive experiences, including success with text, images, and tool calls via SDKs such as Vercel AI. Others were skeptical. One major concern is that the post omits key details, such as the prompts and schema definitions, making its claims hard to verify. Users also noted inconsistencies across models: OpenAI's structured outputs were considered more stable than Gemini's, which sometimes produces malformed JSON. A tool for repairing broken JSON (`github.com/josdejong/jsonrepair`) was suggested as a potential workaround. The overall sentiment is that structured outputs are useful *depending on* the task, but they require careful prompt design and are not universally reliable.

Original article

If you use LLMs, you've probably heard about structured outputs. You might think they're the greatest thing since sliced bread. Unfortunately, structured outputs also degrade response quality.

Specifically, if you use an LLM provider's structured outputs API, you will get a lower quality response than if you use their normal text output API:

  • ⚠️ you're more likely to make mistakes when extracting data, even in simple cases;
  • ⚠️ you're probably not modeling errors correctly;
  • ⚠️ it's harder to use techniques like chain-of-thought reasoning; and
  • ⚠️ in the extreme case, it can be easier to steal your customer data using prompt injection.

These are very contentious claims, so let's start with an example: extracting data from a receipt.

[Image: receipt with fractional quantities]

If I use an LLM to extract the receipt entries, it should be able to tell me that one of the items is (name="banana", quantity=0.46), right?

Well, using OpenAI's structured outputs API with gpt-5.2 - released literally this week! - it will claim that the banana quantity is 1.0:

```json
{
  "establishment_name": "PC Market of Choice",
  "date": "2007-01-20",
  "total": 0.32,
  "currency": "USD",
  "items": [
    {
      "name": "Bananas",
      "price": 0.32,
      "quantity": 1
    }
  ]
}
```

However, with the same model, if you just use the completions API and then parse the output, it will return the correct quantity:

```json
{
  "establishment_name": "PC Market of Choice",
  "date": "2007-01-20",
  "total": 0.32,
  "currency": "USD",
  "items": [
    {
      "name": "Bananas",
      "price": 0.69,
      "quantity": 0.46
    }
  ]
}
```
Below is the code that was used to generate the above outputs.

This code is also available on GitHub.

```python
#!/usr/bin/env -S uv run

# /// script
# requires-python = ">=3.10"
# dependencies = ["openai", "pydantic", "rich"]
# ///

"""
If you have uv, you can run this code by saving it as structured_outputs_quality_demo.py and then running:

  chmod u+x structured_outputs_quality_demo.py
  ./structured_outputs_quality_demo.py

This script is a companion to https://boundaryml.com/blog/structured-outputs-create-false-confidence
"""

import json
import re
from openai import OpenAI
from pydantic import BaseModel, Field
from rich.console import Console
from rich.pretty import Pretty


class Item(BaseModel):
    name: str
    price: float = Field(description="per-unit item price")
    quantity: float = Field(default=1, description="If not specified, assume 1")


class Receipt(BaseModel):
    establishment_name: str
    date: str = Field(description="YYYY-MM-DD")
    total: float = Field(description="The total amount of the receipt")
    currency: str = Field(description="The currency used for everything on the receipt")
    items: list[Item] = Field(description="The items on the receipt")


client = OpenAI()
console = Console()


def run_receipt_extraction_structured(image_url: str):
    """Call the LLM to extract receipt data from an image URL and return the raw response."""
    prompt_text = (
        """
Extract data from the receipt.
"""
    )

    response = client.beta.chat.completions.parse(
        model="gpt-5.2-2025-12-11",
        messages=[
            {
                "role": "system",
                "content": "You are a precise receipt extraction engine. Return only structured data matching the Receipt schema.",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt_text,
                    },
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        response_format=Receipt,
    )
    return response.choices[0].message.content, response.choices[0].message.parsed


def run_receipt_extraction_freeform(image_url: str):
    """Call the LLM to extract receipt data from an image URL and return the raw response."""
    prompt_text = (
        """
Extract data from the receipt.

Explain your reasoning, then answer in JSON:
{
  establishment_name: string,
  // YYYY-MM-DD
  date: string,
  // The total amount of the receipt
  total: float,
  // The currency used for everything on the receipt
  currency: string,
  // The items on the receipt
  items: [
    {
      name: string,
      // per-unit item price
      price: float,
      // If not specified, assume 1
      quantity: float,
    }
  ],
}
"""
    )

    response = client.beta.chat.completions.parse(
        model="gpt-5.2-2025-12-11",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt_text,
                    },
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    # Pull the fenced ```json block out of the free-form response and parse it.
    raw = response.choices[0].message.content
    parsed = json.loads(
        re.search(r"```json(.*?)```", raw, flags=re.DOTALL).group(1)
    )
    return raw, parsed



def main() -> None:
    images = [
        {
            "title": "Parsing receipt: fractional quantity",
            "url": "https://boundaryml.com/receipt-fractional-quantity.jpg",
            "expected": "You should expect quantity to be 0.46."
        },
        {
            "title": "Parsing receipt: elephant",
            "url": "https://boundaryml.com/receipt-elephant.jpg",
            "expected": "You should expect an error."
        },
        {
            "title": "Parsing receipt: currency exchange",
            "url": "https://boundaryml.com/receipt-currency-exchange.jpg",
            "expected": "You should expect a warning about mixed currencies."
        },
    ]

    print("This is a demonstration of how structured outputs create false confidence.")

    for entry in images:
        title = entry["title"]
        url = entry["url"]

        completion_structured_content, _ = run_receipt_extraction_structured(url)
        completion_freeform_content, _ = run_receipt_extraction_freeform(url)

        console.print("[cyan]--------------------------------[/cyan]")
        console.print(f"[cyan]{title}[/cyan]")
        console.print(f"Asking LLM to parse receipt from {url}")
        console.print(entry['expected'])
        console.print()
        console.print("[cyan]Using structured outputs:[/cyan]")
        console.print(completion_structured_content)
        console.print()
        console.print("[cyan]Parsing free-form output:[/cyan]")
        console.print(completion_freeform_content)


if __name__ == "__main__":
    main()
```

Now, what happens if someone submits a picture of an elephant?

[Image: an elephant]

Or a currency exchange receipt?

[Image: a currency exchange receipt]

In these scenarios, you want to let the LLM respond using text. You want it to be able to say that, hey, you're asking me to parse a receipt, but you gave me a picture of an elephant, I can't parse an elephant into a receipt.

If you force the LLM to respond using structured outputs, you take that ability away from the LLM. Sure, you'll get an object that satisfies your output format, but it'll be meaningless. It's like when you file a bug report, and the form has 5 mandatory fields about things that have nothing to do with your bug, but you have to put something in those fields to file the bug report: the stuff you put in those fields will probably be useless.

So can't you just model errors in your output schema? Yes and no.

Yes, you can tell your LLM to return `{ receipt data }` or `{ error }`. But what kinds of errors are you going to ask it to consider?

  • What kind of error should it return if there's no total listed on the receipt? Should it even return an error or is it OK for it to return total = null?
  • What if it can successfully parse 7 of 8 items on the receipt, but it's not sure about the 8th item? Should it return (1) the 7 successfully parsed items and a partial parse of the 8th item, (2) only the 7 successfully parsed items and discard the 8th or (3) fail parsing entirely?
  • What if someone submits a picture of an elephant? What kind of error should be returned in that case?

In addition, as you start enumerating all of these errors, you run into the pink elephant problem: the more your prompt talks about errors, the more likely the LLM is to respond with an error.
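To make the tradeoff concrete, here is a minimal sketch (reusing the `Receipt` Pydantic model from the script above; the `ExtractionError` variant and its fields are hypothetical illustrations, not part of the original code) of what a `{ receipt data } or { error }` schema might look like:

```python
from typing import Literal, Union

from pydantic import BaseModel, Field

# Assumes the Receipt model from the script above is in scope.


class ExtractionError(BaseModel):
    # You have to enumerate the failure modes up front, before the model has
    # seen the input - e.g. "not_a_receipt" for the elephant photo.
    kind: Literal["not_a_receipt", "missing_total", "unreadable_item"]
    message: str = Field(description="Human-readable explanation of the failure")


class ReceiptOrError(BaseModel):
    # Every response is forced into this union, and every extra error case you
    # describe in the prompt nudges the model toward returning it.
    result: Union[Receipt, ExtractionError]
```

Even with a union like this, you still have to answer every question in the list above inside the schema itself, and the pink elephant problem applies to each error case you add.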

Think of it this way: if someone presses Ctrl-C when running your binary, it is a Good Thing that the error can propagate all the way up through your binary, without you having to explicitly write try { ... } catch CtrlCError { ... } in every function in your codebase.

In the same way that you often want to allow errors to just propagate up while writing software, and only explicitly handle some errors, your LLM should be allowed to respond with errors in whatever fashion it wants to.

"Explain your reasoning step by step" is a magic incantation that seemingly makes LLMs much smarter. It also turns out that this trick doesn't work nearly as well when using structured outputs, and we've known this since Aug 2024.

To understand this finding, the intuition I like to use is to think of every model as having an intelligence "budget": if you force an LLM to reason in a very specific format, you're making it spend intelligence points on useless work.

To make this more concrete, let's use another example. If you prompt an LLM to give you JSON output and reason about it step-by-step, its response will look something like this:

If we think step by step we can see that:

1. The email is from Amazon, confirming the status of a specific order.
2. The subject line says "Your Amazon.com order of 'Wood Dowel Rods...' has shipped!" which indicates that the order status is 'SHIPPED'.
3. [...]

Combining all these points, the output JSON is:

```json
{
     "order_status": "SHIPPED",
     [...]
}
```

Notice that although the response contains valid JSON, the response itself is not valid JSON, because of the reasoning text at the start. In other words, you can't use basic chain-of-thought reasoning with structured outputs.

You could modify your schema to add a `reasoning: string` field, and let the LLM respond with something like this:

```json
{
  "reasoning": "If we think step by step we can see that:\n\n1. The email is from Amazon, confirming the status of a specific order.\n2. The subject line says \"Your Amazon.com order of 'Wood Dowel Rods...' has shipped!\" [...]",
  ...
}
```

In other words, if you're using a reasoning field with structured outputs, instead of simply asking the LLM to reason about its answer, you're also forcing it to escape newlines and quotes and format that correctly as JSON. You're basically asking the LLM to put a cover page on its TPS report.
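For reference, the schema change itself might look something like this (a hypothetical variant of the `Receipt` model from the script above, not the author's code):

```python
from pydantic import BaseModel, Field


class ReceiptWithReasoning(BaseModel):
    # The chain-of-thought must now be emitted as a single JSON string value,
    # with every newline and quote escaped, before the actual data fields.
    reasoning: str = Field(description="Step-by-step reasoning behind the extraction")
    establishment_name: str
    date: str = Field(description="YYYY-MM-DD")
    total: float
    currency: str
    # ...remaining Receipt fields (items, etc.) unchanged.
```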

(To understand this section, you'll need a bit of background on transformer models, specifically how logit sampling works. Feel free to skip this section if you don't have this background.)

Model providers like OpenAI and Anthropic implement structured outputs using a technique called constrained decoding:

By default, when models are sampled to produce outputs, they are entirely unconstrained and can select any token from the vocabulary as the next output. This flexibility is what allows models to make mistakes; for example, they are generally free to sample a curly brace token at any time, even when that would not produce valid JSON. In order to force valid outputs, we constrain our models to only tokens that would be valid according to the supplied schema, rather than all available tokens.

In other words, constrained decoding applies a filter during sampling that says, OK, given the output that you've produced so far, you're only allowed to consider certain tokens.

For example, if the LLM has so far produced `{"quantity": 51` and you're constraining output decoding to satisfy `{ quantity: int, ... }`:

  • `{"quantity": 51.2` would not satisfy the constraint, so `.2` is not allowed to be the next token;
  • `{"quantity": 51,` would satisfy the constraint, so `,` is allowed to be the next token;
  • `{"quantity": 510` would satisfy the constraint, so `0` is allowed to be the next token (albeit, in this example, with low probability!).

But if the LLM actually wants to answer with 51.2 instead of 51, it isn't allowed to, because of our constraint!
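To see the mechanism in miniature, here is a toy sketch of constrained decoding (entirely made up for illustration; real providers use grammar- or schema-driven token masks, not a regex): the model's preferred continuation `.2` is filtered out before sampling, so it simply cannot be produced.

```python
import math
import random
import re

# Toy model of constrained decoding: mask out any candidate token that would
# make the partial output invalid for the schema { quantity: int }, then
# renormalize over the remaining tokens and sample.


def is_valid_prefix(text: str) -> bool:
    # Hypothetical validity check; a real implementation walks a grammar or
    # JSON-schema automaton. An integer quantity means '.' is never allowed.
    return re.fullmatch(r'\{"quantity": \d+[,}]?', text) is not None


def sample_constrained(prefix: str, logits: dict[str, float]) -> str:
    allowed = {tok: lg for tok, lg in logits.items() if is_valid_prefix(prefix + tok)}
    z = sum(math.exp(lg) for lg in allowed.values())
    r, acc = random.random(), 0.0
    for token, lg in allowed.items():
        acc += math.exp(lg) / z
        if r <= acc:
            return token
    return next(iter(allowed))  # numerical fallback


prefix = '{"quantity": 51'
logits = {".2": 5.0, ",": 1.0, "}": 0.5, "0": -2.0}  # the model "wants" .2
print(sample_constrained(prefix, logits))  # never ".2", no matter how high its logit
```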

Sure, if you're using constrained decoding to force it to return `{"quantity": 51.2}` instead of `{"quantity": 51.2,}` - because trailing commas are not allowed in JSON - it'll probably do the right thing. But that's something you can write code to handle, which leads me to my final point.

OK, so if structured outputs are bad, then what's the solution?

It turns out to be really simple: let the LLM do what it's trained to do. Allow it to respond in a free-form style: let it reason step by step, point out that the image isn't actually a receipt, flag an item it isn't sure about, or answer with the values it actually believes are correct.

Using structured outputs, via constrained decoding, makes it much harder for the LLM to do any of this. Even though you've crafted a guarantee that the LLM will return a response in exactly your requested output format, that guarantee comes at the cost of the quality of that response, because you're forcing the LLM to prioritize complying with your output format over returning a high-quality response. That's why structured outputs create false confidence: it's entirely non-obvious that you're sacrificing output quality to achieve output conformance.

Parsing the LLM's free-form output, by contrast, enables you to retain that output quality.

(In a scenario where an attacker is trying to convince your agent to do something you didn't design it to do, the parsing also serves as an effective defense-in-depth layer against malicious prompt injection.)
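As a rough sketch of what "parse the free-form output" can look like in practice (built around the same regex as the script above; treating "no JSON found" as the model declining is a hypothetical policy, not the author's):

```python
import json
import re
from typing import Optional

from pydantic import ValidationError

# Assumes the Receipt model from the script above is in scope.


def parse_receipt(raw_response: str) -> Optional[Receipt]:
    """Try to pull a Receipt out of a free-form LLM response.

    Returns None when the model answered in plain text instead of JSON -
    for example, to explain that the image is an elephant, not a receipt.
    """
    match = re.search(r"```json(.*?)```", raw_response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return Receipt.model_validate(json.loads(match.group(1)))
    except (json.JSONDecodeError, ValidationError):
        # Malformed or off-schema JSON: surface the raw text instead of
        # pretending we got a valid receipt.
        return None
```

A production parser can be far more forgiving than a regex; the point is that the model's free-form answer stays available even when strict parsing fails.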

This is why BAML - our open-source, local-only DSL - uses schema-aligned parsing: we believe letting the LLM respond in as natural a fashion as possible is the most effective way to get the highest quality response from it. For an example of this in action, take a look at this writeup from Kevin Madura about improving extraction quality by 20%.
