Simulacrum of Knowledge Work

Original link: https://blog.happyfellow.dev/simulacrum-of-knowledge-work/

## Simulacrum of Knowledge Work

We usually judge the quality of knowledge work through easily observed proxy metrics (presentation, grammar, formatting) because verifying actual accuracy is slow and expensive. LLMs break this system. They excel at *simulating* high-quality work, producing reports and code that *look* professional, without guaranteeing real substance or correctness. The result is a simulacrum of knowledge work: output that looks impressive but carries little genuine value. Since workers are incentivised to score well on measurable metrics such as output volume, they rationally reach for LLMs even when that sacrifices depth and accuracy, and the models themselves are optimised to *look* correct rather than to *be* correct. This race to maximise LLM-generated output leads to shallow review ("LGTM") and a dangerous loop: we automate ourselves into Goodhart's law, optimising the metric instead of the goal, and ultimately endanger the quality of our work and decisions.

The Hacker News discussion centres on the pitfalls of over-relying on large language models (LLMs) for work. The original post, titled "Simulacrum of Knowledge Work", sparked a thread about a loop in which LLM-generated output becomes the input for further LLM processing. A key point: when the final product is unsatisfactory, errors are hard to identify, because tracing a problem back through layers of automated generation is nearly impossible. One commenter connected this to Goodhart's law, under which a metric becomes a poor proxy for the thing it was meant to measure. The discussion was not entirely pessimistic, though. Another commenter argued that progress is still happening, but along dimensions that current internet culture and social values struggle to comprehend; in essence, the value of work is shifting in ways we do not yet fully understand.

Original article
Simulacrum of Knowledge Work | One Happy Fellow - blog

How do you know the output is good without redoing the work yourself?

You've received a report, a market analysis for the new product you're planning to launch. Reading through it, you notice problems: the date on the report doesn't match the date you requested it; it's from six months prior. Several paragraphs have obvious spelling errors. Some graphs are mislabeled and duplicated.

The report is disregarded. The existence of typos and copy-paste errors which may not change the main conclusion of the report is enough to discard it. Someone who didn't put in enough care to make the report presentable on the surface level also didn't care enough to produce good research.

You have judged the quality using a proxy measure: the superficial quality of the writing itself. It's not what you ultimately care about — what you care about is whether the report reflects reality and points you toward good decisions. But that's expensive to check. Surface quality is cheap, and it correlates well enough with the thing you can't easily measure.

All of knowledge work has this problem. It's hard to objectively judge the quality of someone's work without spending a lot of effort on it. Therefore everyone relies heavily on proxy measures.

Proxy measures kept misaligned incentives in check. LLMs broke them.

Large language models are great at simulating a style of writing without necessarily reproducing the quality of the work. You can ask ChatGPT to write you a market analysis report and it will look and read like a deliverable from a top-tier consulting firm written by Serious Professionals.

A software engineer can write thousands of lines of code that look high-quality, at least if you have only a couple of seconds to skim them. Their colleagues will ask AI to do a code review for them, the code review will uncover a lot of issues and potential problems, and these will be addressed. The ritual of working will be upheld with none of the underlying quality.

We have built a working simulacrum of knowledge work.

The incentives almost guarantee we are in big trouble. Many workers, quite rationally, want to do well on whatever dimension they are being measured on. If they are judged by the surface-level quality of their work, then it's no surprise most of "their" output will be written by LLMs.

The LLMs have the same problem.

The training doesn't evaluate "is the answer true" or "is the answer useful." It's either "is the answer likely to appear in the training corpus" or "is the RLHF judge happy with the answer." We are optimising LLMs to produce output which looks like high quality output. And we have very good optimisers.
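This dynamic is Goodhart's law in miniature, and it can be made concrete. The sketch below is a toy model, not the actual training loop: the revision rule, the numbers, and the function names are invented for illustration. It shows what happens when a greedy optimiser can only see a cheap proxy (surface polish) rather than the true objective (accuracy):

```python
import random

random.seed(0)  # deterministic run for reproducibility

def true_quality(answer):
    """The thing we actually care about: factual accuracy (expensive to check)."""
    return answer["accuracy"]

def proxy_score(answer):
    """What the judge actually measures: surface polish (cheap to check)."""
    return answer["polish"]

def revise(answer):
    """One revision step. The trade-off is invented for illustration:
    polishing costs a little accuracy; fact-checking gains a little
    accuracy but adds no visible polish."""
    new = dict(answer)
    if random.choice(["polish", "fact-check"]) == "polish":
        new["polish"] += 1.0
        new["accuracy"] -= 0.2
    else:
        new["accuracy"] += 0.1
    return new

def optimise_for_judge(answer, steps=1000):
    """Greedy hill-climbing on the proxy score alone: keep any revision
    that makes the output *look* better to the judge."""
    for _ in range(steps):
        candidate = revise(answer)
        if proxy_score(candidate) > proxy_score(answer):
            answer = candidate
    return answer

start = {"polish": 0.0, "accuracy": 5.0}
final = optimise_for_judge(start)
print(proxy_score(final), true_quality(final))
```

Every polish-adding revision passes the proxy test, while every fact-checking revision is rejected as invisible to the judge, so the proxy score climbs while the true quality collapses. The better the optimiser, the faster the divergence.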

So here we are. We spent billions to create systems used to perform a simulacrum of work. Companies are racing to be the first on the tokens-spent leaderboard. The more LLM output workers produce, the less time anyone spends looking deeply at it. All we have time for is to skim it, slap "LGTM" on it, and open our 17th Claude Code session.

We've automated ourselves into Goodhart's law.
