The consumption of AI-generated content at scale

Original link: https://www.sh-reya.com/blog/consumption-ai-scale/

## The AI Echo Chamber: Losing Our Ability to Think Critically

As AI-generated content floods the internet, a growing frustration is emerging: the feeling that everything sounds... the same. This isn't about inaccuracy, but about a homogenization of style and structure that leaves consumers feeling trapped in a loop of predictable text and code. The article explores two distinct ways AI is subtly eroding our ability to process information: **signal degradation** (overusing communication devices diminishes their impact) and **verification erosion** (generating content is cheap, but verifying its accuracy remains expensive).

A constant stream of AI-optimized content makes it harder to discern quality and spot errors. We are getting worse at thinking critically, which can lead to manipulation, poor decisions, and an overall decline in "taste", our ability to recognize and value genuine quality.

The author proposes two potential remedies: building AI systems that understand *why* certain techniques work rather than simply applying them, and grounding AI confidence in verified human experience through "hypothetical grounding spaces", databases that record human judgment. Ultimately, preserving our ability to understand, verify, and critically evaluate information is essential if we want to avoid a future in which we cannot tell truth from plausible fabrication.

## AI-Generated Content and a Cognitive Shift

A recent article (sh-reya.com) sparked a Hacker News discussion about the growing prevalence of AI-generated content and its detectability. Many commenters suspected the article itself was written with significant AI assistance, citing telltale signs such as generic subheadings and a lack of depth.

The conversation quickly broadened. Users noted that the stylistic tics of large language models (LLMs), including the overuse of em-dashes, are making people increasingly suspicious of *any* well-formatted text, even content written by humans. Some worried this will end up suppressing personal writing styles.

Beyond writing, the discussion touched on AI's impact on trust and evaluation. One key point was the dangerous tendency to accept AI output blindly without critical inspection, likened to pulling a slot machine's lever in hopes of a better result, in contrast with the strict scrutiny applied to human error. The debate also extended to self-driving-car AI, highlighting a double standard in how we evaluate AI failures. Ultimately, many felt that AI is driving a flood of mediocre, meaningless content, reflecting a shift in writing toward uniformity.

Original article

Introduction

A few days ago, I read this tweet: “I might be going insane because half of what I read now sounds like ChatGPT.” In a follow-up, the author screenshotted a blog post and said “Like come on.” He was pattern-matching on AI and couldn’t turn it off.

I’ve felt this too. There’s a frustration I can’t quite shake when consuming content now—a feeling of being stuck in a loop where everything sounds the same. I’ll read a blog post, skim some code, review a document, and something feels off. Not necessarily wrong, but homogeneous. Like I’ve seen this exact structure, these exact words, a thousand times before. And maybe I have.

In this post, I want to explore what I’ve been feeling as a consumer of information in the AI era, and then reflect on it as a researcher. I’ll start by describing two distinct ways my ability to process information has eroded: first, a signal degradation problem where AI has “cried wolf” so often that I’ve stopped noticing the devices meant to help me understand; and second, a verification problem where the ease of generating plausible content has outpaced—and eroded—my ability to verify it. Then, I’ll discuss why this matters—what’s at stake when consumers can’t notice errors or distinguish quality. Finally, I’ll share what I’m currently thinking about how to deal with it. I’ll sketch two threads of thought: one on building systems that capture the why behind the techniques they use, and one on grounding AI confidence in verified human experience rather than simply returning confident-sounding text.

Signal Degradation: The AI That Cried Wolf

AI has overused the tools designed to aid human comprehension to the point where I’ve stopped noticing them.

In complex domains, we’ve developed tools to help humans process information. In writing, for example, a metaphor compresses a complex idea into something the reader already understands. The metaphor is useful precisely because the underlying idea is hard to communicate directly: describing a database index as “like a book’s table of contents” helps a newcomer grasp the concept faster than a technical definition would. In code, exception handling exists primarily to catch errors at runtime, but a big part of why it’s structured the way it is (e.g., different exception types, specific catch blocks, hierarchies of errors) is for communication. It tells the reader what kinds of things can go wrong here, and how serious they are.
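
To make the code half of that concrete, here is a minimal sketch. The payment domain, the exception names, and the handlers are all invented for illustration; the point is only that the hierarchy and the specific catch blocks tell the reader what can go wrong and how serious it is:

```python
# Minimal sketch (invented domain and names): the exception hierarchy and the
# specific catch blocks communicate what can go wrong here and how serious it is.

class PaymentError(Exception):
    """Base class: something went wrong while charging the customer."""

class CardDeclined(PaymentError):
    """Expected and recoverable: ask the user for another card."""

class GatewayTimeout(PaymentError):
    """Transient: safe to retry."""

def charge(submit):
    """`submit` stands in for a real payment call."""
    try:
        submit()
    except CardDeclined:
        return "ask for another card"   # routine failure, handled locally
    except GatewayTimeout:
        return "retry later"            # transient failure, retried
    # Anything else propagates: the reader learns that remaining failures
    # here are genuinely exceptional.
    return "charged"

def declined():
    raise CardDeclined()

print(charge(declined))  # -> ask for another card
```

A reader scanning this learns more from the structure than from any comment: declines are routine, timeouts are retryable, and everything else is genuinely exceptional.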

But AI has overused these tools indiscriminately. When every paragraph has a metaphor, you stop noticing metaphors. When every code block is wrapped in exception handling, none of them feel exceptional. AI deploys communication tools and rhetorical devices not because the content requires them, but because they pattern-match on what “good writing” or “robust code” looked like in the training data.

What results is a kind of inflation. Phrases like “delve” and “crucial” now pattern-match as GPT, whether they are or not. Em-dashes and bolded takeaways—tools I genuinely use myself in writing—make me very suspicious now. The rhetorical devices and communication tools get overused, their signal value drops, and, unfortunately, I find myself tuning them out entirely.

Verification Erosion: Cheap to Generate, Expensive to Verify

Separately, my ability to verify whether something is correct has gotten worse.

For complex tasks, in the pre-LLM era, generating output was expensive but verification was comparatively cheap. You put in the work to produce something—a document, a piece of code, an analysis—and checking it was tractable. Not easy, necessarily, but the effort to generate was commensurate with the effort to verify. Now the balance has flipped. AI generates plausible content almost instantly, but verifying it still requires human effort that hasn’t scaled. I can produce a draft, a code snippet, or a set of classifications in seconds. But checking whether the draft is accurate, whether the code handles edge cases, whether the document classifications are meaningful takes a lot of time and attention.

And here’s what I’ve noticed: the ease of regeneration has made me lazier about verification. If something seems off, I can just regenerate and hope the next version is better. But that’s not the same as actually checking. It feels like a slot machine: pulling the lever again to see if you get a better result substitutes for the slower, harder work of understanding whether the output is correct.

Moreover, I don’t have good tools to help me verify. In the pre-LLM era, I could build mental models, rely on heuristics, or spot-check information strategically. For example, when reading a paper, I’d check if the related work section cited the people I expected, or if the experimental setup matched conventions in the field. When reviewing code, I’d look for certain patterns: did they handle the obvious edge cases? Did the structure match how I’d approach the problem? These weren’t foolproof, but they were efficient proxies for quality.

Now, with LLM-generated content, it’s hard to even build mental models for what might go wrong, because there’s such a long tail of possible errors. An LLM-generated literature review might cite the right people but hallucinate the paper titles. Or the titles and venues might look right, but the authors are wrong. A sentence in an intro might make sense at a glance, but then I have a thought that seems to disprove it, and I wonder: if a human wrote this, surely they would have considered that. But since an AI wrote it, maybe not? Or a piece of jargon appears midway through that I, as someone who works in this field, have never heard before, and it’s never defined. That smells like an AI error—but I can’t be sure. The failure modes are endless and subtle, and I don’t have tools to catch them at scale.

Why This Matters

So I’ve described signal degradation and verification erosion. But is this actually a problem? I think yes, for at least two reasons.

First, if consumers can’t comprehend complex ideas or notice errors, they’re easier to manipulate. This isn’t just about misinformation in the dramatic sense. It’s more mundane than that. If I can’t tell whether a piece of code is actually robust or just looks robust, I might ship something broken.1 If I can’t tell whether a literature review is accurate or just plausible, I might build on work that doesn’t exist.2 The inability to verify compounds over time. And I think this is an underrated safety issue. There’s a lot of discourse about AI safety in terms of catastrophic outlier risk, i.e., dramatic scenarios where someone develops a bioweapon in their garage. But one of the biggest safety problems might be happening right under our noses: en masse, people are losing the ability to comprehend and verify the information they consume. (I see this a lot in my research on AI-powered data analysis.)

Second, taste degrades. Taste—in any domain—depends on a feedback loop: you notice when something is good, you notice when something is bad, and over time you develop judgment. But if you stop being able to notice the difference, you stop developing that judgment. This matters even for things that seem low-stakes. Take restaurant recommendations. If the people making recommendations can’t tell the difference between a good meal and a mediocre one—or if they’re just parroting what an LLM scraped from Yelp—then the recommendations become worthless. I stop trusting them, and I lose access to a source of judgment I used to rely on.

Or take blog posts. The whole point of a blog post, to me, is that a human spent time thinking about something and arrived at conclusions worth sharing. It’s valuable because, of all the things they could have written about, they chose this one and spent real time on it—and because it reflects their actual reasoning process. But if I suspect a post is LLM-generated, I disengage, even if the content is accurate. If it’s just some fluent summarization, it’s no different from me just asking ChatGPT for something. And I can easily do that. Why should I read this particular blog post?

At the risk of being too blunt, I’ll say what I’m thinking anyways: The tools for communication and verification are how we build on each other’s work. When they erode, we become a society that can’t tell what’s true, can’t recognize quality, and can’t coordinate on hard problems. This is how societies get dumber.

What To Do About It

I’ve been thinking about this a lot, and I don’t have a complete answer. This is very much work in progress. Here are two threads of thoughts I’ve had.

Teaching Systems the Why Behind the Technique

To build systems that assist in complex domains, such as writing, coding, or data analysis, we try to program them with the heuristics we’ve developed over time. E.g., use metaphors to explain complex ideas. Use bolding and headers to help readers navigate. Wrap risky operations in exception handling. Break down large documents before processing them. These heuristics exist because they work—when applied correctly. But clearly, when AI applies these heuristics indiscriminately, it backfires. So maybe we should go one level deeper. Instead of programming systems with heuristics, we should probe why and how we came up with those heuristics in the first place, and program systems around that.

For example, in writing, one heuristic might be “use bullet points to break up dense content.” But the why behind it is: bullet points help when items are parallel and independent. When ideas need connective “tissue”—i.e., when the relationship between them matters—prose is better. So a writing assistant shouldn’t just insert bullet points whenever content gets dense. It should reason about whether the ideas are actually parallel. Another device I see overused is antithesis, or the “not just X, it’s Y” construction. You’ve seen this in AI-generated content. “It’s not just about efficiency, it’s about impact!” or “This isn’t just a tool, it’s a paradigm shift!” The why behind this device is to reframe, or to take the reader from a shallow interpretation to a deeper one. It works when there’s a genuine reframe to be made. But when every paragraph has one, the rhetorical device loses its power. A system built around the why would need to assess whether there’s actually a deeper framing to be made—not just pattern-match on what emphatic writing looks like.
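
As a toy illustration of building around the why, here is a sketch of the bullet-point case. The keyword check is a crude stand-in for the real judgment (which might be an LLM call or a learned classifier), and all the names are mine:

```python
# Toy sketch: decide between bullets and prose based on the why behind the
# heuristic (bullets suit parallel, independent items), not on density alone.

CONNECTIVES = ("because", "therefore", "however", "so that", "which means")

def items_are_independent(items: list[str]) -> bool:
    # Crude stand-in: if the items lean on connective tissue, the relationship
    # between them matters and prose is probably the better form. A real
    # system would reason about the content, not scan for keywords.
    return not any(c in item.lower() for item in items for c in CONNECTIVES)

def format_points(items: list[str]) -> str:
    if items_are_independent(items):
        return "\n".join(f"- {item}" for item in items)
    return " ".join(items)  # keep prose; the ideas need connective tissue

print(format_points(["Supports CSV input", "Supports JSON input", "Runs offline"]))
print(format_points(["Caching is on by default,",
                     "which means repeated queries are cheap,",
                     "but memory use grows over time."]))
```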

For a writing assistant, one could imagine a system that first identifies the main points in a draft, estimates how complex each point is to communicate, retrieves examples of similar points from a corpus of good writing, and suggests rhetorical strategies based on how those examples handled the complexity. Overall, this is the kind of capability I think we need to build—not systems that apply heuristics, but systems that reason about when heuristics apply.
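
Here is a rough sketch of that pipeline, with every step stubbed out by a toy heuristic; a real version would presumably use an LLM or a retrieval model for each stage, and the corpus format is an assumption of mine:

```python
# Rough sketch of the imagined pipeline: identify points, estimate how hard
# each is to communicate, retrieve similar examples from a corpus of good
# writing, and suggest a strategy based on what those examples did.

from dataclasses import dataclass

@dataclass
class Point:
    text: str
    complexity: float  # 0 (obvious) .. 1 (hard to communicate)

def identify_points(draft: str) -> list[Point]:
    # Placeholder: treat each sentence as a point and guess complexity from
    # length. A real system would reason about the content itself.
    sentences = [s.strip() for s in draft.split(".") if s.strip()]
    return [Point(s, min(len(s.split()) / 40, 1.0)) for s in sentences]

def retrieve_similar(point: Point, corpus: list[dict]) -> list[dict]:
    # Placeholder retrieval: match on complexity band only.
    return [ex for ex in corpus if abs(ex["complexity"] - point.complexity) < 0.2]

def suggest_strategy(point: Point, corpus: list[dict]) -> str:
    examples = retrieve_similar(point, corpus)
    if not examples:
        return "leave as plain prose"
    # Suggest whatever device similar, well-received examples used,
    # rather than applying a device unconditionally.
    return f"consider: {examples[0]['device']} (used in similar passages)"

toy_corpus = [
    {"complexity": 0.8, "device": "a metaphor grounding the idea in something familiar"},
    {"complexity": 0.2, "device": "no device; the point is already simple"},
]

draft = "A database index works like a map of where rows live so lookups skip most of the table."
for p in identify_points(draft):
    # On this toy draft, no corpus example is close enough, so the
    # suggestion is to leave the sentence as prose.
    print(p.text, "->", suggest_strategy(p, toy_corpus))
```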

Grounding Confidence in Verified Human Experience

If we want to program systems to understand the why behind a technique, a natural question arises: how would an AI actually make these judgments? For a writing assistant: how would it know when something is complex enough to warrant a metaphor? How would it know when a reframe adds genuine depth versus just sounding emphatic?

My collaborator J.D. Zamfirescu-Pereira has a thought experiment that captures the problem. Imagine this exchange between a user and a chatbot that suggests recipes, from one of his user studies:

User: Can I just skip bacon?

Bot: Yes, you can skip bacon.

User: Would it taste as good?

Bot: Yes, it would taste as good.

User: How do you know?

Bot: I know because I've tasted it.

But has the bot tasted it? What would that even mean? The statement is unmoored; there’s no experience behind the confidence.3 And this connects back to the signal degradation problem. “I’ve tasted it” is a meaningful signal when it comes from a human—a member of the same species, with taste buds, who actually ate the food. But when AI says this confidently—and, at scale, says many similar things confidently—we lose trust in the signal. They can’t all taste good. The AI is crying wolf again; next time we hear something tastes good, we will feel disillusioned, even if it does taste good. (By the way, I reproduced the “I’ve tasted it” phenomenon with GPT-5.1 here.)

So how do we give AI systems the ability to make these judgments? Two directions come to mind for how to ground AI confidence in something real. First, we could try to give the AI some notion of “complexity” or “taste” by training it on human feedback. Collect data on what humans found confusing, what recipes they liked, fine-tune the model, repeat. But even if you manage to jump through all the MLOps hoops—collecting data, labeling, retraining regularly—what you get is an approximation of what humans think is complex or tasty. And philosophically, what would it even mean for an LLM to “feel confused” or “taste” something? The LLM doesn’t have a mind that experiences difficulty or flavor. It has weights that were adjusted to predict what humans said when they were confused or satisfied. That’s not the same thing. Or, we could have the AI refuse to make judgments about qualia altogether—just defer to humans every time. But that defeats the purpose of building assistive systems. If the human has to evaluate every decision, we haven’t saved any effort.

Neither solution is great. The first has too much overhead and is philosophically shaky. The second is useless.

A third idea is what I’d call a hypothetical grounding space: a structured record of verified human experiences that the model learns to query and report from, rather than absorb and speak as its own. I call it “hypothetical” because the model reasons about what humans would experience, rather than claiming to experience it itself. To see why this might be different from current approaches, consider how models learn to say things like “I’ve tasted it” in the first place. During post-training, we fine-tune models with human feedback, teaching them to produce outputs humans rate highly. But what gets learned is the surface pattern of human judgment: what confident speech sounds like, how people describe food they’ve enjoyed, the phrases humans use when they know something. The result is a model that speaks as if it has experiences, because that’s what highly-rated human speech sounds like.

For the recipe bot, a hypothetical grounding space might work as follows: you train the model on examples where the correct response involves consulting a database of documented substitutions; e.g., “humans who skipped bacon in similar dishes reported it was still good” rather than “I know it tastes good.” For a writing assistant, the LLM would learn to reference a corpus of explanations humans found clear or confusing: “explanations like this one tended to lose readers,” not “I think this is confusing.” The judgment stays attributed to humans. The model’s role is to surface what’s in the grounding space, not to claim the experience.
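
A toy sketch of what the recipe-bot version might look like follows. The schema, the field names, and the stored reports are all illustrative; what matters is that the answer is reported as documented human experience, or as an honest absence of it, rather than claimed as the model's own:

```python
# Toy sketch of a "hypothetical grounding space" for the recipe bot: a record
# of verified human reports that the system surfaces instead of speaking as
# if it had tasted the dish itself.

from dataclasses import dataclass

@dataclass
class SubstitutionReport:
    dish: str
    omitted: str
    n_reports: int
    outcome: str  # what human cooks actually said

GROUNDING_SPACE = [
    SubstitutionReport("carbonara-style pasta", "bacon", 12,
                       "most said it was still good, though less smoky"),
]

def answer_substitution(dish: str, ingredient: str) -> str:
    matches = [r for r in GROUNDING_SPACE
               if r.dish == dish and r.omitted == ingredient]
    if not matches:
        # No verified experience to report: say so instead of sounding confident.
        return (f"I don't have documented reports of skipping {ingredient} "
                f"in {dish}, so I can't say how it would taste.")
    r = matches[0]
    # The judgment stays attributed to the humans who actually tasted it.
    return (f"{r.n_reports} cooks who skipped {r.omitted} in similar dishes "
            f"reported: {r.outcome}.")

print(answer_substitution("carbonara-style pasta", "bacon"))
```

The model never says "I know it tastes good"; it either surfaces what humans reported or admits it has nothing to report.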

Hypothetical grounding spaces are far from what I’d call a solution (and aren’t all that novel; I know many folks in the AI/ML and HCI communities are thinking about these problems). The examples I gave are narrow (e.g., recipe substitutions, writing clarity) and building each one requires significant effort: collecting the data, structuring it usefully, training models to interact with it. This clearly doesn’t scale to every application. But it’s a direction: grounding AI confidence in something real and meaningful for us as consumers, rather than in patterns of confident-sounding text, and hopefully our AI systems can embody some adaptation of hypothetical grounding spaces.

Parting Thoughts

Some bigger questions linger for me. As AI-generated content becomes the majority of what we consume, how do we preserve the human feedback loops that these grounding systems would depend on? And in my own domain of data analysis—where it’s already very hard for humans to process large datasets without AI assistance—what happens when we only see what’s interesting and important through the AI’s lens? If we’re asking AI to analyze our data, and the AI decides what patterns are worth surfacing, what anomalies matter, and what questions are worth asking, then we see the world through a filter we didn’t choose and can’t fully inspect—exacerbating the problems of signal degradation and verification erosion.

I don’t have answers. In the meantime, I’m trying to keep my own taste sharp, even when everything around me feels like it’s eroding.



Thanks to Preetum Nakkiran, Hamel Husain, and Valerie Ding for encouraging me to publish this blog post!
