The L in "LLM" Stands for Lying

原始链接: https://acko.net/blog/the-l-in-llm-stands-for-lying/

Generative AI faces a core problem: it depends on vast quantities of potentially copyrighted material, effectively constituting mass plagiarism. The resulting "vibe coding" produces generic output with no discernible originality, creating legal ambiguity; current labeling and watermarking schemes are remedial gestures rather than real accountability. The author argues that no court should have ruled on AI copyright at all, because AI output is inherently unsourced and should be treated as a forgery unless proven otherwise. The solution is to *require* LLMs to perform accurate source attribution alongside inference. Today, LLM citations are mere "role-play," emerging from patterns in the data rather than genuine understanding. Implementing proper attribution is technically daunting and would demand major changes to model architecture and compute. Yet it matters: it would expose how much code is simply copied, reveal the true nature of AI-generated content, and confront the fundamental "sloppiness" of a technology built on unknown origins.


Original Text

So it's no wonder artists would denounce generative AI as mass-plagiarism when it showed up. It's also no wonder that a bunch of tech entrepreneurs and data janitors wouldn't understand this at all, and would in fact embrace the plagiarism wholesale, training their models on every pirated shadow library they can get. Or indeed, every code repository out there.

If the output of this is generic, gross and suspicious, there's a very obvious reason for it. The different training samples in the source material are themselves just slop for the machine. Whatever makes the weights go brrr during training.

This just so happens to create the plausible deniability that makes it impossible to say what's a citation, what's a hallucination, and what, if anything, could be considered novel or creative. This is what keeps those shadow libraries illegal, but ChatGPT "legal".

Labeling AI content as AI generated, or watermarking it, is thus largely an exercise in ass-covering, and not in any way responsible disclosure.

It's also what provides the fig leaf that allows many a developer to knock off for an early lunch and an early dinner every day, while keeping the meter running, without ever questioning whether the intellectual property clauses in their contract still mean anything at all.

This leaves the engineers in question in an awkward spot, however. In order for vibe-coding to be acceptable and justifiable, they have to consider their own output disposable, highly uncreative, and not worthy of credit.

* * *

If you ask me, no court should have ever rendered a judgement on whether AI output as a category is legal or copyrightable, because none of it is sourced. The judgement simply cannot be made, and AI output should be treated like a forgery unless and until proven otherwise.

The solution to the LLM conundrum is then as obvious as it is elusive: the only way to separate the gold from the slop is for LLMs to perform correct source attribution along with inference.

This wouldn't just help with the artistic side of things. It would also reveal how much vibe code is merely copy/pasted from an existing codebase, while conveniently omitting the original author, license and link.

With today's models, real attribution is a technical impossibility. The fact that an LLM can even mention and cite sources at all is an emergent property of the data that's been ingested, and the prompt being completed. It can only do so when appropriate according to the current position in the text.

There's no reason to think that this is generalizable; rather, it is far more likely that LLMs are merely good at citing things that are frequently and correctly cited. It's citation role-play.

The implications of sourcing-as-a-requirement are vast. What does backpropagation even look like if the weights have to be attributable, and the forward pass auditable? You won't be able to fit that in an int4, that's for sure.
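To get a feel for the int4 remark, here is a back-of-the-envelope sketch. The numbers are illustrative assumptions, not anything from the post: suppose each weight carried even a minimal provenance record, a 32-bit document ID plus a 16-bit contribution score.

```python
# Back-of-the-envelope: storage cost of per-weight provenance vs. the weight itself.
# All figures below are illustrative assumptions, not measurements.

INT4_BITS = 4            # a quantized weight: half a byte
PROVENANCE_BITS = 32 + 16  # hypothetical: 32-bit source ID + 16-bit contribution score

def overhead_factor(weight_bits: int, metadata_bits: int) -> float:
    """How many times larger the attribution metadata is than the weight."""
    return metadata_bits / weight_bits

factor = overhead_factor(INT4_BITS, PROVENANCE_BITS)
print(f"Provenance metadata is {factor:.0f}x the size of an int4 weight")

# For a 70B-parameter model, that metadata alone would occupy:
params = 70e9
metadata_gb = params * PROVENANCE_BITS / 8 / 1e9
print(f"~{metadata_gb:.0f} GB of metadata for a 70B-parameter model")
```

Even this minimal record is an order of magnitude larger than the weight it annotates, and a single source ID per weight is almost certainly too coarse to be a real attribution anyway, which is the point: it won't fit in an int4.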

Nevertheless, I think this would be quite revealing, as this is what "AI detection tools" are really trying to solve for backwards. It's crazy that the next big thing after the World Wide Web, and the Google-scale search engine to make use of it, was a technology that cannot tell you where the information comes from, by design. It's... sloppy.

To stop the machines from lying, they have to cite their sources properly. And spoiler, so do the AI companies.
