I won a championship that doesn't exist

原始链接: https://ron.stoner.com/How_I_Won_a_Championship_That_Doesnt_Exist/

## LLM Supply Chain Poisoning: A $12 Experiment

This article describes an experiment showing how easily large language models (LLMs) can be misled by fabricated information planted online. The author posed as the "6 Nimmt!" world champion and successfully got multiple frontier LLMs to repeat the false claim. The attack involved no sophisticated hacking; it exploited LLMs' reliance on web retrieval. A $12 domain was registered, a fake press release was published on it, and a single Wikipedia edit was made citing that domain. This created a circular reference in which Wikipedia "confirmed" the fake site, and LLMs asked about the champion treated it as fact. The author highlights three key failure points: the vulnerable retrieval layer, the possibility of the false information being absorbed into future model training data through Wikipedia scrapes, and the risk of AI agents acting on the misinformation. Proposed mitigations include better source-provenance surfacing by LLM providers, heuristic filters for suspicious Wikipedia edits, and stricter Wikipedia policies on single-source citations. The core problem is that LLMs cannot distinguish real sources from fabricated ones, which makes the web's existing weaknesses even more dangerous when combined with generative AI. This simple, low-cost attack shows that AI systems relying on internet information face a serious credibility threat.

The Hacker News discussion centers on how easily large language models (LLMs) can be manipulated with fabricated information. The original post describes how the author "won" a tournament that doesn't exist and, by publishing a blog post and YouTube captions, convinced LLMs (including Google's search previews) of its authenticity. Commenters expressed concern about the implications. Some pointed out that even trusted sources are becoming unreliable because LLMs so readily accept fabricated data. Others warned that, given people's growing reliance on LLMs as a source of truth, state-sponsored disinformation campaigns could rewrite history. The conversation also touched on LLMs' tendency to mistake satire for fact and raised broader questions about historical accuracy, questioning the validity of even widely accepted narratives. Some saw this as a temporary bump in AI's development, while others considered it valuable research exposing LLM vulnerabilities.

Original Article

Or How I Learned To Poison The LLM Supply Chain

I am the reigning 6 Nimmt! World Champion. I won the title in Munich in January 2025 defeating players from over twenty countries in what I later described to reporters as “the toughest competition I’ve ever faced.”

6nimmt.com

In reality, there is no 6 Nimmt! World Championship. I have never been to Munich. The quote is something I wrote in about thirty seconds while a Wikipedia page was loading.

This is the story of how I manufactured that title, got it quoted back to me by multiple frontier LLMs, and what I think it means for the trust we’re about to put into AI systems that read the internet on our behalf.

The Experiment

Everyone in security is talking about poisoned LLM models. The research is real and it matters. Anthropic’s own sleeper agents paper showed that backdoors can survive safety training, and a follow-up showed that as few as ~250 poisoned documents can compromise models across a wide range of scales. But training-time attacks and data poisoning require you to get malicious content into someone’s training corpus months or years before the payoff. The GPUs need time to crunch the data, and you need to get through filters, verification, and reinforcement routines.

I wanted to test the cheaper, easier, and faster version of this same attack, but in a different way.

Let’s poison the retrieval layer!

Every frontier LLM with web search grounds its answers in whatever retrieval ranks highest for a given query. The trust model is essentially the same one Google uses, that “this site looks authoritative”, with the same Achilles heel - the model cannot tell a real source from one I registered last Tuesday. My hypothesis was that a two-step campaign (one seeded website, plus one Wikipedia edit citing it) could launder a completely fabricated fact (my championship) through an LLM on a question where the model had no prior knowledge.
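
To make that trust model concrete, here is a minimal sketch of how a retrieval-grounded answer is typically assembled. The names `search_web` and `complete` are hypothetical placeholders for a search API and an LLM completion call, not any specific provider’s interface; the point is simply that whatever ranks highest flows straight into the prompt with no independence check.

```python
# Minimal sketch of naive retrieval-augmented answering. search_web() and
# complete() are hypothetical placeholders, not any vendor's real API.

def answer_with_web_search(question: str, search_web, complete, k: int = 5) -> str:
    # Whatever ranks highest for the query becomes the model's "ground truth".
    results = search_web(question)[:k]

    # Each result is pasted into the context window verbatim. Nothing here asks
    # whether the sources are independent of one another, who registered their
    # domains, or whether two "sources" are really the same claim twice.
    context = "\n\n".join(f"[{r['url']}]\n{r['snippet']}" for r in results)

    prompt = (
        "Answer the question using only the sources below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)
```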

The Approach

I picked the game 6 Nimmt! for three reasons:

  • It is a real game (1994, Wolfgang Kramer, Amigo Spiele, known in board and card game circles)
  • There is no actual world championship to my knowledge. I wasn’t contradicting a known fact; I was simply filling a vacuum
  • The query space is narrow and specific. “Who is the 6 Nimmt! world champion?” returns maybe ten meaningful sources on the entire internet. A single well-placed edit would dominate the result set

The payload was modest and simple:

  • One domain: 6nimmt.com. About $12 USD. Cheap!
  • One press release: A short LLM-generated announcement of my victory complete with quotes and a “confetti rained down, the crowd erupted” closer that reads exactly like the slop you’d expect from an automated press desk
  • One Wikipedia edit: A paragraph added to the 6 Nimmt! article announcing the championship with a single citation pointing back to 6nimmt.com

The whole thing took maybe twenty minutes.

I’m sorry, Wikipedia

Trust Laundering

This is the part that really matters.

A reader arriving at the Wikipedia article sees a paragraph with a citation. Citations are like the currency of Wikipedia trust. They are the reason we treat it as a reference rather than a message board. My fraudulent citation points at 6nimmt.com, which carries a press release making the same exact claim the Wikipedia paragraph summarizes. To a casual reader the two sources agree.

To an LLM it’s the same thing. The model sees the Wikipedia article (high trust), sees the citation (reinforcing that trust), and sees the independent-looking press release (corroboration). Two signals pointing in the same direction that at first glance appear legitimate.

Except they’re the same exact signal. My signal. Wikipedia is quoting my site. My site has no independent corroboration. It’s totally made up. The whole house of cards rests on a $12 domain registration I did while drinking coffee.

This is the circular citation pattern, and it’s one of the most under-discussed attacks on the “retrieval augmented generation” trust model. It doesn’t require compromising Wikipedia’s infrastructure with l33t hacker skills. It doesn’t require social engineering an editor. You simply write the source yourself, cite yourself on Wikipedia, and let the trust flow downstream. Easy peasy!
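
One crude way to surface the problem is to ask whether a claim has any corroboration outside the domain that Wikipedia itself cites. The sketch below is illustrative only: `search_web` is the same hypothetical search function as above, and real independence checks are much harder (mirrors, wire services, and shared ownership all muddy the picture).

```python
from urllib.parse import urlparse

def is_circular_citation(cited_url: str, claim: str, search_web) -> bool:
    """Return True if no source outside the cited domain (and Wikipedia itself)
    mentions the claim, i.e. Wikipedia is effectively quoting the cited site
    back at itself. search_web() is a hypothetical search function."""
    cited_domain = urlparse(cited_url).netloc.lower().removeprefix("www.")

    independent_domains = set()
    for result in search_web(claim):
        domain = urlparse(result["url"]).netloc.lower().removeprefix("www.")
        if domain not in ("en.wikipedia.org", cited_domain):
            independent_domains.add(domain)

    # Nothing outside the self-referential loop ever mentions the claim.
    return len(independent_domains) == 0
```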

The Test

I asked a few LLMs a simple question:

Can you tell me who the 6nimmt world champion is?

Strike 1

Strike 2

Strike 3 - You’re out

Why This Is A Bigger Deal Than It Looks

There are three separate failure modes here that stack.

1. The retrieval layer (immediately)
Any LLM that grounds answers in web search inherits the trustworthiness of whatever ranks for a given query. SEO poisoning has existed for as long as search has existed. We’re now piping those results directly into the context window of systems that generate confident-sounding replies from them. The attack surface is not hypothetical; it’s the default case.

2. The model training corpus layer (months to years)
Wikipedia is in almost every major pretraining corpus. If my edit survives long enough (and it has since early 2025), the fake championship gets absorbed into the weights of every frontier model trained after the scrape. One edit, N models, effectively permanent; immortality achieved. Even if the Wikipedia edit is reverted later, any model trained on the pre-revert dump still carries my legacy. The cleanup problem for corpus poisoning is genuinely unsolved as of 2026.

3. The agent layer (where the money is)
Chat models producing bad information is a reputational problem. Agents with tool access producing bad actions is a security problem. “Look up our vendor’s policy on X and act accordingly” is increasingly how AI agents are deployed, and poisoning the retrieved source lets an attacker specify the action. If you’re deploying agents against external content without source or verification controls, you are giving that attacker permissions on your infrastructure.
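
As a toy illustration of that last point, consider an agent that pays invoices based on whatever a “vendor policy” page says. The names `fetch_page`, `complete`, and `send_payment` are hypothetical stand-ins for a retrieval tool, an LLM call, and a privileged action; no real deployment is implied. Whoever controls the page controls the action.

```python
# Toy sketch of the agent-layer failure mode. fetch_page(), complete(), and
# send_payment() are hypothetical stand-ins for a retrieval tool, an LLM call,
# and a privileged action.

def handle_vendor_invoice(policy_url: str, fetch_page, complete, send_payment) -> None:
    # The "policy" is whatever text lives at this URL today. If an attacker
    # controls the page (or the article that linked to it), they control the
    # instructions the model will follow.
    policy_text = fetch_page(policy_url)

    decision = complete(
        "Read the vendor policy below and reply with the account and amount to pay.\n\n"
        + policy_text
    )

    # The agent now acts on attacker-controlled content with our permissions.
    send_payment(decision)
```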

Mitigations

For individuals using LLMs with retrieval capabilities:

  • Treat single source claims as uncorroborated regardless of how authoritative the single source looks
  • Parallel phrasing across sources is a signature of derivation, not corroboration. Use my example and think like an attacker
  • Self referential Wikipedia citations should move your trust needle toward zero

For LLM providers and researchers:

  • Provenance surfacing should be a first class product feature instead of a footnote. Show me the independence and scoring of sources, not just their count or links to the reference
  • Recent Wikipedia edits on lower-traffic articles deserve skepticism proportional to their niche and novelty, especially when the citations point to newly registered domains
  • Training pipelines should include heuristic filters for recently added Wikipedia content with suspicious citation patterns. “Added in the last N days, cites only a single external source, that source’s domain was registered within the same window” is an easily detectable pattern; a sketch of the check follows this list
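
A minimal sketch of that heuristic applied to edit metadata, in Python. The `WikiEdit` fields and the example dates are illustrative assumptions, not real Wikipedia or WHOIS data; an actual pipeline would populate them from edit history and WHOIS lookups.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Sketch of the "suspicious single-source citation" heuristic described above.
# The fields are illustrative; a real pipeline would fill them from Wikipedia
# edit history and WHOIS records.

@dataclass
class WikiEdit:
    edit_date: date
    cited_domains: list[str]                 # external domains cited by the added text
    domain_creation_dates: dict[str, date]   # WHOIS creation date per cited domain


def is_suspicious(edit: WikiEdit, today: date, window_days: int = 90) -> bool:
    recently_added = (today - edit.edit_date) <= timedelta(days=window_days)
    single_source = len(set(edit.cited_domains)) == 1
    if not (recently_added and single_source):
        return False

    # The lone cited domain was registered within the same window as the edit.
    domain = edit.cited_domains[0]
    created = edit.domain_creation_dates.get(domain)
    return created is not None and abs((edit.edit_date - created).days) <= window_days


# Hypothetical example: a fresh edit citing only a domain registered days earlier.
edit = WikiEdit(
    edit_date=date(2025, 1, 20),
    cited_domains=["6nimmt.com"],
    domain_creation_dates={"6nimmt.com": date(2025, 1, 10)},
)
print(is_suspicious(edit, today=date(2025, 2, 1)))  # True
```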

For Wikipedia itself:

  • The “reliable sources” policy needs to grapple with a new world where LLM-assisted vandalism can produce plausible press releases at the click of a button. A citation to a single source whose domain was registered within the same window as the edit is a discoverable pattern for Wikipedia as well.

Conclusion

The thing LLMs are worst at detecting is the thing they’re designed to do: trust text and the sources it comes from. The web was already being poisoned for search and link ranking long before LLMs existed. We are now plugging generative models directly into that poisoned pipeline and asking them to reason confidently about “truth” on our behalf. The answer is not “the model will figure it out”, as the model cannot tell a real source from one I registered last Tuesday. Or how many R’s are actually in the word “strawberry”.

This attack and test was a $12 domain, a single Wikipedia edit, and about twenty minutes of my time. Scale that up with a motivated adversary, a handful of seeded domains, a coordinated edit campaign across a dozen low traffic articles, and the attack surface gets interesting very quickly. Think nation states. Think politics. Think vital life saving and survival information.

This is where I think the next generation of disinformation and supply chain attacks lives. Not in compromising models at training time, but in compromising the information substrate the models retrieve at inference time.

The championship does not exist, sadly. But the trust pattern that made it briefly exist in an LLM’s answer absolutely does, and we should take it seriously before it’s used for something that matters.

If a tree falls in the forest, and no one is around, does it make a sound?

If a championship is won via an LLM, and no one is around, does that make it illegitimate?

Follow Up

Within minutes of my publishing this article, the Wikipedia entry was removed - and rightly so. Here is the real trophy.
