2,218 Gary Marcus Claims About AI, Checked Against the Evidence (Dataset)
Marcus AI Claims Dataset

Original link: https://github.com/davegoldblatt/marcus-claims-dataset

A comprehensive analysis of 2,218 verifiable claims published on Substack by prominent AI skeptic Gary Marcus (May 2022 through March 2026) finds that he is right more often than you might expect. The study used two LLM pipelines (Claude and ChatGPT) to score each claim against the available evidence, finding nearly 60% of claims supported, 33.7% mixed, and only 6.4% contradicted. Marcus excels at identifying *specific* technical flaws: his predictions about LLM security, Sora's reliability, and the prematurity of AI agents were highly accurate. His *market predictions*, however, are markedly less reliable, in particular his calls that the AI bubble would burst, where 27% of claims were contradicted. Interestingly, Marcus devotes more ink to the areas where he is *wrong*: the analysis shows that the more attention he pays to a topic, the more likely his predictions about it are to be contradicted. The full dataset, methodology, and LLM evaluations are publicly available, but the verdicts are LLM-scored, not human-verified.

## Hacker News Discussion: Evaluating Gary Marcus's AI Claims

A recent discussion on Hacker News centered on a dataset (github.com/davegoldblatt) that attempts to score the accuracy of Gary Marcus's AI predictions. The creator acknowledges that the scoring is currently done by LLMs rather than humans, which raised concerns about reliability and potential bias, especially given that the source evidence consists largely of corporate AI blog posts.

Some commenters questioned the methodology, arguing that it over-weights easily verifiable claims while missing Marcus's core thesis: that deep learning is not a guaranteed path to AGI and may be overhyped. Some felt Marcus has tracked AI's progress accurately; others pointed out that the imminent "collapse" he keeps predicting has not materialized.

A recurring theme was whether Marcus's critiques are genuinely predictive or merely uncontroversial observations (e.g., that LLMs make mistakes) framed as larger, unfalsifiable claims. On the prospect of an AI "winter," some were skeptical, predicting that even if the market corrects, the large companies will keep thriving. The dataset's creator welcomed feedback and encouraged users to spot-check the LLM's scoring.

Original post

Gary Marcus is the most prolific AI skeptic on the internet. Since May 2022, he's published 474 posts on Substack making claims about AI's limitations, the companies building it, and where the industry is headed.

We extracted every testable claim. 2,218 of them. Scored each one against the evidence as of March 2, 2026. Here's what the data shows.

## He's more right than wrong

Among claims where the evidence is checkable:

  • 59.9% supported
  • 33.7% mixed
  • 6.4% contradicted

That's not the number most people expect from either side.

## Where he's right and where he's wrong

His best work is specific and technical. When Marcus points at a broken thing and says "that's broken," the evidence backs him up almost perfectly. LLM security vulnerabilities: 100% supported, 0% contradicted. Sora video unreliable: 90% supported, 0% contradicted. Agents premature for production: 88% supported, 0% contradicted. Across those three clusters, not a single claim was contradicted by the evidence.

## Best and worst clusters

His worst work is market prediction. "GenAI bubble will burst": 27% contradicted, his single worst cluster out of 54. He went from "potential AI winter" (2023) to "greatest capital destruction in history" (2025) to "the whole thing was a scam" (Feb 2026). The crash hasn't come.

He writes more about the thing he's getting wrong. His hallucination cluster (most vindicated thesis) spiked when he established it in early 2023, then settled into a steady drumbeat. His bubble cluster (most contradicted) went from near-zero in 2023 to his highest quarterly output in Q4 2025.

## Accuracy by year

Two LLM pipelines analyzed the same corpus, then a reconciliation layer compared their outputs:

  • Claude Code (Opus 4.6): 2,218 individual claims, 54 clusters. Claim-level granularity, willing to render verdicts.
  • Codex (ChatGPT): 164 themes, 11 categories. Theme-level, conservative — defaults to "unresolved" unless clear cross-vendor evidence exists.

A hybrid reconciliation layer maps both into a unified view. Full methodology in DATASET_GUIDE.md.
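The division of labor above suggests a simple merge rule: trust the conservative pipeline when it commits to a verdict, and fall back to the granular one when it abstains. Here is a minimal sketch of one such rule in Python, assuming verdict labels like `supported`/`mixed`/`contradicted`/`unresolved`; the project's actual decision rules live in DATASET_GUIDE.md and may differ.

```python
# Hypothetical reconciliation of a claim-level Claude verdict with a
# theme-level Codex verdict. Labels and rules are illustrative only,
# not the project's actual decision rules (see DATASET_GUIDE.md).
def reconcile(claude_verdict: str, codex_verdict: str) -> str:
    if codex_verdict == "unresolved":
        return claude_verdict   # Codex abstained; keep the granular call
    if claude_verdict == codex_verdict:
        return claude_verdict   # both pipelines agree
    return "mixed"              # disagreement downgrades to mixed

print(reconcile("supported", "unresolved"))  # supported
```

The asymmetry mirrors the pipelines' described temperaments: Codex's "unresolved" default means its abstentions carry no signal, while an outright disagreement between the two is treated as genuine uncertainty.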

| File | What it gives you |
| --- | --- |
| DATASET_GUIDE.md | Full methodology, decision rules, file inventory |
| outputs/chatgpt/tables/chatgpt_hybrid_reconciliation.csv | Canonical reconciled view — both pipelines per theme |
| claude/claude_analysis_memo.md | Narrative findings with scorecard and goalpost analysis |
| claude/claude_claims_final.jsonl | Every claim with verbatim quotes, scores, and cluster assignment |
| claude/claude_claims_canonical.csv | One row per cluster with aggregate stats and revision notes |
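Since claude_claims_final.jsonl is one JSON object per line, per-cluster verdict breakdowns like the percentages quoted above can be recomputed from it directly. A minimal Python sketch, using hypothetical field names (`cluster`, `verdict`) and toy records — the actual schema is documented in DATASET_GUIDE.md:

```python
import json
from collections import Counter

# Toy records mimicking claude_claims_final.jsonl lines (field names assumed).
SAMPLE = [
    {"cluster": "llm-security", "verdict": "supported"},
    {"cluster": "llm-security", "verdict": "supported"},
    {"cluster": "genai-bubble", "verdict": "contradicted"},
    {"cluster": "genai-bubble", "verdict": "mixed"},
]

def verdict_breakdown(claims):
    """Percentage of each verdict within each cluster."""
    by_cluster = {}
    for claim in claims:
        by_cluster.setdefault(claim["cluster"], Counter())[claim["verdict"]] += 1
    return {
        cluster: {v: round(100 * n / sum(counts.values()), 1)
                  for v, n in counts.items()}
        for cluster, counts in by_cluster.items()
    }

# With the real file:
#   with open("claude/claude_claims_final.jsonl") as f:
#       claims = [json.loads(line) for line in f]
print(verdict_breakdown(SAMPLE))
```

Running the same aggregation over the real file and comparing against claude_claims_canonical.csv is one way to spot-check the published cluster stats.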

The raw posts (posts/*.txt) are excluded — copyrighted Substack content. A proof bundle verifies all 474 posts were processed without publishing full text.

All verdicts are LLM-scored, not human-verified. "Supported" means "supported according to an LLM evaluating evidence available as of March 2026." Spot-check against source posts before citing specific claims. See Known Limitations for the full list.

Full essay with commentary: The Most Expensive Kind of Correct

David Goldblatt, with Claude Code (Opus 4.6) and Codex (ChatGPT) running independent pipelines. Built 2026-03-02 in a single session. Provenance: claude/claude_AUDIT_LOG.md.
