EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

Original link: https://esolang-bench.vercel.app/

Current LLM code-generation benchmark results are inflated because evaluation focuses on popular languages such as Python, where models have likely *memorized* solutions from their vast training data rather than genuinely *reasoned*. To address this, the researchers created **EsoLang-Bench**, a new benchmark built on five obscure "esoteric" programming languages for which training data is extremely scarce (5,000 to 100,000 times less than for Python). Testing five leading LLMs revealed a dramatic performance drop: accuracy fell to just 3.8%, versus roughly 90% on comparable Python problems. Models struggled even on easy tasks, failed completely on harder ones, and could not solve Whitespace at all. Even techniques such as self-reflection brought no improvement. EsoLang-Bench highlights a significant gap between reported LLM capabilities and genuine programming skill, suggesting that current benchmarks overestimate true reasoning ability.

A new benchmark, **EsoLang-Bench**, evaluates the ability of large language models (LLMs) to reason in *esoteric* programming languages such as Unlambda, Brainfuck, and Whitespace. Early results, discussed on Hacker News, show that even very capable models (such as Qwen-235B) perform surprisingly poorly. Users were struck that models struggle with languages a human could seemingly master after learning the underlying concepts (such as lambda calculus). One commenter noted Brainfuck's difficulty despite its semantic similarity to C. A possible explanation lies in how LLMs tokenize code: esoteric languages built from many single-character keywords may pose a challenge for reasoning. One suggestion was to modify the benchmark to use single-token keywords for these languages and see whether performance improves. The benchmark aims to evaluate "genuine reasoning" beyond typical coding tasks.
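The tokenization point is easier to appreciate next to Brainfuck's actual semantics: the whole language is eight single-character commands over a byte tape, so a model must track tape state symbol by symbol rather than lean on familiar multi-character keywords. A minimal interpreter sketch in Python (the function name `run_brainfuck` and the conventional 30,000-cell tape are illustrative assumptions, not part of the benchmark):

```python
def run_brainfuck(code: str, stdin: bytes = b"") -> bytes:
    """Interpret Brainfuck: 8 single-character commands over a byte tape."""
    # Precompute matching bracket positions for [ and ].
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape = bytearray(30000)  # conventional tape size
    ptr = pc = inp = 0
    out = bytearray()
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1                            # move tape pointer right
        elif c == "<":
            ptr -= 1                            # move tape pointer left
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256   # increment current cell
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256   # decrement current cell
        elif c == ".":
            out.append(tape[ptr])               # output current cell
        elif c == ",":
            tape[ptr] = stdin[inp] if inp < len(stdin) else 0  # read one byte
            inp += 1
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]                      # skip loop body if cell is zero
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]                      # repeat loop while cell nonzero
        pc += 1
    return bytes(out)

# "A" is ASCII 65: 8 loop iterations of 8 increments (64), one more, then output.
print(run_brainfuck("++++++++[>++++++++<-]>+."))  # prints b'A'
```

Even this trivial "print one letter" program requires arithmetic bookkeeping across a loop, which is the kind of fine-grained state tracking the commenters suspect token-level models handle poorly.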

Original text

Current benchmarks for large language model (LLM) code generation primarily evaluate mainstream languages like Python, where models benefit from massive pretraining corpora. This leads to inflated accuracy scores that may reflect data memorization rather than genuine reasoning ability. We introduce EsoLang-Bench, a benchmark of 80 programming problems across five esoteric languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) where training data is 5,000 to 100,000x scarcer than Python.

We evaluate five frontier models using five prompting strategies and two agentic coding systems. The best-performing model achieves only 3.8% overall accuracy, compared to ~90% on equivalent Python tasks. All models score 0% on problems above the Easy tier, Whitespace remains completely unsolved (0% across all configurations), and self-reflection provides essentially zero benefit. These results reveal a dramatic gap between benchmark performance on mainstream languages and genuine programming ability, suggesting that current LLM code generation capabilities are far narrower than headline metrics imply.
