Which programming languages are most token-efficient?

Original link: https://martinalderson.com/posts/which-programming-languages-are-most-token-efficient/

## The Future of Programming Languages in an AI-Driven World

As AI agents write more and more code, a new factor may influence language choice: **token efficiency**. Large language models (LLMs) have limited context windows, and a more efficient language needs fewer tokens to represent the same code, enabling longer, more productive sessions.

An analysis of code solutions from the RosettaCode project, using the GPT-4 tokenizer, reveals striking differences. **J**, an array language that uses ASCII, proved the most token-efficient, followed by dynamic languages such as Clojure. Surprisingly, functional languages such as Haskell and F# were also highly efficient, thanks to powerful type inference. C, by contrast, was the most verbose.

The study highlights that dynamic typing and concise syntax both contribute to token efficiency. Although terse symbol-based languages like APL *look* efficient, their unusual characters tokenize poorly. Ultimately, choosing a language like Haskell or F# could meaningfully extend development sessions under LLM constraints, potentially reshaping software engineering priorities. The analysis suggests that code verbosity, once a minor concern, may become critical in the era of AI-assisted coding.

A Hacker News discussion, sparked by a recent article (martinalderson.com), centered on the **token efficiency of programming languages**, especially in the context of AI code generation and large codebases. Key points: **J** (an APL dialect) is highly token-efficient, but may be held back by the limited training data available to AI models, compared with languages such as JavaScript and Python that have far more. Many commenters argued that **the volume of training data currently matters more than a language's intrinsic efficiency**. For large projects, the discussion stressed that *the limit is not necessarily context size* but *how that context is used*. Tools like **Cursor**, along with techniques such as exploiting module interfaces (e.g. .h files in C++) and AI-generated summaries, were seen as promising ways to manage context effectively. The consensus was that **autonomous coding tools are still at an early stage**, with significant room for improvement in efficiency. Finally, some speculated that future code compression techniques may make syntax differences less important.

Original Article

I've been trying to think through what happens to programming languages and tooling if humans are increasingly no longer the ones writing code. I wrote recently about how good agents are at porting code, and it got me thinking a bit more about the constraints LLMs have vs humans.

One of the biggest constraints LLMs have is context length. This is a difficult problem to solve, as memory usage rises significantly with longer context windows in current transformer architectures. And with the current memory shortages, the world is hardly drowning in spare capacity right now.

As such, for software development agents, how 'token efficient' a programming language is could actually make a big difference, and I wonder if it starts becoming a factor in language selection in the future. Given that a significant amount of a coding agent's context window is going to be code, a more token-efficient language should allow longer sessions and require fewer resources to deliver the same work.

We've seen TOON (a more token-efficient encoding of JSON), but what about programming languages?

Methodology

I came across the RosettaCode project while doing some research around this. It describes itself as a programming chrestomathy site (which I love, by the way). It has over a thousand programming 'tasks' that people solve in various languages, with contributions in nearly 1,000 different programming languages.

I found a GitHub mirror of the dataset, so I grabbed Claude Code and asked it to make a comparison using the Xenova/gpt-4 tokenizer from Hugging Face - a community port of OpenAI's GPT-4 tokenizer.

I then told Claude Code to suggest a selection of the most popular programming languages (which roughly matches my experience), find the tasks that had solutions contributed in all 19 of these languages, and run them through the tokenizer. I didn't include TypeScript because there were very few TypeScript tasks in the Rosetta Code dataset.
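For anyone wanting to reproduce the idea, here's a minimal sketch of that comparison in TypeScript. The directory layout (`solutions/<task>/<language>.txt`) is a hypothetical stand-in for however the mirror is actually organised; only the Xenova/gpt-4 tokenizer is taken from the article.

```typescript
import { AutoTokenizer } from "@xenova/transformers";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

async function main() {
  // Community port of OpenAI's GPT-4 tokenizer, as used in the article.
  const tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt-4");

  const totals = new Map<string, { tokens: number; tasks: number }>();

  // Hypothetical layout: solutions/<task>/<language>.txt
  for (const task of readdirSync("solutions")) {
    for (const file of readdirSync(join("solutions", task))) {
      const language = file.replace(/\.txt$/, "");
      const source = readFileSync(join("solutions", task, file), "utf8");
      // encode() returns token ids; its length is the token count.
      const count = tokenizer.encode(source).length;
      const entry = totals.get(language) ?? { tokens: 0, tasks: 0 };
      entry.tokens += count;
      entry.tasks += 1;
      totals.set(language, entry);
    }
  }

  // Average tokens per task, most token-efficient language first.
  [...totals.entries()]
    .map(([lang, { tokens, tasks }]) => ({ lang, avg: tokens / tasks }))
    .sort((a, b) => a.avg - b.avg)
    .forEach(({ lang, avg }) => console.log(`${lang}: ${avg.toFixed(1)} tokens/task`));
}

main();
```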

There are many, many potential limitations and biases in this dataset and approach! It's meant as an interesting look at somewhat like-for-like solutions to some programming tasks, not a scientific study.

Results

[Chart: Token efficiency comparison across programming languages]

Update: A lot of people asked about APL. I reran on a smaller set of like-for-like coding tasks - it came 4th at 110 tokens. Turns out APL's famous terseness isn't a plus for LLMs: the tokenizer is badly optimised for its symbol set, so all those unique glyphs (⍳, ⍴, ⌽, etc.) end up as multiple tokens each.
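You can see the effect directly with the same tokenizer. The snippet below is illustrative: APL's `⍳` is a non-ASCII code point, so a byte-pair tokenizer first splits it into UTF-8 bytes, and a single glyph can end up costing several tokens, while J's roughly equivalent ASCII spelling stays compact.

```typescript
import { AutoTokenizer } from "@xenova/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt-4");

// APL: the single glyph ⍳ spans multiple UTF-8 bytes, hence multiple tokens.
console.log(tokenizer.encode("⍳10").length);
// J: the roughly equivalent ASCII primitive tokenizes far more compactly.
console.log(tokenizer.encode("i.10").length);
```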

Update 2: A reader reached out about J - a language I'd never heard of. It's an array language like APL but uses ASCII instead of special symbols. It dominates at just 70 tokens on average, nearly half of Clojure's 109. Array languages can be extremely token-efficient when they avoid exotic symbol sets. If token efficiency turns out to be a key driver, this is perhaps a very interesting way for languages to evolve.

There was a very meaningful gap of 2.6x between C (the least token efficient language I compared) and Clojure (the most efficient).

Unsurprisingly, dynamic languages were much more token efficient (not having to declare any types saves a lot of tokens) - though JavaScript was the most verbose of the dynamic languages analysed.

What did surprise me though was just how token efficient some of the functional languages like Haskell and F# were - barely less efficient than the most efficient dynamic languages. This is no doubt due to their very effective type inference systems. I think using typed languages with LLMs has an awful lot of benefits - not least because the code can be compiled for rapid feedback on syntax errors or hallucinated methods. With an LSP it becomes even more helpful.
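The mechanism is easy to illustrate. The article's numbers are for Haskell and F#, but TypeScript's inference shows the same trade-off: the annotated and inferred versions below are equally type-safe, yet the annotations spend tokens restating what the compiler already knows.

```typescript
// Explicit annotations: extra tokens repeating what the compiler can deduce.
const annotated: Array<{ name: string; score: number }> = [
  { name: "ada", score: 9 },
  { name: "grace", score: 10 },
];
const annotatedScores: number[] = annotated.map((row): number => row.score);

// Inferred: same types, same safety, fewer tokens.
const inferred = [
  { name: "ada", score: 9 },
  { name: "grace", score: 10 },
];
const scores = inferred.map((row) => row.score); // inferred as number[]
```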

Assuming 80% of your context window is code reads, edits and diffs, using Haskell or F# would potentially result in a significantly longer development session than using Go or C#.
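As a back-of-envelope check on that claim - all figures below are illustrative assumptions rather than measurements, apart from the article's per-task averages for J (70) and Clojure (109):

```typescript
// Hypothetical numbers: a 200k-token context window, 80% of it spent on code.
const contextWindow = 200_000;
const codeBudget = contextWindow * 0.8;

// How many "task-sized" chunks of code fit, given an average tokens-per-task?
const chunksThatFit = (avgTokensPerTask: number) =>
  Math.floor(codeBudget / avgTokensPerTask);

console.log(chunksThatFit(70));  // J's average from the article: 2285 chunks
console.log(chunksThatFit(109)); // Clojure's average: 1467 chunks
// A 2.6x verbosity gap (C vs Clojure) shrinks the budget by the same factor.
```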

It's really interesting to me that we're in this strange future where we have petaflops of compute, yet code verbosity might actually matter because of our 'small' context windows. LLMs continue to break my mental model of how we should be looking at software engineering.
