丢失一百五十万行围棋代码
Losing 1½ Million Lines of Go

原始链接: https://www.tbray.org/ongoing/When/202x/2026/01/14/Unicode-Properties

## Quamina 正则表达式更新:驯服 Unicode 属性 本文详细介绍了将 Unicode 字符属性匹配(如 `\p{L}` 用于字母)添加到 Quamina 模式匹配软件的过程。挑战源于 Go 的 Unicode 库过时,促使直接解析 Unicode 字符数据库。 最初尝试预计算和序列化这些属性的自动机,导致了大量的代码生成(最初超过 77.5 万行,预计达到 150 万行)和性能问题——启动时间慢和 IDE 崩溃。这种方法被放弃,转而采用缓存系统:自动机现在在首次使用时计算,然后存储以提高效率,将速度从每秒 135 个模式提高到 4330 个模式。 作者还反思了潜在地利用 GenAI(如 Claude)来处理涉及到的常规任务,承认它有可能加速开发。虽然对更广泛的 GenAI 炒作持怀疑态度,但他们认识到它在代码预测方面的优势,并后悔由于设置时间和缺乏耐心而没有利用它。 随着此功能即将完成,Quamina 2.0——拥有完整的正则表达式支持——即将到来。然而,作者对社区贡献减少表示担忧,并希望在未来的开发中重振协作。

黑客新闻 新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交 登录 丢失150万行Go代码 (tbray.org) 8点 由 moks 1小时前 | 隐藏 | 过去 | 收藏 | 讨论 指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请YC | 联系 搜索:
相关文章

原文

Confession: My title is clickbait-y, this is really about building on the Unicode Character Database to support character-property regexp features in Quamina. Just halfway there, I’d already got to 775K lines of generated code so I abandoned that particular approach. Thus, this is about (among other things) avoiding those 1½M lines. And really only of interest to people whose pedantry includes some combination of Unicode, Go programming, and automaton wrangling. Oh, and GenAI, which (*gasp*) I think I should maybe have used.

Character property matching · I’m talking about regexp incantations like [\p{L}\p{Zs}\p{Nd}], which matches anything that Unicode classifies as a letter, a space, or a decimal number. (Of course, in Quamina “\” is “~for excellent reasons, so that reads [~p{L}~p{Zs}~p{Nd}].)

(I’m writing about this now because I just launched a PR to enable this feature. Just one more to go before I can release a new version of Quamina with full regexp support, yay.)

Finding the properties · To build an automaton that matches something like that, you have to find out what the character properties are. This information comes from the Unicode Character Database, helpfully provided online by the Unicode consortium. Of course, most programming languages have libraries that will help you out, and that includes Go, but I didn’t use it.

Unfortunately, Go’s library doesn’t get updated every time Unicode does. As of now, January 2026, it’s still stuck at Unicode 15.0.0, which dates to September 2023; the latest version is 17.0.0, last September. Which means there are plenty of Unicode characters Go doesn’t know about, and I didn’t want Quamina to settle for that.

So, I fetched and parsed the famous master file from www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt. Not exactly rocket science, it’s a flat file with ;-delimited fields, of which I only cared about the first and third. There are some funky bits, such as the pair of nonstandard lines indicating that the Han characters occur between U+4E00 and U+9FFF inclusive; but still not really taxing.

The output is, for each Unicode category, and also for each category’s complement (~P{L} matches everything that’s not a letter; note the capital P), a list of pairs of code points, each pair indicating a subset of the code space where that category applies. For example, here’s the first line of character pairs with category C.

    {0x0020, 0x007e}, {0x00a0, 0x00ac}, {0x00ae, 0x0377},

How many pairs of characters, you might wonder? There are 37 categories and it’s all over the place but adds up to a lot. The top three categories are L with 1,945 pairs, Ll at 664, and M at 563. At the other end are Zl and Zp, both with just 1. The total number of pairs is 14,811, and the generated Go code is a mere 5,122 lines.

Character-property automata · Turning these creations into finite automata was straightforward: I already had the code to handle regexps like [a-zA-Z0-9], logically speaking the same problem. But, um, it wasn’t fast. My favorite unit test, an exercise in sample-driven development with 992 regexps, suddenly started taking multiple seconds, and my whole unit-test suite expanded from around ten seconds to over twelve; since I tend to run the unit tests every time I take a sip of coffee or scratch my head or whatever, this was painful. And it occurred to me that it would be painful in practice to people who want for some good reason or another to load up a bunch of Unicode-property patterns into a Quamina instance.

So, I said to myself, I’ll just precompute all the automata and serialize them into code. And now we get to the title of this essay; my data structure is a bit messy and ad-hoc and just for the categories, before I got to the complement versions, I was generating 775K lines of code.

Which worked! But, it was 12M in size and while Go’s runtime is fast, there was a painful pause while it absorbed those data structures on startup. Also, opening the generated file regularly caused my IDE (Goland) to crash. And I was only halfway there. The whole approach was painful to work with so I went looking for Plan B.

The code that generates the automaton from the code point pairs is pretty well the simplest thing that could possibly work and it was easy to understand but burned memory like crazy. So I worked for a bit on making it faster and cheaper, but so far have found no low-hanging fruit.

I haven’t given up on that yet. But in the meantime, I remembered Computer Science’s general solution for all performance problems, by which I mean caching. So now, any Quamina instance will compute the automaton for a Unicode property the first time it’s used, then remember it. So now Quamina’s speed at adding Unicode-property regexps to an instance has increased from 135/second to 4,330, a factor of thirty and Good Enough For Rock-n-Roll.

It’s worth pointing out that while building these automata is a heavyweight process, Quamina can use them to match input messages at its typical rates, hundreds of thousands to millions per second. Sure, these automata are “wide”, with lots of branches, but they’re also shallow, since they run on UTF-8 encoded characters whose maximum length is four and average length is much less. Most times you only have to take one or two of those many branches to match or fail.

Should I have used Claude? · This particular segment of the Quamina project included some extremely routine programming tasks, for example fetching and parsing UnicodeData.txt, computing the sets of pairs, generating Go code to serialize the automata, reorganizing source files that had become bloated and misshapen, and writing unit tests to confirm the results were correct.

Based on my own very limited experience with GenAI code, and in particular after reading Marc Brooker’s On the success of ‘natural language programming’ and Salvatore (“antirez”) Sanfilippo’s Don't fall into the anti-AI hype, I guess I’ve joined the camp that thinks this stuff is going to have a place in most developers’ toolboxes.

I think Claude could have done all that boring stuff, including acceptable unit tests, way faster than I did. And furthermore got it right the first time, which I didn’t.

So why didn’t I use Claude? Because I don’t have the tooling set up and I was impatient and didn’t want to invest the time in getting all that stuff going and improving my prompting skills. Which reminds me of all the times I’ve been trying to evangelize other developers on a better way to do things and was greeted by something along the lines of “Fine, but I’m too busy right now, I’ll just going on doing things the way I already know how to.”

Does this mean I’m joining the “GenAI is the future and our investments will pay off!” mob? Not in the slightest. I still think it’s overpriced, overhyped, and mostly ill-suited to the business applications that “thought leaders” claim for it. That word “mostly” excludes the domain of code; as I said here, “It’s pretty obvious that LLMs are better at predicting code sequences than human language.”

And, as it turns out, the domain of Developer Tools has never been a Big Business by the standards of GenAI’s promoters. Nor will it ever be; there just aren’t enough of us. Also, I suspect it’ll be reasonably easy in the near future for open-source models and agents to duplicate the capabilities of Claude and its ilk.

Speaking personally, I can’t wait for the bubble to pop.

Quamina.next? · After I ship the numeric-quantifier feature, e.g. a{2-5}, Quamina’s regexp support will be complete and if no horrid bugs pop up I’ll pretty quickly release Quamina 2.0. Regexps in pattern-matching software are a qualitative difference-maker. After that I dunno, there are lots more interesting features to add.

Unfortunately, a couple years into my Quamina work, I got distracted by life and by other projects, and ignored it. One result is that so did the other people who’d made major contributions and provided PR reviews. I miss them and it’s less fun now. We’ll see.



联系我们 contact @ memedata.com