A years-long Turkish alphabet bug in the Kotlin compiler

Original link: https://sam-cooper.medium.com/the-country-that-broke-kotlin-84bdd0afb237

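Muhammed Demirbaş found a subtle but far-reaching bug in the Kotlin compiler that stemmed from the way it handled XML message tags. The compiler parses tags such as `<INFO/>` and `<ERROR/>` and converts their names to lower case so that they can be looked up in a predefined category map. The problem is that Kotlin's `toLowerCase()` function behaves differently depending on the computer's locale. In Turkish, the upper-case letter 'I' lower-cases to 'ı' (dotless i) rather than 'i' (dotted i). On a Turkish system, `INFO` therefore becomes "ınfo", which does not match the "info" key that the category map expects, and the result is an "Unknown compiler message tag" error. Although the bug was found in 2016, few people reported it and it was given low priority. Two years later, when Kotlin coroutines were released, it resurfaced with a much larger impact, and fixing it became more important. The root cause lies in the complexity of Unicode and locale-specific lower-case conversion.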

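The long-standing Kotlin compiler bug related to the Turkish alphabet was recently written up in detail. It stems from incorrect lower-casing of characters inside the compiler, particularly when Turkish case rules apply, and it produced errors that were hard to make sense of. Several developers shared similar experiences in Java, where they had to comb through code in review and pass `Locale.ENGLISH` explicitly when converting enums. One commenter noted that the earlier migration to the new methods may have annoyed Android developers but ultimately improved behavior for Turkish users. Separately from the Turkish alphabet bug, another user reported an unrelated problem in which `List.isEmpty()` always returned `true` in Kotlin script files, which points to potential instability in the IDE. The discussion underscores the challenges of internationalization and the subtle bugs that can arise from the complexities of character encoding.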

Original text

Muhammed Demirbaş couldn’t have been more spot on in his investigation and assessment of the compiler bug. Since Kotlin is open source, he was able to search the compiler’s code for the exact line of code where that “Unknown compiler message tag” string appears:

val qNameLowerCase = qName.toLowerCase()
var category: CompilerMessageSeverity? = CATEGORIES[qNameLowerCase]
if (category == null) {
    messageCollector.report(ERROR, "Unknown compiler message tag: $qName")
    category = INFO
}

So what does this code do, and why does it sometimes go wrong?

The code is part of a class named CompilerOutputParser, and is responsible for reading XML files containing messages from the Kotlin compiler. Those files look something like this:

<MESSAGES>
    <INFO path="src/main/Kotlin/Example.kt" line="1" column="1">
        This is a message from the compiler about a line of code.
    </INFO>
</MESSAGES>

At the time, the tags in this file were named in all-caps: <INFO/>, <ERROR/>, and so on (source: GitHub), like the HTML 1.0 webpages your grandpa used to write.

In the Kotlin code we just saw, qName is the name of an XML tag that we’re parsing from this file. If we’re looking at an <INFO/> tag, the qName is “INFO.”

To determine what the message means, the CompilerOutputParser next looks up that string in its CATEGORIES map to find its corresponding CompilerMessageSeverity enum entry. But wait: the keys in the CATEGORIES map are lower case! (source: GitHub)

val CATEGORIES = mapOf(
    "error" to CompilerMessageSeverity.ERROR,
    "info" to CompilerMessageSeverity.INFO,
    …
)

Instead of searching for “INFO,” we need to search for “info.” That’s why the code we looked at calls qName.toLowerCase() before looking it up in the CATEGORIES map. Here’s the code again, or at least the relevant lines:

val qNameLowerCase = qName.toLowerCase()
var category: CompilerMessageSeverity? = CATEGORIES[qNameLowerCase]

And that’s where the bug sneaks in.

If your computer is configured in English, "INFO".toLowerCase() is "info", just like we wanted.
But if your computer is configured in Turkish, "INFO".toLowerCase() turns out to be "ınfo".

Notice the difference? In the Turkish version, the lower case letter ‘ı’ has no dot above it.

The tiny discrepancy might be hard for a human to spot, but to a computer, these are two completely different strings. The dotless "ınfo" string isn’t one of the keys in the CATEGORIES map, so the code fails to find the correct CompilerMessageSeverity for our <INFO/> tag, and complains that “INFO” must be a completely unknown category of message.
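The failed lookup can be reproduced in a few lines. What follows is a minimal sketch, not the compiler's actual code: the `Severity` enum and the `lookupSeverity` helper are hypothetical stand-ins, and the locale is passed in explicitly to simulate what the old `toLowerCase()` picked up implicitly from the machine's settings.

```kotlin
import java.util.Locale

// Hypothetical stand-in for the compiler's CompilerMessageSeverity enum.
enum class Severity { ERROR, INFO }

// Lower-case keys, mirroring the shape of the CATEGORIES map quoted above.
val CATEGORIES: Map<String, Severity> = mapOf(
    "error" to Severity.ERROR,
    "info" to Severity.INFO,
)

// Lower-cases the tag name with the rules of the given locale, then looks it up.
// Returns null when the lower-cased name is not a key in the map.
fun lookupSeverity(qName: String, locale: Locale): Severity? =
    CATEGORIES[qName.lowercase(locale)]
```

With these definitions, `lookupSeverity("INFO", Locale.ENGLISH)` returns `Severity.INFO`, while `lookupSeverity("INFO", Locale.forLanguageTag("tr-TR"))` returns `null`, which is the case that leads to the error message. A tag with no letter 'I' in it, such as "ERROR", is found under both locales.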

So why does calling toLowerCase() on a Turkish computer produce this strange result?

Muhammed already provided part of the answer in his reply to Mehmet Nuri’s forum post. Turkic languages have two versions of the letter ‘i’:

an ‘i’ with a dot, as in the word insan (human),
and a separate ‘ı’ without a dot, as in the word ırmak (river).

What’s more, the dotted/dotless distinction is also preserved in the upper case letters:

capital ‘i’ is ‘İ’, as in insan → İnsan,
and capital ‘ı’ is ‘I’, as in ırmak → Irmak.

That uppercase dotless ‘I’ is the same one we use in English. As a result, the single Unicode character I (U+0049) has two different lower case forms: dotted i (U+0069) in English, and dotless ı (U+0131) in Turkish.

For Kotlin’s toLowerCase() function, that’s a problem! When toLowerCase() sees an I character, which lower case form should it use? The lower case form of the Turkish word IRMAK should be ırmak, with no dot. But the lower case form of the English word INFO, which starts with exactly the same character, should be info, with a dot.
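To make the ambiguity concrete, here is a small sketch that prints code points and lower-cases text under a chosen language. The helper names `codePoints` and `lowerIn` are made up for this illustration; the only library calls are the standard `String.lowercase(Locale)` and `Locale.forLanguageTag`.

```kotlin
import java.util.Locale

// Formats every character of a string as a Unicode code point, e.g. "U+0049".
// Only intended for characters in the Basic Multilingual Plane, which covers
// all four letters discussed here.
fun codePoints(text: String): List<String> =
    text.map { "U+%04X".format(Locale.ROOT, it.code) }

// Lower-cases text using the case rules of the given BCP 47 language tag,
// such as "en" or "tr".
fun lowerIn(text: String, languageTag: String): String =
    text.lowercase(Locale.forLanguageTag(languageTag))
```

Here `codePoints(lowerIn("I", "en"))` gives `[U+0069]` and `codePoints(lowerIn("I", "tr"))` gives `[U+0131]`: the same input character, two different answers, depending only on which language's rules were requested.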

When you ask your computer to convert text to lower case, you should technically also specify the alphabet rules to use—English, Turkish, or something else entirely. But that’s a lot of hard work, so if you don’t specify, many systems — including, in those days, Kotlin’s toLowerCase() function — will just use the language settings you chose when you set up your computer. That’s why "INFO".toLowerCase() is "ınfo" when you run it on a Turkish machine, and that’s why IntelliJ installations in Turkey couldn’t match the Kotlin compiler’s <INFO/> messages to the lowercase "info" string they were expecting to see.
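The difference between relying on the machine's settings and naming the rules yourself can be sketched as follows. This is an illustration under stated assumptions, not the change the Kotlin team made: all three function names are hypothetical, and `withDefaultLocale` exists only so the effect can be observed without reconfiguring a computer.

```kotlin
import java.util.Locale

// Lower-cases with whatever default locale the JVM is currently using,
// which is how the article describes the old toLowerCase() behaving.
fun lowerWithDefaultLocale(text: String): String =
    text.lowercase(Locale.getDefault())

// Lower-cases with the locale-neutral root rules, so the result does not
// depend on how the machine is configured.
fun lowerInvariant(text: String): String =
    text.lowercase(Locale.ROOT)

// Runs a block with a temporarily changed JVM default locale and restores
// the previous default afterwards, even if the block throws.
fun <T> withDefaultLocale(locale: Locale, block: () -> T): T {
    val previous = Locale.getDefault()
    Locale.setDefault(locale)
    try {
        return block()
    } finally {
        Locale.setDefault(previous)
    }
}
```

Under a Turkish default, for example `withDefaultLocale(Locale.forLanguageTag("tr-TR")) { lowerWithDefaultLocale("INFO") }`, the result is "ınfo", whereas `lowerInvariant("INFO")` is "info" under any default. For identifiers that are not human-language text, such as XML tag names or enum names, fixed rules like `Locale.ROOT` are usually the safer choice.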

But in 2016, all of that was still just a bug ticket waiting to be worked on. Muhammed Demirbaş had identified the right place to start the search, but the YouTrack issue linked to his findings was just one of hundreds of tickets in the Kotlin project backlog. With only a tiny number of people reporting that they were affected by the bug, a more thorough investigation was never a priority.

That would all change with the release of coroutines two years later, when the unassuming little bug wormed its way even deeper into the foundations of the Kotlin compiler.
