Various locale mismatch scenarios in Windows clipboard text format synthesis

Original link: https://devblogs.microsoft.com/oldnewthing/20251211-37/?p=111858

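## Unicode and clipboard conversion: summary

The Windows clipboard uses the `CF_LOCALE` format, which is derived from the active keyboard layout, to handle conversions between Unicode and the 8-bit ANSI/OEM code pages. This raises questions about correctness, particularly for languages such as Hebrew. Text copied as `CF_UNICODETEXT` keeps its original characters when it is read back as Unicode. If it is read as `CF_TEXT`, it is converted to the system's ANSI code page (for example, 1252 for US English), which is usually correct for a program that expects ANSI. Storing Hebrew in ANSI code page 1252 is problematic because that code page has no Hebrew characters. Historically, all programs used the same ANSI/OEM code pages, which simplified conversion. Now, with `activeCodePage`, the reading and writing programs may disagree about the code page, and the result is mojibake (incorrectly displayed characters). Some conversions were updated to honor `activeCodePage`, but others were not, owing to complexity and compatibility risk. Recent testing revealed unexpected behavior in the `CF_TEXT`-to-`CF_OEMTEXT` conversion, which hints at further complexity to be explored later: a test case expected a character to map to 'D' in OEM code page 437 but received a completely different character.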

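Hacker News discussion: Various locale mismatch scenarios in Windows clipboard text format synthesis (devblogs.microsoft.com/oldnewthing), 6 points, submitted by ibobev, 1 comment.

akersten: I don't know what it would take to get rid of all this OEM LCID 1252 ANSI nonsense (well, only on Windows), but if I were in charge of "making sure developers want to work on Win32 rather than on any other sane Unicode platform," I would make it my top priority. Whatever imagined problem is solved by tagging clipboard text with some magic locale indicator is surely less important than being able to interoperate Unicode characters seamlessly between programs, without having to read two blog posts.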

Original text

So far, we’ve learned that the conversion between Unicode and the 8-bit ANSI and OEM code pages is performed with the assistance of the CF_LOCALE clipboard format, which itself comes from the active keyboard layout. We left with the question of whether this is the right thing, giving as an example the case of highlighting some text in Hebrew and copying it to the clipboard. Shouldn’t that be set with a Hebrew LCID?

First of all, you have to specify what you mean by “copy it to the clipboard.” Suppose the English-language user selected some Hebrew text and the program set it to the clipboard as CF_UNICODETEXT with a Hebrew LCID. A program which reads the CF_UNICODETEXT will read the original Unicode text, with Hebrew characters intact. The LCID plays no role since no conversion was performed. So in the case where the string was placed as Unicode and retrieved as Unicode, everything is fine.

If the string were placed as Unicode but read as CF_TEXT, the retrieving program will get the string translated to code page 1252, since that is the ANSI code page used by the US-English LCID. Is this the correct code page? Well, if the retrieving program is using CF_TEXT, then it is a program that uses the 8-bit ANSI character set as its string encoding, and if you’re running on a US-English system, then the 8-bit ANSI character set is code page 1252. So translating the Hebrew text to ANSI via code page 1252 is correct. You need to translate the string into the ANSI code page that the retrieving program is using.

Conversely, if the Hebrew string were placed on the clipboard as 8-bit ANSI in code page 1252, then… wait, that’s a trick question! Code page 1252 doesn’t have any Hebrew characters! If a program uses the US-English 8-bit ANSI character set, it cannot represent Hebrew characters at all, so the scenario itself is flawed: There can’t be any Hebrew text on the screen to be selected since the program has no way of displaying it.

Now, I guess it could be possible if a program internally supported enough Unicode to display Hebrew characters, but still chose to put text on the clipboard in ANSI format. But in that case, it would be putting question marks on the clipboard since there are no Hebrew characters in code page 1252. Any program that does this intentionally is clearly being pathological: Why do all the work to display characters in Unicode, yet copy those characters to the clipboard in 8-bit ANSI?

But wait, let’s rewind to a simpler scenario where there are no character set conversions at all. A program sets text on the clipboard in 8-bit ANSI, and another program reads it. If we consult our table, we see that the entry for this is “N/A”: There is no conversion. This holds true even if the program that put the text on the clipboard and the program that reads the text from the clipboard disagree on what the 8-bit ANSI code page is.

Prior to the introduction of the activeCodePage manifest declaration, the identity of the 8-bit ANSI code page was the same for all applications running in the same desktop. There was no opportunity for mismatch, so if one program put the text on the clipboard in 8-bit ANSI, and another read it out in 8-bit ANSI, they necessarily agreed on what the 8-bit ANSI code page was, since there was only one. But now that we have the ability for different programs to have a different value for the 8-bit ANSI code page, this nop-transformation will result in mojibake if the reader and writer have different ideas about what the 8-bit ANSI code page is.

You have the same problem with the AnsiToOem conversion: Historically, all programs agreed on what the 8-bit ANSI and 8-bit OEM code pages are, so the system maintains a single “ANSI⇆OEM” conversion table that is shared by all processes. But now that programs can choose (indirectly) their ANSI and OEM code pages, you have a problem if those choices don’t match those the system would have chosen.

The people who added activeCodePage support hooked it up to the GetACP() and GetOEMCP() functions, as well as to the A-suffixed functions which convert their 8-bit ANSI string parameters to Unicode before forwarding the result to the W-suffixed functions. But there are other places that didn’t get updated because doing so would require larger architectural changes, would affect performance of programs that didn’t use the activeCodePage feature, would introduce regression risk, and could lead to compatibility problems. Not saying that they couldn’t have done it, but it would have taken longer, and maybe it’s better to have a good-enough feature than a perfect one.

While doing fact-checking on this series of articles, I wrote some test programs that tried to trigger the CF_TEXT-to-CF_OEMTEXT conversion, and they didn’t behave as I expected.

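// Note: Test program doesn't do error-checking.

#include <windows.h>

// Put the ANSI string "\xD0\x00" on the clipboard,
// with the locale 1049 (ru-ru).
int main()
{
    // Assumption: the original listing leaves hwnd undefined. A
    // message-only window is created here to act as clipboard owner,
    // because SetClipboardData fails when the owner is NULL.
    HWND hwnd = CreateWindowExW(0, L"STATIC", NULL, 0, 0, 0, 0, 0,
                                HWND_MESSAGE, NULL, NULL, NULL);

    if (OpenClipboard(hwnd)) {
        EmptyClipboard();

        // Put an ANSI string on the clipboard.
        HGLOBAL glob = GlobalAlloc(GMEM_MOVEABLE, 2);
        PSTR message = (PSTR)GlobalLock(glob);
        message[0] = (char)0xD0;
        message[1] = 0x00;
        GlobalUnlock(glob);
        SetClipboardData(CF_TEXT, glob);

        // Mark it as locale 0x0419 = 1049 = ru-ru
        glob = GlobalAlloc(GMEM_MOVEABLE, sizeof(LCID));
        *(LCID*)GlobalLock(glob) = 0x0419;
        GlobalUnlock(glob);
        SetClipboardData(CF_LOCALE, glob);

        CloseClipboard();
    }
}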

And here’s the program to read the string back out in the OEM code page.

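#include <windows.h>
#include <stdio.h>

int main()
{
    // Reading does not require an owner window, so NULL suffices here.
    HWND hwnd = NULL;

    if (OpenClipboard(hwnd)) {
        // Requesting CF_OEMTEXT makes the system synthesize it from CF_TEXT.
        HGLOBAL glob = GetClipboardData(CF_OEMTEXT);
        PSTR message = (PSTR)GlobalLock(glob);
        // Print the first byte as two hex digits; the cast avoids sign extension.
        printf("%02x\n", (unsigned char)message[0]);
        GlobalUnlock(glob);

        CloseClipboard();
    }
}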

I ran this on a US-English system, so the LCID is 0x0409 = 1033, the ANSI code page is 1252, and the OEM code page is 437. The character D0 in code page 1252 is Ð = U+00D0. This character does not exist in code page 437, so AnsiToOem uses the best-fit character D = U+0044, which is in position 44 in code page 437.

When I ran this program, I expected the CF_OEMTEXT string to have the byte 44, but it didn’t. It had the byte 90. We will start unraveling this mystery next time.
