How does Windows synthesize CF_UNICODETEXT from CF_TEXT and vice versa?

Original link: https://devblogs.microsoft.com/oldnewthing/20251209-00/?p=111854

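Windows synthesizes the text clipboard formats CF_TEXT, CF_OEMTEXT, and CF_UNICODETEXT from one another. Introducing CF_UNICODETEXT adds four new conversions: to and from CF_TEXT, and to and from CF_OEMTEXT. These conversions rely on the CF_LOCALE format, which stores a locale identifier (LCID) representing a language and its regional settings. The LCID matters because both the ANSI and OEM code pages can be derived from it. Windows uses `LOCALE_IDEFAULTANSICODEPAGE` for conversions to and from CF_TEXT and `LOCALE_IDEFAULTCODEPAGE` for conversions to and from CF_OEMTEXT. In other words, the system uses the LCID to pick the code page for each conversion, so no separate code page information has to be recorded. The resulting table of conversions raises further questions, which the series takes up in later installments.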


Original text

Last time, we started our exploration of how Windows synthesizes text clipboard formats by looking at the conversion between CF_OEMTEXT and CF_TEXT. Today, we’ll look at what happens when CF_UNICODETEXT enters the picture.

The introduction of CF_UNICODETEXT means that we now have three clipboard text formats, and therefore six possible conversions. The four new conversions are

  • CF_UNICODETEXT to/from CF_TEXT.
  • CF_UNICODETEXT to/from CF_OEMTEXT.

These conversions are done with the assistance of the CF_LOCALE clipboard format, which contains an LCID, which is a 32-bit integer that encodes a primary language (such as German), a sublanguage (such as Swiss-German), and a sort rule (such as phone book). None of these details are directly relevant to character set conversion. The locale is used because both the ANSI and OEM code pages can be derived from the locale, so it’s only one value that needs to be recorded.¹
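To make the “32-bit integer” part concrete, here is a minimal sketch of how an LCID splits into those three fields. It is a portable model of the bit layout behind the Windows macros LANGIDFROMLCID, PRIMARYLANGID, SUBLANGID, and SORTIDFROMLCID; the names `LcidParts` and `split_lcid` are made up for this illustration.

```python
from typing import NamedTuple


class LcidParts(NamedTuple):
    primary_language: int  # low 10 bits of the language ID
    sublanguage: int       # high 6 bits of the language ID
    sort_id: int           # 4-bit sort rule stored above the language ID


def split_lcid(lcid: int) -> LcidParts:
    """Split an LCID into its bit fields (hypothetical helper)."""
    lang_id = lcid & 0xFFFF                # same bits as LANGIDFROMLCID
    return LcidParts(
        primary_language=lang_id & 0x3FF,  # same bits as PRIMARYLANGID
        sublanguage=lang_id >> 10,         # same bits as SUBLANGID
        sort_id=(lcid >> 16) & 0xF,        # same bits as SORTIDFROMLCID
    )


# 0x00010407 is German (Germany) with the phone book sort order.
print(split_lcid(0x00010407))
# 0x0807 is German (Switzerland) with the default sort order.
print(split_lcid(0x0807))
```

Both sample values share primary language 0x07 (German) and differ only in sublanguage and sort rule, which is why none of these fields says anything directly about character sets.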

The system converts to/from CF_UNICODETEXT via the code page obtained from the LCID:

  • LOCALE_IDEFAULTANSICODEPAGE when converting to/from CF_TEXT.
  • LOCALE_IDEFAULTCODEPAGE when converting to/from CF_OEMTEXT.
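The selection rule in those two bullets can be sketched as a lookup. On Windows the code page numbers come from GetLocaleInfo; the table below is a hand-written stand-in covering only four locales, and `LOCALE_CODE_PAGES` and `code_page_for` are hypothetical names. Ignoring the sort bits of the LCID is a simplification of this sketch, not a statement about what the system does.

```python
# Illustrative subset only. On Windows these numbers are reported by
# GetLocaleInfo for the two locale attributes named in the comments.
LOCALE_CODE_PAGES = {
    # language ID: (LOCALE_IDEFAULTANSICODEPAGE, LOCALE_IDEFAULTCODEPAGE)
    0x0409: (1252, 437),  # English (United States)
    0x0407: (1252, 850),  # German (Germany)
    0x0419: (1251, 866),  # Russian (Russia)
    0x0411: (932, 932),   # Japanese (Japan)
}


def code_page_for(lcid: int, clipboard_format: str) -> int:
    """Pick the code page used to convert between CF_UNICODETEXT and
    the given 8-bit clipboard format (hypothetical helper)."""
    ansi_cp, oem_cp = LOCALE_CODE_PAGES[lcid & 0xFFFF]  # sort bits ignored here
    if clipboard_format == "CF_TEXT":
        return ansi_cp
    if clipboard_format == "CF_OEMTEXT":
        return oem_cp
    raise ValueError(f"no code page applies to {clipboard_format}")


for fmt in ("CF_TEXT", "CF_OEMTEXT"):
    print(fmt, code_page_for(0x0407, fmt))
```

One recorded LCID yields two code pages, one for each 8-bit format.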

Putting all of this into a chart gives us

| To ↓ / From → | CF_TEXT | CF_OEMTEXT | CF_UNICODETEXT |
|---|---|---|---|
| CF_TEXT | nop | OemToAnsi | WC2MB(ANSI CP) |
| CF_OEMTEXT | AnsiToOem | nop | WC2MB(OEM CP) |
| CF_UNICODETEXT | MB2WC(ANSI CP) | MB2WC(OEM CP) | nop |

In the above table, “WC2MB” and “MB2WC” are short for WideCharToMultiByte and MultiByteToWideChar, and “ANSI CP” means “the code page reported by calling GetLocaleInfo with the LCID in the CF_LOCALE clipboard format, and the LOCALE_IDEFAULTANSICODEPAGE locale attribute”. Similarly for “OEM CP”, using LOCALE_IDEFAULTCODEPAGE instead of LOCALE_IDEFAULTANSICODEPAGE.
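As a rough model of the cells that involve CF_UNICODETEXT, the sketch below uses Python's built-in code page codecs in place of WideCharToMultiByte and MultiByteToWideChar. Several things here are assumptions of the sketch: the function name `synthesize`, passing the two code pages in as plain numbers, representing CF_UNICODETEXT as a Python `str` without a null terminator, and replacing unmappable characters with `?`. The OemToAnsi and AnsiToOem cells belong to the previous installment and are deliberately left out.

```python
from typing import Union

UNICODE = "CF_UNICODETEXT"


def synthesize(data: Union[str, bytes], src: str, dst: str,
               ansi_cp: int, oem_cp: int) -> Union[str, bytes]:
    """Model one cell of the chart: convert data from src to dst."""
    if src == dst:
        return data  # diagonal of the chart: nop
    code_page = {"CF_TEXT": ansi_cp, "CF_OEMTEXT": oem_cp}
    if src == UNICODE:
        # Stand-in for WC2MB: Unicode to bytes in the target's code page.
        return data.encode(f"cp{code_page[dst]}", errors="replace")
    if dst == UNICODE:
        # Stand-in for MB2WC: bytes in the source's code page to Unicode.
        return data.decode(f"cp{code_page[src]}", errors="replace")
    raise NotImplementedError("OemToAnsi/AnsiToOem are not modeled here")


# German (Germany) reports ANSI code page 1252 and OEM code page 850.
text = "Gr\u00fc\u00dfe"
as_ansi = synthesize(text, UNICODE, "CF_TEXT", ansi_cp=1252, oem_cp=850)
as_oem = synthesize(text, UNICODE, "CF_OEMTEXT", ansi_cp=1252, oem_cp=850)
print(as_ansi, as_oem)

# Decoding with the matching code page recovers the original text.
print(synthesize(as_oem, "CF_OEMTEXT", UNICODE, ansi_cp=1252, oem_cp=850))
```

The same Unicode string produces different bytes for CF_TEXT and CF_OEMTEXT, and decoding those bytes with the wrong code page would garble the non-ASCII characters.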

That’s great, we have all the answers in a table. But that table raises more questions!

We’ll start answering questions next time.

¹ This CF_LOCALE clipboard format existed in 16-bit Windows as well, but it wasn’t really used for anything. The people who added Unicode support to the clipboard realized, “Hey, the thing we need is already here! We just have to start using it.”
