An Introduction to Writing Systems and Unicode

Original link: https://r12a.github.io/scripts/tutorial/part2

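Unicode characters outside the Basic Multilingual Plane (BMP) are handled differently by each UTF encoding. UTF-32 stores such a character in 4 bytes that correspond directly to its code point value, and UTF-8 also uses a 4-byte sequence, while UTF-16 relies on a system of "surrogate pairs". Because UTF-16 uses 16-bit "code units" (maximum value 65,535), it cannot represent higher code points directly. Instead, it uses two reserved ranges, the high surrogates and the low surrogates, to *represent* these characters: a supplementary character is encoded as a high surrogate *followed by* a low surrogate. The two halves of a surrogate pair should always appear together and must not be split during text processing (line wrapping, highlighting, counting). You should not normally find lone surrogates or surrogate pairs in UTF-8 or UTF-32. In essence, UTF-16 uses this workaround to represent the full range of Unicode characters within its 16-bit structure.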

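Hacker News: An Introduction to Writing Systems and Unicode (r12a.github.io), 8 points, posted by mariuz 1 hour ago, 2 comments

ovciokko, 10 minutes ago: The text in the image claims to be Simplified Chinese, but it does not actually follow the standard glyph forms defined by the Chinese government; it looks more like the standard glyph forms used for Japanese kanji.

ks2048, 21 minutes ago: This site has long been a gem for Unicode and language-related topics. Linking to the top-level page would be just as good: https://r12a.github.io/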

Original text

In UTF-32, characters in the supplementary character range are encoded in bytes that correspond directly to the code point values. For example, U+10330 GOTHIC LETTER AHSA is stored in big-endian UTF-32 as the byte sequence 00 01 03 30. In UTF-8, the character would also be represented using a 4-byte sequence, F0 90 8C B0.
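The byte sequences quoted above can be checked with a short Python sketch. It assumes big-endian byte order for UTF-32 (the order in which the bytes are written in the text); the variable names are illustrative only.

```python
# GOTHIC LETTER AHSA, a supplementary-plane character (U+10330).
ahsa = "\U00010330"

# Big-endian UTF-32: the four bytes are simply the code point value.
utf32_bytes = ahsa.encode("utf-32-be")

# UTF-8: also four bytes, but the code point bits are spread across
# a lead byte and three continuation bytes.
utf8_bytes = ahsa.encode("utf-8")

print(utf32_bytes.hex(" "))  # 00 01 03 30
print(utf8_bytes.hex(" "))   # f0 90 8c b0
```

Using the `utf-32` codec without the `-be` suffix would prepend a byte order mark and use the platform's native order, so the output would not match the sequence shown in the text.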

UTF-16, however, represents all characters using 16-bit (2-byte) 'code units', and you can't express 0x10330 (decimal 66,352) as a single 16-bit value (the maximum is decimal 65,535). To get around this, UTF-16 instead uses two special, adjacent 1024-code-point ranges in Unicode referred to as high surrogates (U+D800 to U+DBFF) and low surrogates (U+DC00 to U+DFFF). The combination of a high surrogate followed by a low surrogate, when interpreted by the character encoding algorithm used for UTF-16, points to a specific character in a supplementary plane. For example, the Gothic AHSA is represented in big-endian UTF-16 as the byte sequence D8 00 DF 30, where D800 is the code point of a high surrogate, and DF30 is the code point of a low surrogate.
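As a rough illustration of how a supplementary code point maps onto a high and low surrogate, here is a minimal Python sketch of the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function name `to_surrogate_pair` is hypothetical and not taken from the article.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # fits in 20 bits
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10330)
print(f"{high:04X} {low:04X}")  # D800 DF30

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = "\U00010330".encode("utf-16-be")
print(encoded.hex(" "))  # d8 00 df 30
```

Each surrogate range holds 1024 values, so the pairs cover 1024 × 1024 = 1,048,576 code points, which is exactly the span from U+10000 to U+10FFFF.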

You should never encounter a single surrogate character – they should always appear as high+low surrogate pairs. Also, pairs should not be split when wrapping or highlighting text, counting characters, displaying unknown character glyphs, and so on. You should also never normally see surrogate character code points in UTF-8 or UTF-32.
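One way to check these rules in code is to scan a sequence of UTF-16 code units for surrogates that are not part of a well-formed high+low pair. The sketch below is an illustration under that assumption; the helper names `utf16_code_units` and `find_lone_surrogates` are hypothetical.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of text as encoded in big-endian UTF-16."""
    data = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def find_lone_surrogates(units: List[int]) -> List[int]:
    """Return indices of surrogate code units not in a valid high+low pair."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the whole pair so it is never split
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            lone.append(i)
        i += 1
    return lone


units = utf16_code_units("a\U00010330b")
print([f"{u:04X}" for u in units])            # ['0061', 'D800', 'DF30', '0062']
print(find_lone_surrogates(units))            # []
print(find_lone_surrogates([0x61, 0xD800]))   # [1]
```

The example string is three characters long but occupies four code units, which is why counting, wrapping and highlighting need to step over a pair as a single unit.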

