Markdown Emphasis Issues in CJK: Analyzing CommonMark's Delimiter Rules

Original link: https://hackers.pub/@yurume/019b912a-cc3b-7e45-9227-d08f0d1eafe8

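## CommonMark Markdown rendering issue

A long-standing flaw in the CommonMark specification causes Markdown to render incorrectly, particularly for bold (**) markers, and LLM-generated content has exposed the problem. It stems from CommonMark's "delimiter run" rules, which were designed to simplify parsing but fail to account for real-world usage, especially in Korean, Japanese, and Chinese (CJK).

Under these rules, whether an emphasis marker can open or close emphasis depends solely on whether its delimiter run is "left-flanking" or "right-flanking", and that is decided by the characters immediately adjacent to it (whitespace or punctuation). The problem arises when a closing bold marker is preceded by punctuation, such as a parenthesis, and followed by a letter: the marker is not treated as a closer, so the text does not render as bold.

The rule was intended to support nested emphasis, but it causes considerable trouble in CJK languages, where spaces are rarely used and punctuation often appears within words. The author argues that the benefit of nested emphasis does not justify the inconvenience, particularly now that LLMs widely emit Markdown that reflects natural usage, which brings a previously latent problem to the surface.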

Original text

As Markdown has become the standard for LLM outputs, we are now forced to witness a common and unsightly mess where Markdown emphasis markers (**) remain unrendered and exposed, as seen in the image. This is a chronic issue with the CommonMark specification, one that I reported about ten years ago, but it has been left without any solution to this day.

The technical details of the problem are as follows: In an effort to limit parsing complexity during the standardization process, CommonMark introduced the concept of "delimiter runs." These runs are assigned properties of being "left-flanking" or "right-flanking" (or both, or neither) depending on their position. According to these rules, a bolded segment must start with a left-flanking delimiter run and end with a right-flanking one. The crucial point is that whether a run is left- or right-flanking is determined solely by the immediate surrounding characters, without any consideration of the broader context. For instance, a left-flanking delimiter must be in the form of **<ordinary character>, <whitespace>**<punctuation>, or <punctuation>**<punctuation>. (Here, "ordinary character" refers to any character that is not whitespace or punctuation.) The first case is presumably intended to allow markers embedded within a word, like **마크다운**은, while the latter cases are meant to provide limited support for markers placed before punctuation, such as in 이 **"마크다운"** 형식은. The rules for right-flanking are identical, just in the opposite direction.
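The flanking rules described above can be sketched in a few lines of Python. This is an illustrative reading of the rules, not code from CommonMark or any parser: the function names are made up for this example, the start and end of the string are treated as whitespace, and "punctuation" is assumed to mean Unicode general categories P and S (older versions of the spec used a narrower definition).

```python
import unicodedata


def is_whitespace(ch):
    # None stands for the start or end of the text, which counts as whitespace here.
    return ch is None or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_punctuation(ch):
    # Assumption: punctuation means Unicode categories P* and S*.
    return ch is not None and unicodedata.category(ch)[0] in "PS"


def flanking(text, start, end):
    """Classify the delimiter run text[start:end] as (left_flanking, right_flanking)."""
    before = text[start - 1] if start > 0 else None
    after = text[end] if end < len(text) else None
    # Left-flanking: not followed by whitespace, and either not followed by
    # punctuation, or preceded by whitespace or punctuation.
    left = not is_whitespace(after) and (
        not is_punctuation(after) or is_whitespace(before) or is_punctuation(before)
    )
    # Right-flanking: the same test with "before" and "after" swapped.
    right = not is_whitespace(before) and (
        not is_punctuation(before) or is_whitespace(after) or is_punctuation(after)
    )
    return left, right


word = "**마크다운**은"
quoted = '이 **"마크다운"** 형식은'
for s in (word, quoted):
    first, last = s.index("**"), s.rindex("**")
    print(s, flanking(s, first, first + 2), flanking(s, last, last + 2))
```

Only the two characters touching the run are ever inspected, which is the "no broader context" property the paragraph points out.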

However, when you try to parse a string like **마크다운(Markdown)**은 using these rules, it fails. The closing ** is preceded by punctuation (a parenthesis), so it must be followed by whitespace or another punctuation mark to be considered right-flanking. Since it is followed by an ordinary letter (은), it is not recognized as right-flanking and thus fails to close the emphasis.
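To trace this failure concretely, the following sketch scans a string for runs of `*` and reports how each one is classified. It is a simplified illustration under stated assumptions: `classify_runs` and its helpers are hypothetical names, punctuation is taken to be Unicode categories P and S, and backslash escapes and code spans are ignored.

```python
import re
import unicodedata


def _ws(ch):
    # None marks the start or end of the text, treated as whitespace.
    return ch is None or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def _punct(ch):
    # Assumption: punctuation means Unicode categories P* and S*.
    return ch is not None and unicodedata.category(ch)[0] in "PS"


def classify_runs(text):
    """Return (offset, run, left_flanking, right_flanking) for each run of '*'."""
    results = []
    for m in re.finditer(r"\*+", text):
        before = text[m.start() - 1] if m.start() > 0 else None
        after = text[m.end()] if m.end() < len(text) else None
        left = not _ws(after) and (not _punct(after) or _ws(before) or _punct(before))
        right = not _ws(before) and (not _punct(before) or _ws(after) or _punct(after))
        results.append((m.start(), m.group(), left, right))
    return results


for offset, run, left, right in classify_runs("**마크다운(Markdown)**은"):
    print(offset, run, "left-flanking:", left, "right-flanking:", right)
```

Both runs come out left-flanking only. The second run sits between `)` and `은`, so it cannot close the emphasis opened by the first, and the markers are left as literal text.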

As explained in the CommonMark spec, the original intent of this rule was to support nested emphasis, like **this **way** of nesting**. Since users typically don't insert spaces inside emphasis markers (e.g., **word **), the spec attempts to resolve ambiguity by declaring that markers adjacent to whitespace can only function in a specific direction. However, in CJK (Chinese, Japanese, Korean) environments, either spaces are completely absent or (as in Korean) punctuation is commonly used within a word. Consequently, there are clear limits to inferring whether a delimiter is left- or right-flanking based on these rules. Even if we were to allow <ordinary character>**<punctuation> to be interpreted as left-flanking to accommodate cases like **마크다운(Markdown)**은, how would we handle something like このような**[状況](...)は**?

In my view, the utility of nested emphasis is marginal at best, while the frustration it causes in CJK environments is significant. Furthermore, because LLMs generate Markdown based on how people would actually use it, rather than strictly following the design intent of CommonMark, this latent inconvenience that users have long felt is now being brought directly to the surface.
