Regular expressions that work "everywhere"

Original link: https://www.johndcook.com/blog/2026/06/23/regex-everywhere/

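Regular expressions are well known to differ across platforms, which is frustrating when features you expect from Perl-heavy workflows fail in restricted environments. To ensure portability, especially when working on systems where you cannot install new software, it is best to rely on a standardized "common" subset of regular expressions. Although tools such as `sed`, `awk`, `grep`, and `Emacs` each have their quirks (for example, Emacs's reliance on backslashes for special characters, or `awk`'s distinct syntax for word boundaries), a core set of features is generally shared across these environments. This "common" subset includes:

* **Anchors and literals:** `.`, `^`, `$`, `[...]`, `[^...]`
* **Quantifiers and logic:** `*`, `+`, `?`, `{n,m}`, and alternation (`|`)
* **Capturing and references:** parentheses and backreferences (`\1`–`\9`)
* **Character classes:** `\w`, `\W`, `\s`, `\S`
* **Boundaries:** `\b`, `\B`

By sticking to this common foundation, you can write code that stays functional and portable across most standard Unix-like tools, with no extra dependencies.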


Original text

The most frustrating aspect of regular expressions is that implementations vary. Features supported in one tool may not be supported at all in another tool, or they may be supported with slightly different syntax.

I learned regular expressions in the context of Perl, a maximalist regex environment. This led to frustration when features I expect to work are missing [1]. One way around this is to use Perl analogs of other tools, but this is very non-standard. I want to be able to send colleagues and clients code that works out of the box.

As I mentioned in my post on computational survivalism, I occasionally need to work on computers that I cannot install software on. So a better approach is to identify a subset of regex features that work everywhere. The stricter your definition of “everywhere,” the less it includes. The strictest subset would be

literals
character classes […]
the special characters . * ^ $

A more relaxed definition of “everywhere” would be the tools you most care about. Currently the tools I most want to use with regular expressions are sed, awk, grep, and Emacs.

Awk as lowest common denominator

If you use the GNU versions of sed, awk, and grep, and use the -E option with sed and grep, then the list of common features is bigger. The regular expression features of the three tools are similar, and awk’s features are supported in the other tools, with one exception: word boundaries in awk are \< and \> rather than \b and \B.

I wrote about Awk’s regex features here.

Emacs as the oddball

Emacs supports analogs of most of awk’s regex features. However, the characters

    ( ) { } |

all require a backslash in front in order to act like their awk counterparts; + and ? are already operators in Emacs without a backslash. Also, the analog of \s and \S in awk is \s- and \S- in Emacs.

Instead of meaning space or nonspace, \s and \S in Emacs begin a (negated) character class, and one of those classes is - for space. But there are many others. For example, \s. stands for a punctuation character and \S. stands for a non-punctuation character.
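The Emacs differences are mechanical enough to sketch as a small translation function. The following is a minimal Python sketch, not a complete converter; the names `awk_to_emacs` and `_bracket_end` are hypothetical. It assumes that only ( ) { } | need a backslash added (Emacs treats + and ? as postfix operators without one), that \s and \S become \s- and \S-, and that bracket expressions can be copied unchanged.

```python
# Operators that are special in awk as-is but need a backslash in Emacs.
# Assumption: + and ? are left alone, since Emacs treats them as operators.
OPERATORS_NEEDING_BACKSLASH = set("(){}|")


def _bracket_end(pattern, start):
    """Return the index just past the bracket expression that begins
    at pattern[start] == '['."""
    j = start + 1
    if j < len(pattern) and pattern[j] == "^":
        j += 1
    if j < len(pattern) and pattern[j] == "]":
        j += 1  # a leading ] is a literal member of the set
    while j < len(pattern) and pattern[j] != "]":
        if pattern.startswith("[:", j):
            # Skip a POSIX class such as [:alpha:] in one step.
            close = pattern.find(":]", j + 2)
            j = close + 2 if close != -1 else j + 1
        else:
            j += 1
    return min(j + 1, len(pattern))


def awk_to_emacs(pattern):
    """Translate an awk-style pattern to Emacs syntax (illustrative sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in "sS":
                out.append("\\" + nxt + "-")  # \s -> \s-, \S -> \S-
            elif nxt in OPERATORS_NEEDING_BACKSLASH:
                out.append(nxt)  # escaped literal in awk is bare in Emacs
            else:
                out.append("\\" + nxt)
            i += 2
        elif ch == "[":
            end = _bracket_end(pattern, i)
            out.append(pattern[i:end])  # simplification: copied verbatim
            i = end
        elif ch in OPERATORS_NEEDING_BACKSLASH:
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(awk_to_emacs(r"(foo|bar){2,3}\s+"))  # \(foo\|bar\)\{2,3\}\s-+
```

Copying bracket expressions verbatim is a simplification: awk treats a backslash inside brackets as an escape, while Emacs treats it as a literal, so patterns that rely on that would need manual review.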

What works everywhere

So for my definition of “everywhere,” with the caveats mentioned above, the following features work everywhere. YMMV.

    .
    ^, $
    […], [^…]
    *
    \w, \W, \s, \S
    \1 - \9 backreferences
    \b \B
    ? + 
    | alternation
    {n,m} for counting matches
    (...) capturing

One footnote is that gawk supports backreferences in replacement strings but not in regular expressions per se.
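One way to put the list above to work is a quick check that flags constructs outside it before a pattern goes to a colleague. The sketch below is a hedged illustration in Python, with the hypothetical name `non_portable_features`; it looks only for the two kinds of feature that footnote [1] calls out, \d and look-around, and makes no attempt to be an exhaustive audit.

```python
# Perl-style look-around openers. The labels are illustrative.
LOOKAROUND_PREFIXES = {
    "(?=": "lookahead",
    "(?!": "negative lookahead",
    "(?<=": "lookbehind",
    "(?<!": "negative lookbehind",
}


def non_portable_features(pattern):
    """List constructs in `pattern` that fall outside the common subset.

    Only \\d, \\D, and look-around are checked. Bracket expressions are
    not parsed, so a set like [(?=] would be flagged by mistake.
    """
    found = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in "dD":
                found.append("\\" + nxt)
            i += 2  # consume the escape pair so \\d is not misread as \d
            continue
        if ch == "(":
            for prefix, name in LOOKAROUND_PREFIXES.items():
                if pattern.startswith(prefix, i):
                    found.append(name)
                    break
        i += 1
    return found


print(non_portable_features(r"\w+\s\1"))      # []
print(non_portable_features(r"\d{3}(?=px)"))  # ['\\d', 'lookahead']
```

An empty result only means that none of the checked constructs appear; the caveats about awk word boundaries and Emacs backslashes still apply.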

[1] To some extent, basic Perl features work elsewhere and advanced features do not, depending on your idea of what is basic or advanced. I think of look-around features as advanced, and that tracks. But I also think of \d for digits as basic, and that’s not supported in many regex flavors.