Show HN: GNU grep as a PHP extension

Original link: https://github.com/hparadiz/ext-gnu-grep

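## grep PHP Extension Summary

`grep` is a new PHP extension, built as a C module from the GNU `grep` source code (GPLv3-or-later). It is *not* a wrapper around the command-line `grep` tool; it is a native PHP module that exposes `grep` functionality directly.

The extension currently focuses on building a robust matching engine that supports fixed-string and regular-expression searches (including common PHP regex shorthands) through GNU `grep`'s internals. Main features include recursive search, binary-file handling, and customizable output formatting. A core `ggrep()` function provides a simplified entry point for common `grep`-style operations.

Development prioritizes compatibility with the upstream `grep` engine, verified against a standalone GNU `grep` build through benchmarks and correctness tests. Future work includes implementing the remaining CLI options (such as `-P`), richer text rendering, and extending the API for more advanced use cases. The project provides tools for building the extension and comparing its performance against the original `grep` implementation.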

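## GNU Grep as a PHP Extension - Summary

A new PHP extension that brings GNU Grep's capabilities directly into PHP code has been shared on Hacker News. Developed by hparadiz (github.com/hparadiz), it aims to give PHP applications efficient text search.

Initial feedback, however, focused on the license choice, GPLv3-or-later, which may deter adoption by companies with proprietary projects. One commenter noted that an MIT license would be more attractive in those cases. The license is constrained by the extension's direct adaptation of GNU Grep code.

Another point raised was the need for a clear comparison showing the extension's advantage over simply running `grep` through `shell_exec`. The developer needs to demonstrate the performance benefit proactively to encourage adoption, because users are unlikely to benchmark it themselves.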

Original text

grep is a greenfield PHP extension project implemented in C and built as a shared object with phpize.

This repository keeps the upstream GNU grep source in vendor/grep and builds a native PHP module around a separate extension entrypoint. It is not a PHP userspace wrapper around the grep CLI.

Vendored upstream commit:

  • 071ac3aa76a575dd55dc184570da2192adafe267

GNU grep is GPLv3-or-later. If this extension links against, embeds, or adapts GNU grep internals, the resulting combined work has GPL implications for distribution. That constraint is intentional and should stay explicit in project documentation and release artifacts.

The repository-level license notice is in LICENSE, and the full GNU GPLv3 text is vendored in vendor/grep/COPYING.

The tree now contains a real PHP extension skeleton:

  • config.m4
  • php_grep.h
  • php_grep.c
  • tests/*.phpt
  • vendor/grep

The current vertical slice is intentionally small:

  • module loads as grep.so
  • exposes grep_version(): array
  • exposes GNUGrep\Engine
  • exposes GNUGrep\Pattern
  • supports fixed-string matching via GNU grep's upstream Fcompile/Fexecute
  • supports basic and extended regular expressions via GNU grep's upstream GEAcompile/EGexecute
  • supports common PHP-style regex shorthands like \d, \D, \s, \S, \w, \W, \h, and \H on the GNU basic/extended regex path
  • supports a substantial GNUGrep\Engine::run() slice for prominent grep --help switches
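
A quick way to confirm that this slice is actually visible to a given PHP process is to probe for the symbols listed above. The following is a minimal sketch, not part of the repository: the helper name grepExtensionSurface() is hypothetical, and the extension name 'grep' is an assumption inferred from the module file name grep.so.

```php
<?php

// Hypothetical helper: report which documented symbols this PHP process can see.
// Assumption: the extension registers itself under the name 'grep'.
function grepExtensionSurface(): array
{
    return [
        'extension_loaded' => extension_loaded('grep'),
        'grep_version()'   => function_exists('grep_version'),
        'GNUGrep\\Engine'  => class_exists('GNUGrep\\Engine'),
        'GNUGrep\\Pattern' => class_exists('GNUGrep\\Pattern'),
    ];
}

foreach (grepExtensionSurface() as $symbol => $available) {
    printf("%-18s %s\n", $symbol, $available ? 'available' : 'missing');
}
```

Run without the module loaded, every row should read "missing", which makes the script usable as a smoke test before and after adding the extension to the PHP configuration.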

Implemented run() option slices:

  • pattern modes: -G, -E, -F
  • matcher controls: -i, -v, -w, -x
  • recursive search: -r, -R, -d skip|recurse
  • binary handling: -I, -a, -U, --binary-files=without-match|text|binary
  • result shaping: -n, -c, -l, -L, -m
  • output controls: -b, -H, -h, -o, -Z
  • pattern sources: -e, -f
  • file selection: --include, --exclude, --exclude-dir, --exclude-from
  • context controls: -A, -B, -C, -NUM, --group-separator, --no-group-separator
  • stdin and record modes: --label, -z

PCRE mode, richer text-rendering flags like -T and --line-buffered, and colorized CLI formatting are still follow-up work. The extension is being built out engine-first, with matcher parity and benchmark harnesses added slice by slice.

tools/build_upstream_grep.sh
phpize
./configure --enable-grep
make

The built module will be written to modules/grep.so.

The PHPT suite uses --EXTENSIONS-- grep, so the tests execute against the freshly built module.

Compare Against Upstream GNU grep

Build standalone upstream GNU grep from the vendored source:

tools/build_upstream_grep.sh

Then run a side-by-side correctness and timing check against the extension:

tools/compare_with_upstream.sh

That harness:

  • generates a deterministic benchmark fixture tree
  • runs standalone GNU grep with -RnI 'abstract class'
  • runs the extension on the same fixture
  • diffs the normalized outputs
  • records repeated wall-clock timings for both paths
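
For a rough idea of what the timing step involves, the sketch below times a shelled-out grep against ggrep() from PHP. It is not the repository's harness: timeRuns() and median() are hypothetical helpers, the search root is a placeholder, the 2>/dev/null redirect assumes a Unix-like shell, and the ggrep() branch only runs when the extension is loaded.

```php
<?php

// Run $fn $iterations times and return wall-clock durations in milliseconds.
function timeRuns(callable $fn, int $iterations): array
{
    $timings = [];
    for ($i = 0; $i < $iterations; $i++) {
        $start = hrtime(true);                       // monotonic clock, nanoseconds
        $fn();
        $timings[] = (hrtime(true) - $start) / 1e6;  // convert to milliseconds
    }
    return $timings;
}

// Median of a list of numbers; returns 0.0 for an empty list.
function median(array $values): float
{
    $n = count($values);
    if ($n === 0) {
        return 0.0;
    }
    sort($values);
    $mid = intdiv($n, 2);
    return $n % 2 ? (float) $values[$mid] : ($values[$mid - 1] + $values[$mid]) / 2;
}

$root = __DIR__ . '/src';   // assumption: any directory tree you want to search

$candidates = [];
if (function_exists('shell_exec')) {
    $command = 'grep -RnI ' . escapeshellarg('abstract class') . ' ' . escapeshellarg($root) . ' 2>/dev/null';
    $candidates['shell_exec grep'] = fn() => shell_exec($command);
}
if (function_exists('ggrep')) {
    $candidates['ggrep()'] = fn() => ggrep('-RnI abstract class', $root);
}

foreach ($candidates as $label => $fn) {
    printf("%-16s median %.3f ms\n", $label, median(timeRuns($fn, 10)));
}
```

The median is used instead of the mean so that a single slow run (cold filesystem cache, scheduler noise) does not dominate the result. Numbers from a sketch like this are only indicative; the repository script also diffs the outputs, which this sketch does not.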

That benchmark now exercises the global ggrep() helper for the extension side, so it measures the actual short userspace entrypoint instead of an older internal helper path.

If you want to benchmark in-memory grep work directly, use:

php -n -d extension=modules/grep.so tools/benchmark_ggrep_pipe.php \
  '-iE fatal|panic|timeout' \
  /path/to/captured-output.log \
  100 \
  'captured-output'

That is useful for shell_exec() / pipeline-style usage where startup and filesystem traversal are not the main cost.

php -d extension=/absolute/path/to/modules/grep.so -r 'var_dump(grep_version());'

For a full PHP-visible reference, see docs/USERSPACE_API_REFERENCE.md.

<?php

$pattern = GNUGrep\Pattern::fixedString('TODO');

var_dump($pattern->matches("TODO: wire GNU grep internals\n"));
var_dump(GNUGrep\Engine::versionInfo());
var_dump(GNUGrep\Engine::match(
    'abstract class (Alpha|Beta)Base',
    "abstract class BetaBase\n",
    GNUGrep\Pattern::MODE_EXTENDED_REGEXP
));
var_dump(GNUGrep\Engine::run(['-RnI', 'abstract class', __DIR__ . '/src']));
var_dump(GNUGrep\Engine::run(['-Rniw', 'model', __DIR__ . '/src']));
var_dump(GNUGrep\Engine::run(['-Rn', '-e', 'alpha', '-e', 'beta', __DIR__ . '/src']));

ggrep() is now the shortest userspace entrypoint. Pass GNU grep-style args as a string or array, then give it either paths or in-memory text:

<?php

$literalMatches = ggrep(
    '-F lamb',
    'Mary had a little lamb'
);

$errorMatches = ggrep(
    '-iE fatal|panic|timeout',
    shell_exec('php artisan about 2>&1') ?? '',
    'artisan about'
);

$httpBlob = <<<HEADERS
GET /checkout HTTP/1.1
Host: payments.internal.example
Authorization: Bearer redacted-token
X-Forwarded-For: 203.0.113.9
X-Request-Id: req-7f3a

HTTP/1.1 503 Service Unavailable
Set-Cookie: session=secret; HttpOnly; Secure
Content-Security-Policy: default-src 'self'
Strict-Transport-Security: max-age=31536000
CF-Ray: 89abc123-sea
HEADERS;

$headerMatches = ggrep(
    '-niE ^(Host|Authorization|X-Forwarded-For|X-Request-Id|Set-Cookie|Content-Security-Policy|Strict-Transport-Security|CF-Ray):',
    $httpBlob,
    'checkout trace'
);

$tokenMatches = ggrep(
    ['-nE', '-e', '\d+', '-e', 'alpha\w+', '-e', 'space\shere'],
    "id=42\nslug=alpha_beta\nspace here\n",
    'token demo'
);

// Folder search, equivalent to a practical grep -RnI style search.
$classMatches = ggrep(
    '-RnI abstract class',
    __DIR__ . '/src'
);

// PHP 8.5 pipe operator works cleanly with a tiny wrapper.
$findLamb = fn(string $input): array => ggrep('-F lamb', $input);
$pipedMatches = 'Mary had a little lamb' |> $findLamb(...);

On the GNU basic and extended regex modes, the extension also accepts common PHP-style shorthand tokens such as \d, \D, \s, \S, \w, \W, \h, and \H.
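
To make the idea concrete, the sketch below rewrites those shorthand tokens into POSIX bracket expressions in plain PHP. It is illustrative only and makes no claim about how the extension translates them internally: expandShorthands() is a hypothetical function, [[:blank:]] is only an approximation of PHP's \h, and shorthands that appear inside an existing [...] expression are not handled.

```php
<?php

// Illustrative only: one possible mapping from PHP-style shorthands to POSIX
// bracket expressions. This is NOT the extension's actual implementation.
function expandShorthands(string $pattern): string
{
    $map = [
        'd' => '[[:digit:]]',  'D' => '[^[:digit:]]',
        's' => '[[:space:]]',  'S' => '[^[:space:]]',
        'w' => '[[:alnum:]_]', 'W' => '[^[:alnum:]_]',
        'h' => '[[:blank:]]',  'H' => '[^[:blank:]]',
    ];

    // Consume every backslash escape as a pair, so an escaped backslash
    // followed by a plain "d" is left untouched.
    return preg_replace_callback(
        '/\\\\(.)/s',
        static fn(array $m): string => $map[$m[1]] ?? $m[0],
        $pattern
    );
}

echo expandShorthands('id=\d+'), PHP_EOL;       // id=[[:digit:]]+
echo expandShorthands('space\shere'), PHP_EOL;  // space[[:space:]]here
```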

For the common "grep folders like grep -RnI" case, the class helpers still exist:

<?php

use GNUGrep\Engine;
use GNUGrep\Pattern;

$matches = Engine::grep('abstract class', __DIR__ . '/src');

$matches = Engine::grep('BetaLeaf', [
    __DIR__ . '/src',
    __DIR__ . '/tests',
], Pattern::MODE_FIXED_STRING);

$matches = Engine::grepFixed('TODO', [
    __DIR__ . '/src',
    __DIR__ . '/docs',
]);

These helpers assume:

  • recursive traversal
  • line-numbered results
  • binary files treated as without-match
  • -R-style recursive directory handling

Use ggrep() when you want the shortest userspace form. Use GNUGrep\Engine::run(array $argv) when you want exact CLI-style argv control. Use the class helpers when you want explicit path-only or buffer-only intent.

Example: Search PSR-4 Autoload Trees

If you want a quick inventory of PHP type declarations across one or more PSR-4 autoload roots, point GNUGrep\Engine::run() at those folders and search for the declaration forms you care about:

<?php

use GNUGrep\Engine;

$autoloadRoots = [
    __DIR__ . '/src',
    __DIR__ . '/modules/Billing/src',
];

$matches = Engine::run([
    '-RnI',
    '-E',
    '-e', '^(abstract|final|readonly)[[:space:]]+class[[:space:]]+',
    '-e', '^class[[:space:]]+',
    '-e', '^interface[[:space:]]+',
    '-e', '^trait[[:space:]]+',
    '-e', '^enum[[:space:]]+',
    '--include=*.php',
    ...$autoloadRoots,
]);

$lines = $matches
    |> (static fn(array $rows): array => array_map(
        static fn(array $match): string => sprintf(
            '%s:%d %s',
            $match['path'],
            $match['line'],
            $match['text']
        ),
        $rows
    ));

echo implode(PHP_EOL, $lines), PHP_EOL;

That gives you a grep-style scan of the PSR-4 code roots while ignoring binary files and non-PHP assets. It is a good fit for codebase audits like "show me every abstract class, interface, trait, enum, or concrete class we autoload from these roots."
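
For audits it is often handier to see the results grouped per file. The sketch below does that in plain PHP; groupMatchesByPath() is a hypothetical helper, the sample rows are made up, and the 'path', 'line', and 'text' keys are assumed to match the row shape used in the example above.

```php
<?php

// Hypothetical helper: group structured match rows by file path.
// Assumes each row has 'path', 'line', and 'text' keys.
function groupMatchesByPath(array $rows): array
{
    $grouped = [];
    foreach ($rows as $row) {
        // Strip a trailing newline if the text carries one.
        $grouped[$row['path']][(int) $row['line']] = rtrim($row['text'], "\r\n");
    }
    ksort($grouped);   // stable, path-sorted output
    return $grouped;
}

// Made-up sample rows standing in for an Engine::run() result.
$rows = [
    ['path' => 'src/B.php', 'line' => 3, 'text' => "interface B\n"],
    ['path' => 'src/A.php', 'line' => 5, 'text' => "abstract class A\n"],
    ['path' => 'src/A.php', 'line' => 9, 'text' => "trait T\n"],
];

foreach (groupMatchesByPath($rows) as $path => $lines) {
    echo $path, PHP_EOL;
    foreach ($lines as $line => $text) {
        printf("  %d: %s\n", $line, $text);
    }
}
```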

Planned follow-up work:

  1. Add a native compiled-program abstraction that owns GNU grep matcher state instead of rebuilding it on every call.
  2. Close the remaining regex bridge gap for anchored ^...$ semantics in the generic path (the one used when -x is not given).
  3. Add the remaining major CLI slices such as -P, plus any text-rendering-only flags that need a dedicated formatted-output API instead of structured arrays.
  4. Keep adding parity PHPTs and side-by-side benchmarks before expanding the CLI surface further.