正则表达式匹配速度提升200倍

正则表达式匹配速度提升200倍
Matching Regexps 200 Times Faster

原始链接: https://eregon.me/blog/2025/03/14/matching-regexps-200-times-faster.html

Benoit Daloze 的博文重点介绍了 Ruby 中的正则表达式，尤其是在 TruffleRuby 中使用时，其性能如何超越 C 代码甚至手工编写的 SIMD 指令。文章关注 `json` gem 的字符串转换过程，比较了使用正则表达式的纯 Ruby 解决方案和优化的 C 扩展。虽然带有 SIMD 的 C 扩展与基线相比有所加速，但 TruffleRuby 的实现利用其高级 JIT 编译器和 TRegex，展现出显著更快的性能，比 C 和 SIMD 版本都快出几倍。`Regexp#match?`、`Time.new(String)` 和 `StringScanner#scan_integer` 的基准测试进一步说明了这种优势。 TRegex 能够将正则表达式编译成有限状态自动机，将其 JIT 编译成原生代码，利用分割，并自动使用 SIMD，这使得正则表达式匹配在 TruffleRuby 上极其高效。这使得更符合习惯且更简单的 Ruby 代码能够实现比底层实现更高的性能。文章最后总结道，在某些情况下，TruffleRuby 上的纯 Ruby 解决方案可以比 C 代码更快、更具表现力，这展示了先进 JIT 编译器的优势。

Hacker News上的一篇讨论围绕着一篇关于优化Ruby正则表达式(regex)速度的文章展开。原帖作者对优化技巧表示赞赏，并建议使用手动编写的解析代码作为替代方案以获得一致的速度，尤其是在具有SIMD访问权限的原生编译语言中。他们认为，与复杂的正则表达式相比，纯代码更简单易于维护，并将其与依赖编译器自动向量化进行了类比，后者可能不可靠。文章作者回应说，链接的C代码很复杂，并强调了正则表达式的简洁性和表达能力。他们承认Ruby缺乏SIMD访问权限，并且编写SIMD很繁琐。虽然他们同意依赖自动向量化的风险，但他们认为正则表达式优化是一个更小、更可预测的领域，在该领域中，使用SIMD几乎总是会有益的，并且不太可能被放弃。另一位评论者建议使用pcre2 gem来获得更大的速度提升，潜在提升幅度高达1000倍。

我将 serde_json 字符串的速度加快了 20% 2024-08-25

Ripgrep 比 grep、ag、Git grep、ucg、pt、sift 更快 (2016) 2023-12-01

（评论） 2023-11-30

替代实施问题 2024-05-14

Ruby：一种很棒的 shell 脚本语言 2024-06-24

原文

Written by Benoit Daloze @eregontp
on March 14, 2025

TLDR: Regular Expressions can be faster than C code and even hand-written SIMD code.

You might have seen @byroot’s excellent blog post series on optimizing the json gem. From the first blog post it’s clear most of the time for generating JSON is spent in generate_json_string() and specifically in convert_UTF8_to_JSON(), i.e., in converting Ruby Strings to JSON Strings.

Because this is such a hot part of JSON generation, it has received a lot of attention. In part 3 @byroot shows a lookup table approach to implement convert_UTF8_to_JSON(), which is now used in the json gem.

There has been a couple attempts since then to optimize it further using SIMD instructions. This is however quite difficult and messy to do in C as it’s similar to writing inline assembly (N times, with N the number of SIMD variants), and it’s not portable so it needs both compile-time and run-time detection to pick the correct SIMD variant.

In this blog post we compare 3 contenders for the job of converting Ruby Strings to JSON Strings.

The pure-Ruby version is so simple it fits in a few lines (and a big Hash literal):

ESCAPE_MAP = {
  "\x0" => '\u0000', "\x1" => '\u0001', "\x2" => '\u0002', "\x3" => '\u0003',
  "\x4" => '\u0004', "\x5" => '\u0005', "\x6" => '\u0006', "\x7" => '\u0007',
  "\b" => '\b', "\t" => '\t', "\n" => '\n', "\xb" => '\u000b',
  "\f" => '\f', "\r" => '\r', "\xe" => '\u000e', "\xf" => '\u000f',
  "\x10" => '\u0010', "\x11" => '\u0011', "\x12" => '\u0012', "\x13" => '\u0013',
  "\x14" => '\u0014', "\x15" => '\u0015', "\x16" => '\u0016', "\x17" => '\u0017',
  "\x18" => '\u0018', "\x19" => '\u0019', "\x1a" => '\u001a', "\x1b" => '\u001b',
  "\x1c" => '\u001c', "\x1d" => '\u001d', "\x1e" => '\u001e', "\x1f" => '\u001f',
  '"' => '\"', '\\' => '\\\\',
}

def utf8_to_json(string)
  '"' + if /["\\\x0-\x1f]/.match?(string)
    string.gsub(/["\\\x0-\x1f]/, ESCAPE_MAP)
  else
    string
  end + '"'
end

The ESCAPE_MAP is a bit verbose but it’s a simple way to exhaustively store how to escape each character which needs to be escaped. The match? is not strictly necessary but it avoids the allocation of a new String for the common case of no character to escape.

We use ruby/json’s string encoding benchmarks with a small diff to run only those:

diff --git a/benchmark/encoder.rb b/benchmark/encoder.rb
index f0a05db..efbac2b 100644
--- a/benchmark/encoder.rb
+++ b/benchmark/encoder.rb
@@ -33,6 +33,8 @@ def implementations(ruby_obj)
 end
 
 def benchmark_encoding(benchmark_name, ruby_obj, check_expected: true, except: [])
+  return unless ["mixed utf8", "mostly utf8"].include?(benchmark_name)
   json_output = JSON.dump(ruby_obj)
   puts "== Encoding #{benchmark_name} (#{json_output.bytesize} bytes)"
 
@@ -40,6 +42,7 @@ def benchmark_encoding(benchmark_name, ruby_obj, check_expected: true, except: [
   except.each { |i| impls.delete(i) }
 
   Benchmark.ips do |x|
+    x.warmup = 5
     expected = ::JSON.dump(ruby_obj) if check_expected
     impls.values.each do |name, block|
       begin

We select the benchmarks we want with an early return and we increase warmup to increase the stability of the benchmark results. We use ONLY=json because the SIMD PR doesn’t have the json_coder variant (anyway both of these give very similar results for the selected benchmarks).

JSON String Escaping Benchmark

The benchmarks are straightforward:

# mixed utf8
JSON.generate([("a" * 5000) + "€" + ("a" * 5000)] * 500)
# mostly utf8
JSON.generate([("€" * 3333)] * 500)

They both dump to JSON an Array of 500 strings and end up generating a JSON result of 5MB.

Here are the results on my machine (AMD Ryzen 7 3700X 8-Core Processor), with frequency scaling disabled and the performance CPU governor:

C extension with CRuby 3.4.2 with YJIT (on ruby/json master):

$ ONLY=json ruby --yjit -Ilib:ext benchmark/encoder.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
mixed utf8    397.953 (± 4.5%) i/s    (2.51 ms/i)
mostly utf8   402.388 (± 0.5%) i/s    (2.49 ms/i)

We will use this as the baseline, specifically the mixed utf8 benchmark to keep things simple.

C extension + SIMD with CRuby 3.4.2 with YJIT (on the SIMD PR):

$ ONLY=json ruby --yjit -Ilib:ext benchmark/encoder.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
mixed utf8      1.498k (± 4.0%) i/s  (667.68 μs/i)
mostly utf8     1.474k (± 1.6%) i/s  (678.55 μs/i)

3.76 times faster, nice!

Pure-Ruby generator with TruffleRuby 24.1.1 JVM (on ruby/json master):

$ ONLY=json ruby -Ilib:ext benchmark/encoder.rb
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
mixed utf8      8.104k (± 0.6%) i/s  (123.40 μs/i)
mostly utf8     8.070k (± 1.1%) i/s  (123.91 μs/i)

20 times faster than the baseline and 5.4 times faster than SIMD!
And, of course much simpler, this is just a few lines of Ruby code and a Regexp.

If you guessed earlier that the C extension code + SIMD would be fastest, you may be surprised to learn the pure-Ruby version is considerably faster (on TruffleRuby)! That’s the benefit of an advanced JIT compiler which understands Ruby semantics like TruffleRuby’s JIT compiler (called Graal): you get to write nice, idiomatic Ruby and the JIT compiler does the hard work of optimizing it for your target platform.

For comparison, here’s the pure-Ruby generator with CRuby 3.4.2 with YJIT (on ruby/json master):

$ ONLY=json ruby --yjit -Ilib:ext benchmark/encoder.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
mixed utf8     39.551 (± 0.0%) i/s   (25.28 ms/i)
mostly utf8    28.119 (± 0.0%) i/s   (35.56 ms/i)

Unfortunately, it’s 10 times slower than the baseline, demonstrating the Regexp approach is much slower on CRuby. But, we ran with YJIT enabled, so why is it so much slower?

As a disclaimer, TruffleRuby (which uses the pure-Ruby generator) is currently slower than CRuby using the C extension on JSON.generate macro benchmarks (except canada.json where it is faster). Other aspects of JSON generation need to be better optimized on TruffleRuby.

Regexp#match? Benchmark

Let’s simplify the benchmark to just Regexp#match? to understand better:

require 'benchmark/ips'

mixed_utf8 = ("a" * 5000) + "€" + ("a" * 5000)

Benchmark.ips do |x|
  x.warmup = 5

  x.report '/["\\\x0-\x1f]/.match?(mixed_utf8)' do
    raise if /["\\\x0-\x1f]/.match?(mixed_utf8)
  end
end

$ ruby --yjit bench_regexp_match.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
20.700k (± 2.4%) i/s   (48.31 μs/i)

$ ruby bench_regexp_match.rb
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
 4.699M (± 0.3%) i/s  (212.83 ns/i)

TruffleRuby is 227 times faster than CRuby at Regexp matching on this benchmark! TruffleRuby matches 10003 bytes in 213 nanoseconds, that is 47 bytes per nanosecond, which is greater than the number of CPU cycles we have per nanosecond (this processor has a max frequency of 4.4 GHz), so what’s going on here?

How can Regexp matching be so fast on TruffleRuby?

There are multiple reasons:

Regexps on TruffleRuby use TRegex, a state-of-the-art Regular Expression Engine:
TRegex compiles regexps to finite-state automata, meaning it never goes back in the input string and will read each character at most once. Therefore, it always runs in linear time. CRuby uses backtracking which may go over the input string characters many times, which is much slower.
Backtracking is prone to ReDoS when backtracking too much while finite-state automata is immune to ReDoS. CRuby 3.2+ has a protection for most regexps against ReDoS by using some form of caching. While that caching avoids many ReDoS, it also slows down the regexp execution by needing to check or fill this cache every time it might backtrack.
TRegex JIT-compiles regexps to native code. CRuby compiles the regexp to regexp bytecode and then uses an interpreter to run that bytecode. That means it has dispatch overhead, does not optimize across bytecodes, etc. YJIT does not know anything about that regexp interpreter, it sees it as a blackbox because it is all written in C.
TruffleRuby uses splitting so that each call site for Regexp#match? and other Regexp-related methods has its own copy of the logic. With that, TruffleRuby is able to inline the JIT-compiled code for that specific Regexp in the Ruby method calling match?.
TRegex automatically uses SIMD, and specifically the best SIMD variant available on the running processor. It is able to do so because of the JIT compiler, which queries the CPU for which SIMD variants are available. TRegex does not need to dispatch between different SIMD variants like C would since the JIT compiler knows the best SIMD variant available. So we get to use SIMD instructions, but we don’t have to complicate a Ruby gem to benefit.

Specifically for this Regexp TRegex internally uses a ByteIndexOfCodePointSetNode, which automatically creates a lookup table for the character range in that regexp (["\\\x0-\x1f]) and also automatically uses SIMD. If you would like to know more about TRegex, the creator of TRegex and I did a talk about it at RubyKaigi 2021. Kevin Menard also gave a talk about ReDoS and solutions at RailsConf 2022.

The JSON String Escaping example above is not an isolated case where TruffleRuby coincidentally is faster. Having a fast regular expression engine allows for more idiomatic solutions and allows Ruby code running on TruffleRuby to be faster than C code in many cases. Here are two more real world examples where TruffleRuby uses Regexps and it’s faster than C code!

Time.new(String)

Time.new since Ruby 3.2 accepts a String argument.

CRuby implements it by parsing strings with char pointers in about 80 lines of C. TruffleRuby implements it in a straightforward way with a Regexp.

require 'benchmark/ips'

NOW_STRING = Time.now.to_s

Benchmark.ips do |x|
  x.report("Time.new(String)") do
    Time.new(NOW_STRING)
  end
end

$ ruby --yjit bench_time_new_string.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
Time.new(String)      1.595M (± 0.5%) i/s  (626.84 ns/i)

$ ruby bench_time_new_string.rb
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
Time.new(String)      3.923M (± 5.4%) i/s  (254.90 ns/i)

TruffleRuby is 2.5 times faster here, while having a much simpler and cleaner implementation.

Let’s benchmark just the Regexp, although we use match here since we need the capture groups:

require 'benchmark/ips'

NOW_STRING = Time.now.to_s

TIME_REGEXP = /\A (?<year>\d{4,5})
(?:
  - (?<month>\d{2})
  - (?<mday> \d{2})
  [ T] (?<hour> \d{2})
     : (?<min>  \d{2})
     : (?<sec>  \d{2})
     (?:\. (?<usec> \d+) )?
 (?:\s* (?<offset>\S+))?
)?\z/x

Benchmark.ips do |x|
  x.report("TIME_REGEXP.match(NOW_STRING)") do
    raise unless TIME_REGEXP.match(NOW_STRING)
  end
end

$ ruby --yjit bench_time_new_string_regexp.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
TIME_REGEXP.match(NOW_STRING)      1.234M (± 0.8%) i/s  (810.47 ns/i)

$ ruby bench_time_new_string_regexp.rb
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
TIME_REGEXP.match(NOW_STRING)     33.448M (± 9.9%) i/s   (29.90 ns/i)

TruffleRuby is 27 times faster for this Regexp. So the Time.new(String) benchmark is actually spending a lot of time in creating and initializing the Time object.

StringScanner#scan_integer

StringScanner#scan_integer is a method recently added to StringScanner.

CRuby implements it in 114 lines of C. TruffleRuby implements it in 18 lines of Ruby using Regexps. We use TruffleRuby master here since the method was introduced after the last TruffleRuby release.

require 'benchmark/ips'
require 'strscan'

SCANNER = StringScanner.new("123456789")

Benchmark.ips do |x|
  x.report("StringScanner#scan_integer") do
    SCANNER.reset
    raise unless 123456789 == SCANNER.scan_integer
  end
end

$ ruby --yjit bench_scan_integer.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
StringScanner#scan_integer     10.530M (± 0.6%) i/s   (94.97 ns/i)

$ ruby bench_scan_integer.rb
truffleruby 25.0.0-dev-48a0a626, like ruby 3.3.5, Oracle GraalVM JVM [x86_64-linux]
StringScanner#scan_integer     30.230M (± 3.0%) i/s   (33.08 ns/i)

TruffleRuby is about 3 times faster. BTW, TruffleRuby implements StringScanner entirely in Ruby code. Ruby code is not only shorter and more expressive, it’s also more correct. To that point, while implementing new StringScanner methods we found 6 issues with the C extension implementation of StringScanner: 2 segfaults and 4 inconsistent/incorrect behaviors.

Let’s also benchmark just the Regexp for completeness:

require 'benchmark/ips'

INPUT = "123456789"

Benchmark.ips do |x|
  x.report("/\A[+-]?\d+/.match(INPUT)") do
    raise unless /\A[+-]?\d+/.match(INPUT)
  end
end

$ ruby --yjit bench_scan_integer_regexp.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
/A[+-]?d+/.match(INPUT)      1.972M (± 0.8%) i/s  (507.00 ns/i)

$ ruby bench_scan_integer_regexp.rb
truffleruby 25.0.0-dev-48a0a626, like ruby 3.3.5, Oracle GraalVM JVM [x86_64-linux]
/A[+-]?d+/.match(INPUT)     61.086M (± 6.9%) i/s   (16.37 ns/i)

TruffleRuby is 31 times faster for this Regexp.

Conclusion

Sometimes, such as in the cases shown in this blog post, the idiomatic pure-Ruby solution is also the fastest and even faster than C code or inline assembly. This is not new on TruffleRuby, pure-Ruby code is often faster than everything else, including C extensions on CRuby. Ruby code is more expressive and higher-level which makes some optimizations possible that otherwise wouldn’t be if written in a lower-level language.

If you want to run idiomatic Ruby code fast, give TruffleRuby a try.