Recently I have been working on benchmarks comparing techniques for calculating UTF-8 sequence lengths.
The method I described in Decoding UTF-8. Part IV: Determining Sequence Length - Counting Leading Bits uses hardware-assisted counting of the lead byte's leading one bits:
#include <bit>

int utf8_sequence_length(unsigned char lead_byte) {
    switch (std::countl_one(lead_byte)) {
    case 0: return 1;
    case 2: return 2;
    case 3: return 3;
    case 4: return 4;
    default: return 0; // invalid lead
    }
}
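As a quick sanity check, here is how the function classifies a few representative lead bytes (the byte values are my own illustrative picks, not part of the benchmark):
#include <cassert>

int main() {
    assert(utf8_sequence_length(0x41) == 1); // 'A': 0xxxxxxx, ASCII
    assert(utf8_sequence_length(0xC3) == 2); // 110xxxxx: two-byte sequence
    assert(utf8_sequence_length(0xE2) == 3); // 1110xxxx: three-byte sequence
    assert(utf8_sequence_length(0xF0) == 4); // 11110xxx: four-byte sequence
    assert(utf8_sequence_length(0x80) == 0); // 10xxxxxx: continuation byte, invalid as a lead
}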
I was pretty disappointed with the performance of utf8_sequence_length. It was processing between 438 MB/s and 462 MB/s of text data, far slower than the “naive” method described in Decoding UTF-8. Part II: Determining Sequence Length - a Straightforward Approach.
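The throughput numbers come from scanning a large text buffer; the exact harness is not important here, but a minimal sketch of the kind of measurement loop involved (names and structure are illustrative, not the actual benchmark code) could look like this:
#include <chrono>
#include <cstddef>
#include <string_view>

// Illustrative throughput measurement: walk the buffer, advancing by the
// reported sequence length and treating invalid leads as single bytes so
// the scan always makes progress.
double measure_mb_per_s(std::string_view text) {
    const auto start = std::chrono::steady_clock::now();
    std::size_t i = 0;
    while (i < text.size()) {
        const int len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
        i += (len > 0) ? static_cast<std::size_t>(len) : 1;
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return text.size() / 1e6 / elapsed.count(); // MB processed per second
}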
As we saw in the Part IV post, the compiler (clang++ 18.1.3, aarch64) emits a small lookup table to avoid branching:
adrp x9, .Lswitch.table.utf8_sequence_length(unsigned char)
add x9, x9, :lo12:.Lswitch.table.utf8_sequence_length(unsigned char)
ldr w0, [x9, w8, uxtw #2]
...
.Lswitch.table.utf8_sequence_length(unsigned char):
.word 1
.word 0
.word 2
.word 3
.word 4
The lookup table generated by the compiler is a degenerate case of a jump table, which is a common optimization for switch-case statements. The “naive” code from Part II uses branching instructions instead and processes over 2000 MB/s. Obviously, the branch predictor did its magic for the case of pure ASCII text, but the performance gap still surprised me.
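The Part II code is not reproduced here, but the idea is a chain of comparisons against the lead byte's bit patterns; a branch-based classifier in that spirit (my own sketch, not necessarily the exact code from Part II) looks roughly like this:
// Branch-based classification of the lead byte, in the spirit of the
// "naive" approach from Part II (sketch only).
int utf8_sequence_length_naive(unsigned char lead_byte) {
    if (lead_byte < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead_byte & 0xE0) == 0xC0) return 2; // 110xxxxx: two-byte sequence
    if ((lead_byte & 0xF0) == 0xE0) return 3; // 1110xxxx: three-byte sequence
    if ((lead_byte & 0xF8) == 0xF0) return 4; // 11110xxx: four-byte sequence
    return 0;                                 // continuation byte or invalid lead
}
On pure ASCII input the first comparison is always taken, which is exactly the pattern a branch predictor handles well.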
Playing with the Linux perf tool on a WSL Linux installation turned out to be a pretty frustrating experience. It did point me to a very high number of cycles stalled in the back-end, which often indicates that the CPU is waiting for memory, but I couldn’t get perf annotate to work.
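For reference, the back-end stall counts can be read with a plain perf stat invocation along these lines (the binary name is a placeholder; whether the generic stall events are available depends on the PMU exposed to the environment):
perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./utf8_bench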
In any case, what I had was enough to let me try disabling jump tables; with
clang++ -O2 -fno-jump-tables --std=c++20
the switch-case part of the function compiles to:
cmp w0, #2
b.gt .LBB0_4
cbz w0, .LBB0_6
cmp w0, #2
b.eq .LBB0_5
.LBB0_3:
mov w0, wzr
ret
.LBB0_4:
cmp w0, #3
ccmp w0, #4, #4, ne
b.ne .LBB0_3
.LBB0_5:
ret
.LBB0_6:
mov w0, #1
ret
Lo and behold, the performance became comparable to the “naive” function from Part II.
I repeated the experiment using GNU g++ for AArch64. In that case, -fno-jump-tables
had no effect because g++ never emitted a lookup table.
Not surprisingly, it turned out I was not the first to make this observation. For instance, there is an excellent article by Julian Squires from 2017 on the same topic, but focusing on the x86-64 platform: Are Jump Tables Always Fastest?