PivCo-Huffman “合并”操作

原文

There’s a new paper out called “PivCo-Huffman” (HTML version with annotations here) and it’s very interesting. Normal Huffman decoding (and, to a lesser extent, encoding) is inherently quite serial. We can get explicit parallelism by using multiple streams, which scales just fine to moderate numbers of streams – something like 4-8 is usually not an issue. Not very suitable for vectorization or wide vector machines like GPUs, though: every extra stream adds signaling overhead in the bitstream, and decoding from many streams at once is a gather-heavy operation. GPUs can cope with gathers okay (although spread-out data accesses still consume a lot more cycles than more compact memory access patterns do), most everything else is very unhappy with extremely gather-heavy patterns.

Especially since our usual table-based decoders will then immediately follow up with another gather, although that part is more easily fixed: canonical codes admit reasonably easy table-less decoding (at least for the critical “length determination” part), if desired. These approaches are not particularly juicy in a scalar one-at-a-time decoder, where a single load is very cheap, but if you’re working on 32-wide vectors, doing a bunch of integer math that saves you a gather is a more interesting proposition.

You can also use ANS-style interleaving to turn many logical bitstreams into a single physical bitstream. This is what GDeflate does to get a bitstream that has more potential parallelism without gathering from all over the place. Interleaving solves the “reads all over memory” problem and works great on GPUs and CPUs that provide suitable facilities (e.g. AVX-512 “EXPAND” family instructions, or just fast gathers from the same cache line), but is slow without that kind of hardware support. It also forces you to pick a magic number (the interleave factor) as part of the bitstream definition. Too low, and wide vector machines like GPUs get bad utilization when decoding. Too high, and scalar machines can’t efficiently decode the bitstream at all. Worse, numbers that are just right for say AVX-512 or GPUs overlap, but the smallest number of interleaved streams most GPUs or even CPUs with something like AVX-512 can get decent utilization with is around 32, the largest number scalar or narrower vector machines with typical register counts can comfortably decode is around 8-16 (depending on how exactly you do it), and in the middle between 16 and 32 is a valley of tears where nobody’s happy.

That’s why you don’t see much of this style of bitstream outside of GDeflate; it is inherently reliant on you picking a magic number that has a huge influence on every single implementation, yet at the same time is subject to ever-shifting hardware details. Yes, NVidia GPUs have been built around warps with 32 “threads” (more like SIMD lanes) using 32 bits per register each for a good while, but AMD used to prefer 64 (before switching to mostly-32 a few generations ago), Intel GPUs were designed around 8, some mobile GPUs used to do as little as 4, and 4 is also the number of 32-bit lanes you get for something like ARM NEON or SSE4.2 on the x86 side, later widened to 8 with AVX2. Finally, ARM SVE and the RISC-V V extension leave it implementation defined, which is one thing when the model is that you’re running a simple vector math kernel like a SAXPY, but quite another when the physical vector width ends up being a magic constant that is baked into some persistent wire format, potentially for decades.

The final major popular solution to the “this is very serial” conundrum is to completely brute-force it. Fetch a bunch of bits from the input. In parallel, start decoding at bit 0, and at a bit 1, at bit 2, 3, and so forth. Do this for a lot of sequential candidate positions. I’m talking something like 16 or more.

Eventually, we discover that the code starting at bit 0 was 6 bits long. Cool! That means we discard all the work we did at bit offsets 1 through 6, and look at what our already pre-decoded code starting at bit 6 was. That one is 5 bits long, so we discard the speculative work we did for bits 7 through 10, and find that starting at bit 11, there’s another code that’s 5 bits long. Congratulations, we’ve just decoded 3 values for the price of 16.

This works just fine. Believe it or not, this is actually the preferred method of decoding such a sequential bitstream in hardware if a throughput of more than one value per cycle is required. Especially with canonical codes and not-too-large length limits, this can be made reasonably cheap. Chasing down the sequence of actual values to emit at the end is a potential scalability problem, but 3 or 4 at a time isn’t too bad, and if that’s not enough, there’s better algorithms to chase these links that give yet more parallelism. In short, this can be done, and if you build dedicated hardware for it, it’s not bad, but it’s not something you ever want to do in SW on general-purpose hardware. I’ve implemented it out of curiosity, and even on a GPU, it turns out routinely throwing away more than 80% of the work you do is not a recipe for fast code, and let’s not even talk about wasted energy when you try to pull this kind of stunt on a power-constrained battery-powered device.

So far, that’s been our main options: we can decode sequentially one at a time, but that doesn’t scale. We can use multiple separate bitstreams, which theoretically gives us something like thread-level or perhaps instruction-level parallelism, but that amount of parallelism is baked into the bitstream and adds some amount of overhead (since each decoder threads needs to know where to start). It’s also not friendly to SIMD-style implementations because the divergent memory access patterns make us sad. We can use interleaved bitstreams which have much better memory access locality and are a good match to SIMD and GPU architectures in principle, but need us to pick a magic number at bitstream specification time that will forever limit the amount of decoder parallelism we can get, but also ruin our day if the number we pick is high enough to make some of our target HW run out of registers. And even within the same class of device (say, GPUs), there’s not anything resembling agreement on what that number ought to be. Finally, we can shrug, be extremely wasteful, and discard most of the work we did immediately. It’s parallelism all right, but it’s hardly work-efficient.

Enter PivCo

Marcin Żukowski’s “Pivco-Huffman” turns this on its side. Literally. I’ll go through this rather quickly to get to the meat of this post, refer to the paper for more details. For concreteness, suppose we want to encode the string “abracadabra”. Building the standard Huffman tree (it’s a textbook algorithm that I’m not going to explain here) yields something like this picture:

Ordinarily, to code any symbol, you start at the root and send the labels of the edges you need to take to get to the appropriate leaf of the tree. The “a” encodes as just “0” in this tree, “b” encodes as “100”, r as “101” and so forth. We send each symbol fully before moving on to the next one.

PivCo-Huffman instead conceptually processes the entire string at once, pushing it slowly down the tree. Starting at the root node, we see that every position that contains an “a” sends a “0”, and everything else sends a “1”. Lining the string up, we can see what bits are output for the root node:

  abracadabra
  01101010110

These bits go straight into our output bitstream. The left subtree of the root node is just the “a” leaf, so every position that sent a 0 is effectively done. But the positions that weren’t “a” aren’t done yet. Lets work out what those are, and what the next step they need to take is:

  brcdbr
  001100

These are the bits output for the right child of the root node. In general, at every internal node, we partition the symbols that got this far into two smaller subsequences: one that descends into the left subtree and one that descends into the right subtree. In our running example, the subsequence that continues left is “brbr”, while the right subsequence is just “cd”. We proceed our preorder tree traversal this way until we’ve pushed everything down to the leaves.

Note this is sending the exact same bits regular Huffman coding using this tree would, just in a different order.

To decode this, we need to know how many symbols are encoded in our bitstream, let’s say N. Every symbol must have passed through the root node, so we know the bit string for the root node has N bits in it. We can count how many 0s and 1s there are in that bit string, which tells us how many symbols make it into the left (0) and right (1) subtrees. Then, recursively, we can do the same thing with these nodes. This lets us locate where the individual nodes’ bits end up in the final bitstream, and how many symbols pass through them.

Finally, on the way back up the tree, now that we have the lengths of the subsequences, we can reconstruct what they might have been in a bottom-up merge process.

Take the parent node for the “b” and “r” leaves, for example. We know that 4 symbols “pass through there”, and the corresponding bitstream turns out to be “0101”. This means that the subsequence of symbols coded in this subtree is “brbr”. Its sibling node just has two bits, “01”, which encodes “cd”.

Their common parent is the 6-symbol node emitting the “001100” bit string that we looked at earlier. We just worked out that the 4 symbols coded by its left (“0”) subtree are “brbr”, and the 2 symbols coded by the right (“1”) subtree are “cd”. That means the symbols coded by the entire node are “brcdbr”. Finally, we keep pushing this up one more level to merge in the a’s, and we get “abracadabra” back.

Fine, but why would we ever do this? This just seems like regular Huffman coding, but in batch-processing mode and with extra steps.

The answer is that we’ve transformed Huffman encoding into a sequence of in-order list partitioning steps, and decoding into a sequence of list merges where instead of doing comparisons, we get a bitstream telling us which of the two lists to take the next element from. Regular Huffman encoding isn’t so bad, but decoding it is, as discussed above, an extremely serial process. This type of list merging is not: figuring out which list elements to grab reduces to a prefix sum, which while not trivial, is the prototype for a class of parallel algorithm (scans) that we can do quite well, and it doesn’t need the bitstream to be written with a particular vector width (or similar) in mind. It easily scales both up and down with the capabilities of the target machine.

There are many other possible optimizations: for example, the tree I showed contains a complete binary tree for the 4 symbols “b”, “r”, “c” and “d”. Sure, you could encode (or decode) this as a sequence of 3 binary list partitions (or merges, on the decoder side), but you could also just send 2 bits per symbol at once and be done with it. Variable-length-to-fixed-alphabet decoding (like Huffman does) is tricky, but a subset that is fixed-length-to-fixed-alphabet is trivially parallelized. If we don’t want to scan over all of these lists to figure out how many “1”s there are, we can store this information for some or all of the Huffman tree nodes, spending extra bits, but saving the decoder having to scan over (potentially) tens of thousands of bits. And so forth.

So while the baseline scalar implementation of this idea is not very interesting, it sits squarely in the subset of tasks that can be mapped well to essentially any data-parallel substrate, be it GPUs, vector instruction sets of any stripe, or even just threads, which makes it worth playing with. And certainly the numbers attained in the paper are encouraging, with the major caveat that the main results are, not surprisingly, on either x86-64 servers with AVX-512 or Apple Silicon CPUs – i.e. very high-end hardware.

This bodes well for the future of the algorithm, but wire formats need to work for everyone, not just the highest-end targets.

Hence the question: how cheap can we get the basic primitives? In this post, I’ll be looking at the “merge” operations in the bottom-up decoder specifically.

The merge operation

Here’s what one of those merges boils down to, in Python-like pseudocode:

  def merge(l_list, r_list, bitstream):
      output = []
      l_pos, r_pos = 0, 0
      for bit in bitstream:
          if bit == 0:
              output.append(l_list[l_pos])
              l_pos += 1
          else:
              output.append(r_list[r_pos])
              r_pos += 1
      return output

That’s all it is. That’s what we’re working with. Note that this algorithm has the invariant that l_pos + r_pos equals the number of elements processed (and output elements written) so far, so really, tracking both of them is overkill. Either implies the other. And r_pos is the (exclusive) prefix sum of all bit values encountered so far.

As noted above, this essentially reduces to a mid-semester homework problem in parallel programming 101. It’s a very simple algorithm, which is good news for us because we really need something fast, and when something is this straightforward, there’s usually many different ways of doing it, and we can pick whatever works best on our current target. But we’ll need something a bit lower-level than our quasi-Python for the remainder of this. In particular, we need to pick some actual data types. In the following, I’ll be assuming that the symbols we’re trying to code are bytes, which is both practically relevant and consistent with my earlier posts on the subject. Our bitstream will be bit-packed in the obvious way.

The most direct match exists with something like AVX512_VBMI2, which offers VPEXPANDB, giving us a type of byte-level masked load operation where active SIMD lanes (where the corresponding mask bit is 1) load from sequential addresses and inactive lanes are either left alone or zero-initialized, without the load address increasing. It’s intended to load sparse data into a vector register and is essentially a perfect match for what we need. I’ll not use the official intrinsic names in the following because the notation is rather intimidating if you’re not used to it, but the process looks something like this, switching from Pseudo-Python to Pseudo-C:

  void merge(uint8* output,
      const uint8* l_list, const uint8* r_list,
      const uint8* bits, usize count)
  {
      // None of these code snippets will handle boundary cases like the
      // last few elements, nor will they do any bounds checking.
      // A real implementation absolutely needs to do this, but that's for
      // real code, not pseudocode in a blog post. You've been warned.

      // AVX-512 can process 64 bytes per iteration!
      for (usize i = 0; i < count; i += 64)
      {
          // Mask has bits set for elements coming from r_list
          // (LE order is natural here)
          uint64 mask = read64LE(&bits[i / 8]);
        
          // Read l_list entries
          // leave rest zero-initialized for now
          uint8x64 result = zero_masked_vexpandb_from_mem(l_list, ~mask);

          // Read r_list entries
          result = masked_vexpandb_from_mem(result, mask, r_list);

          // Store the result to the output buffer
          vstore(&output[i], result);

          // Determine population count (number of 1 bits) in mask to
          // figure out how much to advance the r_list and l_list pointers by
          uint64 npop = popcount(mask);
          r_list += npop;
          l_list += 64 - npop;
      }
  }

There’s a bit of futzing around to load the masks and advance the pointers, but this is pseudocode for a functional AVX512_VBMI2 implementation of the basic merge algorithm. Not necessarily optimal, but I’m not going to get that far into the weeds in this post. It has three actual vector instructions – two loads and one store, the rest is logistics.

There’s not much of a lesson here beyond “this sure is easy when your machine has a ‘do the thing’ instruction”. That’s why I won’t be spending any more words on AVX512 versions of anything in this post; it’s just our high anchor to prove that yes, with a restrictive enough hardware target, this turns into basically nothing.

The one thing I’ll say is that the actual prefix summing turned almost invisible in this version. VPEXPANDB implicitly does a kind of prefix sum over the active lanes to figure out which bytes to route where, and then at the end we do a population count over all 64 mask bits we handled in one go to figure out how much to move the list pointers by. This is going to be how things work.

In essence, all I’ll be doing in the rest of this post is figure out ways to get as close as possible to the loop above without requiring AVX512_VBMI2. On the x86 side, that means we probably want to have some option around AVX2 (256-bit integer vectors) and another around the SSE4.2 feature level. We can do something efficient for slightly lower-end targets as well (SSSE3 is about as low as we can realistically go here), but these are minor variations. On the ARM side, I’ll mostly look at NEON/AdvSIMD; I’ll leave SVE and RVV to people with access to relevant hardware.

The basic approach: 8 bytes at a time

SSE4.2 and NEON-level vector registers are much narrower than AVX512s name-giving 512 bits. For now, for our lower-end targets, we’re stuck with registers fitting 128 bits (or 16 bytes) at a time. We’ll see how to make use of that full width later, but for now, let’s keep things straightforward so we can familiarize ourselves with our tools.

Our main workhorses will be PSHUFB (introduced with SSSE3) on the SSE side and TBL on the ARM side (I’m only looking at 64-bit ARM here, the relevant instructions also exist in 32-bit ARM, although only producing 8 bytes of output at a time, but I’m working mainly on 64-bit code these days). TBL gets up to four (consecutive) source registers that are interpreted as a table of bytes, another source register that specifies a byte vector of indices into said table, and writes the results of all those table lookups into the destination register. Which is to say, it’s a generic any-byte-to-any-byte shuffle operation. Out-of-range index values produce the number 0.

PSHUFB in x86 is quite similar, with a few differences: first, the “table” is always a single 128-bit vector register (16 bytes). No more, no less. Second, only the bottom 4 bits select the table index; bits 4 through 6 are ignored. Functionally, that means table addressing is mod 16. Third, setting the MSB (bit 7) in the index vector does result in a 0 byte being returned in that SIMD lane.

These properties mean that one-128b-vector TBL in AArch64 and PSHUFB in x86-64 are mostly interchangeable, as long as we avoid indices between 16 and 127 (inclusive) where their behavior differs.

If we process 8 bytes at a time, that also means we process 8 bits from the bitstream at a time, i.e. a single byte’s worth, which is convenient. 8 bits means 256 different combinations, which is small enough that we can reasonably precompute a shuffle index vector for all possible combinations and just store them in a table. That table will be 256*8 bytes = 2KiB. And we can speculatively grab the next 8 bytes from both l_list and r_list, combine them into a single 16-byte (128-bit) vector, and use TBL/PSHUFB to index into it. This results in something like this algorithm:

  void merge(uint8* output,
      const uint8* l_list, const uint8* r_list,
      const uint8* bits, usize count)
  {
      // 8 bytes per iteration now
      for (usize i = 0; i < count; i += 8)
      {
          // Load 1 byte's worth of bits from bitstream
          uint8 in_bits = bits[i / 8];

          // Look up the corresponding index vector
          uint8x16 index_vec = index_vec_table[in_bits];

          // Speculatively, read 8 bytes from both lists
          uint8x8 l_bytes = vload64(l_list);
          uint8x8 r_bytes = vload64(r_list);

          // Concatenate them into a single 128-bit vector
          uint8x16 lr_bytes = vcombine(l_bytes, r_bytes);

          // Shuffle/table lookup
          uint8x16 result = vshuffle(lr_bytes, index_vec);

          // Store the result to the output buffer
          vstore(&output[i], result);

          // Determine population count (number of 1 bits) in
          // mask to figure out how much to advance the r_list
          // and l_list pointers by
          usize npop = popcount(in_bits);
          r_list += npop;
          l_list += 8 - npop;
      }
  }

I’m not showing the code to generate the index vector table, but all you have to do is prepare an 8-entry L list containing the values 0 through 7, a corresponding 8-entry R list containing the values 8 through 15, and then use the scalar merge algorithm with all 256 possible 8-bit patterns. The output lists are the index vectors to use to control the shuffle instructions. This basic approach will work for all table-based techniques in this post; I won’t be calling it out explicitly every time.

Also, we might not have a 8-bit popcount instruction available, but a small 256-entry table with byte values will do in that case. Another 256 bytes added to our 2KiB table budget won’t break the bank.

This is straightforward enough, and there is no immediately obvious way to improve it further. Our 256-entry 2KiB table of precomputed index vector is fine. The next larger step up that divides neatly into the operation sizes we have available is to work 16 bits at a time. That would require a 65536-entry table with 16B per table entry (since we’re now producing 16 output bytes), for 1MiB worth of tables. A 2KiB table comfortably fits in any relevant CPU’s L1D cache; 1MiB won’t fit in anyone’s L1, and if every table access is a L1D miss (possibly L2 miss as well), this loop’s performance will take a sharp nosedive.

In short, a single table lookup to retrieve our index vector is completely practical for 8 items at a time, but if we want to get up to 16, we need something different.

Towards 16 bytes at a time

We have another problem: 64-bit ARM has 2-register TBL, which can index into a table of 32 bytes in one instruction, and 2-register TBL executes as a single uop on many (though not all) implementations, so this is not only available, but also efficient. PSHUFB on x86, however, really only exists as a 16-byte version. (AVX2 VPSHUFB does two parallel 128-bit PSHUFBs; table entries are always fetched from within the same 128-bit slice of the vector the corresponding index value lives in.)

We can synthesize wider PSHUFBs out of regular ones using extra instructions (multiple PSHUFBs into smaller sub-tables and then a branchless select to choose the right sub-table), but this is relatively expensive. It’s much better to just handle the L and R list bytes separately: while the combined L and R list candidate bytes sum up to 32, we can consume at most 16 bytes from either list. And PSHUFB lets us generate 0 bytes if we set the corresponding index MSB. That means we probably want to set up into something like

    uint8x16 l_result = vshuffle(l_bytes, l_index_vec);
    uint8x16 r_result = vshuffle(r_bytes, r_index_vec);
    uint8x16 result = vor(l_result, r_result);

But now we need to come up with two index vectors. Or do we? If we set the MSB of the index vector, it doesn’t matter what the other bits are, PSHUFB (and TBL) will always return 0. And the items coming from the L list are always exactly in the spots where the R list items are not. Therefore, we can use the same index vector for both. We choose one list (in the following, that will be the R list) to use regular shuffle indices, while the other will use shuffle indices with their MSB set. After we use that index vector for one list, we can flip all the index MSBs via exclusive-OR, which gets us the index vector to use for the other list. That yields

    uint8x16 r_result = vshuffle(r_bytes, index_vec);
    uint8x16 l_result = vshuffle(l_bytes, vxor(index_vec, 128));
    uint8x16 result = vor(r_result, l_result);

Cool. A single bitwise operation is much better than having to come up with two separate index vectors. Speaking of which, we now have a plausible way of generating 16 output bytes at a time from a suitable index vector, but where do we get that index vector from in the first place?

For the first half of the index vector, we can use the same kind of table we already have. And, in theory, we could use the same table again for the second half, with just one problem: the second half doesn’t start at the beginning of l_bytes or r_bytes, it has to start right where the first half left off.

That means that if the first byte had say 3 set bits, the index vector for the second half needs to add 3 to every index into r_bytes, and 8 – 3 = 5 to every index into l_bytes. A straightforward implementation of this idea would need to:

do a table lookup for the first half of the mask (8 lanes worth)
do another table lookup for the second half of the mask (another 8 lanes worth)
determine the population count for the bitstream byte corresponding to the first 8 vector lanes (probably some ANDing or shifting to isolate that byte, then a table lookup or native popcount instruction where available)
get this population count over to a vector register and create at least 8 copies of it
for the second-half-of-vector index vector, determine a bit mask saying whether a given byte comes from the L or R half (we can tell from the index MSB)
add the replicated population count to all the “R” lane indices for the second half (this will usually by a bitwise AND followed by an ADD)
calculate 8-vector_popcount
add 8-vector_popcount to all the “L” lanes indices for the second half (AND-NOT and ADD, or similar)
finally, combine the two index vector halves

That’s a lot of work. Working 16 bytes at a time is clearly worth something, but as written this is at least 8 instruction equivalents more than processing the data in 8-byte chunks. Sure, we save on some pointer manipulation and such, but on the vector side, this looks pretty dire. We need something better.

We could tune the above a bit to improve it slightly, but saving an instruction here and there is not going to cut it. What we really need is a radical simplification.

Here’s the elevator pitch: we need to add that popcount (which I’ll call pop0) to the R indices, and 8 - pop0 to the L indices, and it is that sign difference that really causes problems for us. What if the L indices were negated, though? Then we would have to add -(8 - pop0) to them, which is pop0 - 8, which means we would be adding popcount to every lane (the -8 bias portion, we’ll get to in a minute).

Negating the L values doesn’t quite work, because the L index can be 0, and we need to be able to tell from the MSB whether a lane comes from the R or L list to make our “final shuffle” logic work out. But what we can do is agree on the convention that a byte that is supposed to come from r_list[i] stores i in its location in the index vector, and bytes from l_list[i] store 255 - i = i ^ 255 as an index.

For the shuffle indices in the second half, we still need to add pop0 to the R indices to account for what happened in the first half, which is straightforward. For the L indices, we need 255 - (i + (8 - pop0)) = (255 - 8) - i + pop0.

We still expect to get the shuffle indices for the second half from a table. If we make a separate table for that second half where the precomputed L indices are (255 - 8) - i not 255 - i (R remains the same), this doesn’t cost us any extra computation in the loop. And finally, note that pop0 only depends on the first mask byte that we already use for the first-half table lookup. If we make those table entries store 16 bytes not 8, we can just store 8 copies of pop0 in the second half of every table entry. Put it all together, and we end up with this pseudocode for our 16-bytes-at-a-time loop:

  void merge(uint8* output,
      const uint8* l_list, const uint8* r_list,
      const uint8* bits, usize count)
  {
      for (usize i = 0; i < count; i += 16)
      {
          // Load 2 byte's worth of bits from the bitstream
          uint16 in_bits = read16LE(&bits[i / 8]);

          // Two table lookups
          uint8x16 index0 = index_vec_tab0[in_bits & 0xff];
          uint8x8 index1 = index_vec_tab1[in_bits >> 8];

          // index1 needs to go into the second half of the 128-bit vector
          uint8x16 index1_wide = vcombine(vzero_uint8x8(), index1);

          // Add the two values together to perform the pop count adjustment
          uint8x16 index_vec = vadd(index0, index1_wide);

          // Speculatively, read the next 16 bytes from both l_list and r_list
          uint8x16 l_bytes = vload128(l_list);
          uint8x16 r_bytes = vload128(r_list);

          // Shuffle bytes to combine
          uint8x16 r_result = vshuffle(r_bytes, index_vec);
          uint8x16 l_result = vshuffle(l_bytes, vnot(index_vec));
          uint8x16 result = vor(r_result, l_result);        

          // Store the result to the output buffer
          vstore(&output[i], result);

          // Determine population count (number of 1 bits) in mask to
          // figure out how much to advance the r_list and l_list pointers by
          usize npop = popcount(in_bits);
          r_list += npop;
          l_list += 16 - npop;
      }
  }

As mentioned in the text, index_vec_tab0 has 256 16-byte entries, where the first 8 bytes of every table entry are the index permutation from doing a scalar merge of r_list = { 0, 1, ... , 7 } and l_list = { 255 - 0, 255 - 1, ..., 255 - 7 }. The second 8 bytes are just the population count of the table index repeated 8 times.

index_vec_tab1 as accessed in this code has 256 8-byte entries corresponding to the scalar merge results with the same r_list as before and a modified l_list = { 247 - 0, 247 - 1, ..., 247 - 7 }. So this version needs around 6KiB of tables, more than before but still a reasonably comfortable amount.

If you don’t want that “combine” operation (which usually itself turns into some kind of vector shuffle internal to the CPU), you can also make the second table have 16-byte entries where every entry has 8 bytes of zeros followed by the actual shuffle indices (now properly aligned in the second half). That means you save some shuffling but add another 2KiB of table size, and, it turns out, some extra uops from address computation. Both x86 and AArch64 implementations can usually multiply indices by 8 “for free” as part of load address generation, but not by 16, so this doesn’t normally save an instruction, it just trades a vector shuffle/shift operation for an address generation operation. In practice they’re usually close, but the smaller load tends to do a bit better, and between that and the 2KiB smaller table size, it’s the variant I prefer.

For a SSSE3/SSE4.2 target, this general approach is the best I’ve managed to come up so far. The exactly implementation details get a bit tweaky, and you might want to unroll some more to save on loop overhead, but this is the gist of it.

AVX2 still does, basically, this – now just on two 128-bit vectors at once. Due to the way AVX2 works (crossing 128-bit boundaries is hard and relatively expensive), this is your best bet. In effect, it’s functionally mostly an unrolled version of the 128-bit approach. Building the index vectors is really a 128-bit vector operation (arguably, it’s really a 64-bit vector operation and we’re already combining two 64-bit halves into a 128-bit vector in the computation above, although it may not look like it), and in AVX2 we do all that work twice, then pay an extra shuffle to merge the two 128-bit halves into a 256-bit register.

Likewise, we end up having to load l_list and r_list as two separate 128-bit halves, which is also no real win over working with 128-bit vectors. The only real 256-bit vector operations are in the “shuffle bytes to combine” and “store” stage. As a result, AVX2 is a bit faster than SSE4.2, but not by anything like 2x. It’s closer to 1.2x for me, and even that much took a lot of care.

I’ve looked into other approaches for AVX2 that don’t use any big tables at all and compute the shuffle vectors instead. These approach are more “naturally 256-bit” if that makes sense, and they do have better vector width utilization, but doing it this way also takes several extra instructions, and while I managed to get fairly close to the performance of the table-based approach, I never managed to beat it. I might write it up at some point.

Ironically, I am positive that approach would easily come out ahead once you step up to 512-bit vectors, but there’s just not much reason to. With AVX512_VBMI2, we have a full hardware solution that we’re never going to beat in software, and “AVX512 but not VBMI2” is a relatively small niche at this point of almost exclusively server CPUs that are 6+ years old and tend to drastically down-clock when 512-bit vector code is used anyway. Maybe some day? But I haven’t spent much though on it.

The NEON special

The algorithm as written above works just fine on NEON, and its performance is good. There’s nothing wrong with it.

It’s just that it happens we can do a bit better on machines where 2-table 128-bit TBL is fast (which includes Apple Silicon and newer high-end ARM Cortex/Neoverse cores). Our load-shuffle-and-store portion can be just this (and this time I’m using straight NEON intrinsics because this is actually NEON-specific):

    // Load source bytes
    uint8x16x2_t src;
    src.val[0] = vld1q_u8(r_list);
    src.val[1] = vld1q_u8(l_list);

    // Shuffle to form result
    uint8x16_t result = vqtbl2q_u8(src, index_vec);

    // store results
    vst1q_u8(dest, result);

Okay, that part is very nice, but how do we get the index vectors for this? We still want fetches from r_list to use positive indices and fetches from l_list to use negative indices so that our “add pop0 to everything” gambit continues to work. But, if we’re using 2-vector TBL, we also need the final l_list indices to be 16-based, because the l_list bytes go into indices 16 through 31 of the table.

So, this time, our index vectors are actual int8s, and store either i for a byte coming from r_list[i] or -(16 + i) for an entry coming from l_list[i]. We can just use this convention, and keep the table setup otherwise the same as in the previous step. Right before we do the shuffle, we want to negate our table indices for L list entries, which boils down to taking the absolute value of everything. That’ll leave the (positive) r_list indices alone and negate the (currently negative) l_list indices like we need.

Therefore, if we did that, we would end up with something like

    // Load index vector parts
    int8x16_t index0 = vld1q_s8(index_vec_tab0[in_bits & 0xff]);
    int8x16_t index1 = vld1q_s8(index_vec_tab1[in_bits >> 8]);

    // Form merged control
    int8x16_t summed = vaddq_s8(index0, index1);
    uint8x16_t index_vec = vreinterpretq_u8_s8(vabsq_s8(summed));

This is assuming we use the variant with full 128-bit table entries. As before, we can make the second table have 64-bit entries, making it smaller but taking one more vector shuffle-type operation at runtime (usually vextq).

The final trick is realizing that NEON has an instruction for |a[i] – b[i]| for signed values, and what we need is |a[i] + b[i]|. But b[i] comes out of a table that we control. We can just negate that entire table, and now the final index vector is two loads and a single arithmetic instruction, giving this full merge operation:

    // Load index vector parts
    int8x16_t index0 = vld1q_s8(index_vec_tab0[in_bits & 0xff]);
    int8x16_t index1 = vld1q_s8(index_vec_tab1[in_bits >> 8]);

    // Form merged control
    uint8x16_t index_vec = vreinterpretq_u8_s8(vabdq_s8(index0, index1));
    
    // Load source bytes
    uint8x16x2_t src;
    src.val[0] = vld1q_u8(r_list);
    src.val[1] = vld1q_u8(l_list);

    // Shuffle to form result
    uint8x16_t result = vqtbl2q_u8(src, index_vec);

    // store results
    vst1q_u8(dest, result);

I think this 16-byte merge operation might actually be capital-O Optimal on current NEON implementations. The loads from the two lists and the store are strictly necessary; using a single 1-uop (on many uArchs) instruction to do the byte shuffling is the minimum amount of work we can do once we have those source vectors; I’ve explained above that a 1MiB LUT from full 16-bit indices is not practical; if we can’t do it in a single table load, it has to be at least two; and if we have two separate table loads, there needs to be at least one instruction to combine them.

There is some uncertainty around the rest of the loop (figuring out the popcounts, the pointer advancing, and such), the “control plane” so to speak, but I think the “data plane” side of this operation is airtight.

And I think that’s it for today! I won’t be linking to any code of my own for this, since I’ve been talking to Marcin, he’s tried the things I described in this post out in his GitHub repository for PivCo-Huffman and confirmed they’re faster in his tests, so just go there if you want to see this actually implemented. They might not be up immediately depending on when you read this, but they should land shortly.