# My favourite small hash table

Original link: https://www.corsix.org/content/my-favourite-small-hash-table

This text details a surprisingly efficient hash table design that uses open addressing with linear probing and a power-of-two table size, an approach the author calls "cute" and under-publicised. It stores each key/value pair as a single 64-bit integer (32 bits for the key, 32 bits for the value), which lets the value zero represent an empty slot and removes the need for tombstones. The core idea is the "Robin Hood" policy: on a collision during insertion, the algorithm favours whichever key is further from its ideal (hashed) position, keeping probe distances balanced. Lookups terminate efficiently on finding either an empty slot or a key with a lower "score" (derived from its distance to its ideal position). The design is tuned for 64-bit architectures and exploits cheap bit-manipulation instructions. It provides lookup, insertion (returning the previous value), tombstone-free removal, and iteration. Adaptations for non-random key distributions (via hashing) and for larger keys and values are discussed; concurrency and SIMD optimisations are out of scope. The author recommends the design as a practical, high-performance choice for many use cases.

## Summary of the Hacker News discussion

A Hacker News discussion centred on a small, efficient hash table implementation ([corsix.org](https://corsix.org)). The table takes a flat-array approach, using up to 32 GiB of memory to hold four billion values, with keys acting as array indices. This design minimises pointer chasing and thereby maximises CPU cache efficiency. Users stressed how much simple layouts and compact data structures matter for performance, noting that optimisations such as bit interleaving can raise throughput (100x in one reported case). Many commenters shared anecdotes of surprising performance wins from seemingly tiny code tweaks, underlining the limits of profilers. The implementation stores key and value together in a `uint64_t`, chiefly so that an empty slot can be detected by a cheap comparison with zero, though alternatives such as structs and unions were also discussed. The structure implements a multiset, not merely a set. The discussion also touched on the difficulty of hashing variable-length strings down to machine-word keys.

## Original article

I'm the kind of person who thinks about the design and implementation of hash tables. One design which I find particularly cute, and I think deserves a bit more publicity, is Robin Hood open-addressing with linear probing and power-of-two table size. If you're not familiar with hash table terminology, that might look like a smorgasbord of random words, but it should become clearer as we look at some actual code.

To keep the code simple to start with, I'm going to assume:

  1. Keys are randomly-distributed 32-bit integers.
  2. Values are also 32-bit integers.
  3. If the key 0 is present, its value is not 0.
  4. The table occupies at most 32 GiB of memory.

Each slot in the table is either empty, or holds a key and a value. The combination of properties (1) and (2) allows a key/value pair to be stored as a 64-bit integer, and property (3) means that the 64-bit value 0 can be used to represent an empty slot (some hash table designs also need a special value for representing tombstones, but this design doesn't need tombstones). Combining a key and a value into 64 bits couldn't be easier: the low 32 bits hold the key, and the high 32 bits hold the value.
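
As a sketch, this packing can be written as a pair of one-line helpers (the names here are mine, not from the code below):

```c
#include <stdint.h>

// Pack a key into the low 32 bits and a value into the high 32 bits.
static inline uint64_t kv_pack(uint32_t key, uint32_t val) {
  return (uint64_t)key | ((uint64_t)val << 32);
}

// Recover the key and value from a packed slot.
static inline uint32_t kv_key(uint64_t slot) { return (uint32_t)slot; }
static inline uint32_t kv_val(uint64_t slot) { return (uint32_t)(slot >> 32); }
```

Note that `kv_pack(0, 0)` yields 0, which is exactly the empty-slot encoding; property (3) guarantees no real key/value pair collides with it.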

The structure for the table itself needs a pointer to the array of slots, the length of said array, and the number of non-empty slots. As the length is always a power of two, it's more useful to store length - 1 instead of length, which leads to mask rather than length, and property (4) means that mask can be stored as 32 bits. As the load factor should be less than 100%, we can assume count < length, and hence count can also be 32 bits. This leads to a mundane-looking:

struct hash_table_t {
  uint64_t* slots;
  uint32_t mask;
  uint32_t count;
};
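
The article doesn't show construction, but a minimal initialiser is straightforward. This sketch is my own addition, not from the article; like `table_rehash` below, it relies on `calloc` zero-filling the array so that every slot starts out empty:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct hash_table_t {
  uint64_t* slots;
  uint32_t mask;
  uint32_t count;
} hash_table_t;

// Hypothetical initialiser: size must be a power of two.
// Returns 0 on success, -1 on allocation failure.
static int table_init(hash_table_t* table, uint32_t size) {
  table->slots = (uint64_t*)calloc(size, sizeof(uint64_t));
  if (!table->slots) return -1;
  table->mask = size - 1;  // power-of-two length stored as a mask
  table->count = 0;
  return 0;
}
```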

Property (1) means that we don't need to hash keys, as they're already randomly distributed. Every possible key K has a "natural position" in the slots array, which is just K & mask. If there are collisions, the slot in which a key actually ends up might be different to its natural position. The "linear probing" part of the design means that if K cannot be in its natural position, the next slot to be considered is (K + 1) & mask, and if not that slot then (K + 2) & mask, then (K + 3) & mask, and so on. This leads to the definition of a "chain": if K is some key present in the table, CK denotes the sequence of slots starting with K's natural position and ending with K's actual position. We have the usual property of open-addressing: none of the slots in CK are empty slots. The "Robin Hood" part of the design then imposes an additional rather interesting property: for each slot S in CK, Score(S.Index, S.Key) ≥ Score(S.Index, K), where:

  • S.Index is the index of S in the slots array (not the index of it in CK).
  • S.Key is the key present in slot S (i.e. the low 32 bits of slots[S.Index]).
  • Score(Index, Key) is (Index - Key) & mask.
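
Concretely, a key's score is its probe distance: how many slots past its natural position it ended up, wrapping around the power-of-two table. As a small illustration (the helper name is mine):

```c
#include <stdint.h>

// Score(Index, Key): distance of slot Index from Key's natural
// position (Key & mask), wrapping around the table.
static inline uint32_t score(uint32_t idx, uint32_t key, uint32_t mask) {
  return (idx - key) & mask;
}
```

For example, with mask = 7, a key whose natural position is 6 but which landed in slot 1 has score (1 - 6) & 7 = 3.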

These properties give us the termination conditions for the lookup algorithm: for a possible key K, we look at each slot starting from K's natural position, and either we find K, or we find an empty slot, or we find a slot with Score(S.Index, S.Key) < Score(S.Index, K). In either of the latter two cases, K cannot have been present in the table. In the function below, Score(S.Index, K) is tracked as d. In a language with a modern type system, the result of a lookup would be Optional<Value>, but if sticking to plain C, property (3) can be used to make something similar: the 64-bit result is zero if the key is absent, and otherwise the value is in the low 32 bits of the result (which may themselves be zero, but the full 64-bit result will be non-zero). The logic is thus:

uint64_t table_lookup(hash_table_t* table, uint32_t key) {
  uint32_t mask = table->mask;
  uint64_t* slots = table->slots;
  for (uint32_t d = 0;; ++d) {
    uint32_t idx = (key + d) & mask;
    uint64_t slot = slots[idx];
    if (slot == 0) {
      return 0;
    } else if (key == (uint32_t)slot) {
      return (slot >> 32) | (slot << 32);
    } else if (((idx - (uint32_t)slot) & mask) < d) {
      return 0;
    }
  }
}

If using a rich 64-bit CPU architecture, many of the expressions in the above function are cheaper than they might initially seem:

  • slots[idx] involves zero-extending idx from 32 bits to 64, multiplying it by sizeof(uint64_t), adding it to slots, and then loading from that address. All this is a single instruction on x86-64 or arm64.
  • key == (uint32_t)slot involves a comparison using the low 32 bits of a 64-bit register, which is a completely standard operation on x86-64 or arm64.
  • (slot >> 32) | (slot << 32) is a rotation by 32 bits, which again is a single instruction on x86-64 or arm64.

On the other hand, if using riscv64, things are less good:

  • If the Zba extension is present, sh3add.uw is a single instruction for zero-extending idx from 32 bits to 64, multiplying it by sizeof(uint64_t), and adding it to slots. If not, each step is a separate instruction, though the zero-extension can be eliminated with a slight reformulation to encourage the compiler to fold the zero-extension onto the load of table->mask (as riscv64 usually defaults to making sign-extension free, in contrast to x86-64/arm64 which usually make zero-extension free). Regardless, the load is always its own instruction.
  • key == (uint32_t)slot hits a gap in the riscv64 ISA: it doesn't have any 32-bit comparison instructions, so this either becomes a 32-bit subtraction followed by a 64-bit comparison against zero, or promotion of both operands from 32 bits to 64 bits followed by a 64-bit comparison.
  • If the Zbb extension is present, rotations are a single instruction. If not, they're three instructions, and so it becomes almost worth reworking the slot layout to put the key in the high 32 bits and the value in the low 32 bits.

Moving on from lookup to insertion, there are various different options for what to do when the key being inserted is already present. I'm choosing to show a variant which returns the old value (in the same form as table_lookup returns) and then overwrites with the new value, though other variants are obviously possible. The logic follows the same overall structure as seen in table_lookup:

uint64_t table_set(hash_table_t* table, uint32_t key, uint32_t val) {
  uint32_t mask = table->mask;
  uint64_t* slots = table->slots;
  uint64_t kv = key + ((uint64_t)val << 32);
  for (uint32_t d = 0;; ++d) {
    uint32_t idx = ((uint32_t)kv + d) & mask;
    uint64_t slot = slots[idx];
    if (slot == 0) {
      // Empty slot: the key was not present, so claim this slot.
      slots[idx] = kv;
      break;
    } else if ((uint32_t)kv == (uint32_t)slot) {
      // Key already present: overwrite it and return the old value.
      slots[idx] = kv;
      return (slot >> 32) | (slot << 32);
    } else {
      uint32_t d2 = (idx - (uint32_t)slot) & mask;
      if (d2 < d) {
        // Robin Hood: the resident key is closer to its natural
        // position than ours, so take its slot and re-home it.
        slots[idx] = kv;
        table_reinsert(slots, mask, slot, d2);
        break;
      }
    }
  }
  if (++table->count * 4ull >= mask * 3ull) {
    // Load factor has reached roughly 75%, so grow the table.
    table_rehash(table);
  }
  return 0;
}

To avoid the load factor becoming too high, the above function will sometimes grow the table by calling this helper function:

void table_rehash(hash_table_t* table) {
  uint32_t old_mask = table->mask;
  uint32_t new_mask = old_mask * 2u + 1u;
  uint64_t* new_slots = calloc(new_mask + 1ull, sizeof(uint64_t));
  uint64_t* old_slots = table->slots;
  uint32_t idx = 0;
  do {
    uint64_t slot = old_slots[idx];
    if (slot != 0) {
      table_reinsert(new_slots, new_mask, slot, 0);
    }
  } while (idx++ != old_mask);
  table->slots = new_slots;
  table->mask = new_mask;
  free(old_slots);
}

Both table_set and table_rehash make use of a helper function which is very similar to table_set, but doesn't need to check for overwriting an existing key and also doesn't need to update count:

void table_reinsert(uint64_t* slots, uint32_t mask, uint64_t kv, uint32_t d) {
  for (;; ++d) {
    uint32_t idx = ((uint32_t)kv + d) & mask;
    uint64_t slot = slots[idx];
    if (slot == 0) {
      slots[idx] = kv;
      break;
    } else {
      uint32_t d2 = (idx - (uint32_t)slot) & mask;
      if (d2 < d) {
        slots[idx] = kv;
        kv = slot;
        d = d2;
      }
    }
  }
}

That covers lookup and insertion, so next up is key removal. As already hinted at, this hash table design doesn't need tombstones. Instead, removing a key involves finding the slot containing that key and then shifting slots left until finding an empty slot or a slot with Score(S.Index, S.Key) == 0. This removal strategy works due to a neat pair of emergent properties:

  • If slot S has Score(S.Index, S.Key) != 0, it is viable for S.Key to instead be at (S.Index - 1) & mask (possibly subject to additional re-arranging to fill the gap formed by moving S.Key).
  • If slot S has Score(S.Index, S.Key) == 0, and S is part of some chain CK, then S is at the very start of CK. Hence it is viable to turn (S.Index - 1) & mask into an empty slot without breaking any chains.

This leads to the tombstone-free removal function, which follows the established pattern of returning either the old value or zero:

uint64_t table_remove(hash_table_t* table, uint32_t key) {
  uint32_t mask = table->mask;
  uint64_t* slots = table->slots;
  for (uint32_t d = 0;; ++d) {
    uint32_t idx = (key + d) & mask;
    uint64_t slot = slots[idx];
    if (slot == 0) {
      return 0;
    } else if (key == (uint32_t)slot) {
      uint32_t nxt = (idx + 1) & mask;
      --table->count;
      while (slots[nxt] && ((slots[nxt] ^ nxt) & mask)) {
        slots[idx] = slots[nxt];
        idx = nxt;
        nxt = (idx + 1) & mask;
      }
      slots[idx] = 0;
      return (slot >> 32) | (slot << 32);
    } else if (((idx - (uint32_t)slot) & mask) < d) {
      return 0;
    }
  }
}

The final interesting hash table operation is iterating over all keys and values, which is just an array iteration combined with filtering out zeroes:

void table_iterate(hash_table_t* table, void(*visit)(uint32_t key, uint32_t val)) {
  uint64_t* slots = table->slots;
  uint32_t mask = table->mask;
  uint32_t idx = 0;
  do {
    uint64_t slot = slots[idx];
    if (slot != 0) {
      visit((uint32_t)slot, (uint32_t)(slot >> 32));
    }
  } while (idx++ != mask);
}

That wraps up the core concepts of this hash table, so now it is time to revisit some of the initial simplifications.

If keys are 32-bit integers but are not randomly-distributed, then we just need an invertible hash function from 32 bits to 32 bits, the purpose of which is to take keys following ~any real-world pattern and emit a ~random pattern. The table_lookup, table_set, and table_remove functions gain key = hash(key) at the very start but are otherwise unmodified (noting that if the hash function is invertible, hash equality implies key equality, hence no need to explicitly check key equality), and table_iterate is modified to apply the inverse function before calling visit. If hardware CRC32 / CRC32C instructions are present (as is the case on sufficiently modern x86-64 and arm64 chips), these can be used for the task, although their inverses are annoying to compute, so perhaps not ideal if iteration is an important operation. If CRC32 isn't viable, one option out of many is:

uint32_t u32_hash(uint32_t h) {
  h ^= h >> 16;
  h *= 0x21f0aaad;
  h ^= h >> 15;
  h *= 0x735a2d97;
  h ^= h >> 15;
  return h;
}
uint32_t u32_unhash(uint32_t h) {
  h ^= h >> 15; h ^= h >> 30;
  h *= 0x97132227;
  h ^= h >> 15; h ^= h >> 30;
  h *= 0x333c4925;
  h ^= h >> 16;
  return h;
}
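
As a sanity check (my addition, with the two functions repeated verbatim so the snippet is self-contained): u32_unhash undoes each step of u32_hash in reverse order, since each multiplier in u32_unhash is the modular inverse mod 2^32 of the corresponding multiplier in u32_hash, and each xor-shift stage is undone by the paired xor-shift sequence. The pair therefore round-trips for every 32-bit input:

```c
#include <assert.h>
#include <stdint.h>

uint32_t u32_hash(uint32_t h) {
  h ^= h >> 16;
  h *= 0x21f0aaad;
  h ^= h >> 15;
  h *= 0x735a2d97;
  h ^= h >> 15;
  return h;
}

uint32_t u32_unhash(uint32_t h) {
  h ^= h >> 15; h ^= h >> 30;
  h *= 0x97132227;  // modular inverse of 0x735a2d97 mod 2^32
  h ^= h >> 15; h ^= h >> 30;
  h *= 0x333c4925;  // modular inverse of 0x21f0aaad mod 2^32
  h ^= h >> 16;
  return h;
}

// Spot-check the round-trip property on a range of inputs.
void check_roundtrip(void) {
  for (uint32_t i = 0; i < 100000; ++i) {
    assert(u32_unhash(u32_hash(i)) == i);
    assert(u32_hash(u32_unhash(i)) == i);
  }
}
```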

If keys and values are larger than 32 bits, then the design can be augmented with a separate array of key/value pairs, with the design as shown containing a 32-bit hash of the key and the array index of the key/value pair. To meet property (3) in this case, either the hash function can be chosen to never be zero, or "array index plus one" can be stored rather than "array index". It is not possible to make the hash function invertible in this case, so table_lookup, table_set, and table_remove do need extending to check for key equality after confirming hash equality. Iteration involves walking the separate array of key/value pairs rather than the hash structure, which has the added benefit of iteration order being related to insertion order rather than hash order. As another twist on this, if keys and values are variably-sized, then the design can instead be augmented with a separate array of bytes, with key/value pairs serialised somewhere in that array, and the hash structure containing a 32-bit hash of the key and the byte offset (within the array) of the key/value pair.
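
A sketch of that indirect layout might look as follows (the names and field choices are my guesses, not from the article); the slot word packs a key hash with "pair index plus one" so that the all-zero empty-slot encoding survives:

```c
#include <stdint.h>

// Full-width key/value pairs live in a dense side array;
// the open-addressed slots merely refer into it.
typedef struct {
  uint64_t key;
  uint64_t value;
} kv_pair_t;

typedef struct {
  uint64_t* slots;   // low 32 bits: hash(key); high 32 bits: pair index + 1
  kv_pair_t* pairs;  // dense array in insertion order, used for iteration
  uint32_t mask;
  uint32_t count;
} indirect_table_t;

// Storing index + 1 keeps 0 meaning "empty" even when the hash is 0
// and the referenced pair is pairs[0].
static inline uint64_t islot_pack(uint32_t hash, uint32_t pair_idx) {
  return (uint64_t)hash | ((uint64_t)(pair_idx + 1) << 32);
}
static inline uint32_t islot_hash(uint64_t slot) { return (uint32_t)slot; }
static inline uint32_t islot_idx(uint64_t slot) { return (uint32_t)(slot >> 32) - 1; }
```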

Of course, a design can only stretch so far. If you're after a concurrent lock-free hash table, look elsewhere. If you can rely on 128-bit SIMD instructions being present, you might instead want to group together every 16 key/value pairs, keep an 8-bit hash of each key, and rely on SIMD to perform 16 hash comparisons in parallel. If you're building hardware rather than software, it can be appealing to have multiple hash functions, each one addressing its own SRAM bank. There is no one-size-fits-all hash table, but I've found the one shown here to be good for a lot of what I do.
