使用浮点数作为哈希键 (2017)
Using floating point numbers as hash keys (2017)

原始链接: https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/

## 浮点数作为哈希键:摘要 虽然字符串和整数是典型的哈希表键,但任何类型都可以通过合适的哈希函数使用。本文探讨了在Lua、Python、JavaScript和C++等语言中使用浮点数作为键的挑战。 核心问题是浮点数的精度问题。舍入误差可能导致相同的逻辑值具有不同的表示形式,从而阻碍可靠的索引。标准的相等比较(`==`)也存在问题,因为存在NaN(非数字)等特殊值,NaN != NaN,导致插入后无法检索。 定义一个适用于浮点数的良好哈希函数很困难。它需要将相等的值映射到相同的哈希值,同时最大限度地减少不同值的冲突。近似比较(例如使用epsilon)会失效,因为它们会迫使许多不同的数字共享相同的哈希值。 然而,文章指出,如果相等性得到可靠的确认,则将浮点数用作键是安全的。将浮点数作为主要数值类型的语言必须解决这些问题,特别是关于+0.0/-0.0和NaN值,这些值需要在哈希函数实现中进行特殊处理。未来的文章将详细介绍不同语言如何应对这些复杂性。

这个Hacker News讨论围绕着将浮点数用作哈希键的挑战。最初的帖子链接到一篇文章,探讨了为什么为近似相等的浮点数定义哈希函数存在问题。 一个关键点是,常见的“epsilon比较”(在很小的差异范围内认为数字相等)会导致所有数字最终哈希到相同的值,从而失去了哈希函数的目的。一位评论员认为,这仅仅证明了*直接*将epsilon比较应用于哈希是错误的,而不是所有浮点数哈希都是不可能的。 进一步的讨论表明,Java通过在哈希*之前*将浮点数转换为表示其字节值的整数(并进行NaN标准化)来处理这个问题,从而避免量化。JavaScript使用“SameValueZero”进行相等性判断,区分-0和0。这次对话突出了近似浮点数比较与哈希函数精确性质之间的内在矛盾。
相关文章

原文

Strings and integers are the most common keys for hash tables, but in principle every type can be used as a hash key, provided we define a suitable comparison operator and hash function. Most languages that support hash tables also provide hash functions for all built-in types, including floating point numbers. In this series of articles I will take a look at how languages like Lua, Python, JavaScript, and C++ handle floating point keys. It turns out that there are significant differences between these languages, both in the way they handle special values like NaN and in the way they define their hash functions.

This article addresses a few general questions: what is the difference between floating point numbers and other types of hash keys? What are common pitfalls when defining hash functions for floats? How do we handle special values such as minus zero, infinity, and NaN?

Floating point keys aren’t widely used and a common recommendation is to avoid them altogether. The main argument against floating point numbers as keys is that they are “inaccurate” and therefore unsuitable for indexing hash tables. This is, of course, a common problem with floating point numbers, and it shouldn’t be surprising that rounding and truncation errors can lead to unexpected results when they affect table keys:

>>> table = {}
>>> table[0.3] = "Hello"
>>> table[0.1 + 0.2] = "World"
>>> table
{0.30000000000000004: 'World', 0.3: 'Hello'}

In general, floating point numbers as hash keys is error-prone for the same reason comparing them for equality is error-prone. There are many ways to compare floating point numbers, but unfortunately most of them are useless for hash tables: it’s easy to define different notions of “approximately equal”, but it’s hard or impossible to define hash functions that map approximately equal keys to identical hash values. Why? Let’s take a look at what happens with the common *epsilon comparison*, for example: two numbers a, b are considered equal if |a – b| < ε. With this definition of equality all numbers between -ε/2; and ε/2 are considered equal and therefore must have the same hash value h. But the numbers between 0 and ε are also equal to each other, so their hash value must be h as well. If we continue in this manner we see that all all numbers must have the same hash value h regardless of the choice of ε. The idea of approximate comparison is unfortunately hard to reconcile with the non-approximate nature of hash functions.

But we can also turn the argument from previous paragraph around: comparing two floating point numbers for equality may be error-prone in general, but in situation where it is safe, it’s also safe to use the numbers as hash table keys. After all, there’s nothing inaccurate about floating point numbers or comparison operations themselves: a number like 1010.101112 · 217 has a precise value, and comparing it to other floats is a well-defined operation. It’s only certain operations like arithmetic or decimal-to-binary conversions that introduce rounding errors. And not even all arithmetic operations on floats are inaccurate. For example, the following equations are always true:

>>> 1.0 + 2.0 == 3.0
True
>>> 0.5 * 2 == 1.0
True

In some scripting languages like Javascript or older versions of Lua, floats are the only numerical type provided by the language, so the language has to deal with floating point keys at some level.

How do you define a hash function for floating point numbers? Here are the basic requirements:

  1. Values x should be mapped to unsigned integers hash(x) that can be used to index the hash table.
  2. If x == y then hash(x) == hash(y) so that we can find keys in the hash table after inserting them.
  3. If x != y then hash(x) != hash(y) with high probability so that there are few collisions and the hash table performs efficiently.

The second condition is the one that’s most interesting because it relates the hash function to the equality operator used by the hash table. This is especially relevant for floating point numbers where the standard == operator has a few special cases that must be taken into account:

  • There are two distinct floating point zeros, +0.0 and -0.0. In general they should map to the same hash value.
  • Especially in dynamically typed languages, floating point numbers may be just one representation for a more general “number” type. In this case, we may want to map all numbers that are numerically equal (e.g., 1.0f, 1.0, and 1) to the same hash value.

NaN values pose a special problem because basically all languages follow the IEEE convention that NaN != NaN. In the context of hash tables this means that that if you insert a NaN key into a hash table, you can never find it using == since the comparison always returns false! Likewise, it becomes possible to insert an arbitrary number of NaN keys into a hash table, as the following Python example demonstrates:

>>> table = {}
>>> table[float('NaN')] = True
>>> table[float('NaN')] = False
>>> table
{nan: True, nan: False}
>>> float('NaN') in table
False

In the next few posts we will see how different programming languages handle all these issues.

联系我们 contact @ memedata.com