切分、存储、锁定 —— 哈希函数的故事

切分、存储、锁定 —— 哈希函数的故事
Chopped, Stored, Secured – The Story of the Hash Function

原始链接: https://0xkrt26.github.io/math_behind_security/2026/06/09/the-story-of-the-hash-function.html

哈希（Hashing）的概念最早由阿诺德·杜梅（Arnold Dumey）于1956年提出，它利用数学变换将数据映射到特定的内存地址，从而优化搜索效率。该过程最初被称为“索引”，其原理是将数据转换为数值（例如37进制），并应用模运算来确定其存储位置。早期的哈希主要用于数据检索，而现代的“加密哈希”则用于保护信息。通过利用“单向函数”（即易于计算但难以逆向还原的数学运算），网站可以安全地存储密码的哈希值而非明文。这些函数通常利用有限域上的高次多项式或离散幂运算，确保还原输出所需的时间和计算能力达到不可行的程度。至关重要的是，安全的哈希必须具备抗碰撞性，以确保不同的输入能产生唯一的输出。虽然这些方法目前为数字签名和区块链等应用提供了稳健的安全性，但量子计算的兴起使得开发后量子密码标准变得至关重要，以确保在未来更强大的解密能力面前，数据依然保持完整性。

这篇 Hacker News 的讨论批评了一篇题为《切割、存储、加密——哈希函数的故事》的文章，称其技术论断大体不准确。用户 **tptacek** 阐明了 SHA-2 等现代哈希函数的架构，解释称它们本质上是分组密码的置换核心。他指出，哈希函数和分组密码共享相同的基本“组件”，甚至可以使用哈希函数来推导密码。用户 **thequux** 驳斥了文章中的几个具体技术点，指出： * 加密哈希函数并不使用有限域幂运算或高次多项式。 * 密码哈希需要带有盐值的专用密钥派生函数（KDF），而非简单的哈希函数。 * 文章关于量子计算的论断具有误导性；尽管 Grover 算法在理论上会影响安全级别，但 SHA-256 在面对量子威胁时在计算上依然是安全的。总的来说，评论者提醒读者，该文章包含重大的技术错误，应持怀疑态度阅读。

原文

A Python implementation for all of these algorithms is available in my GitHub repository.

Peter Luhn came up with the idea of using math to verify and store information, Birthday problem explains the collisions in Hash tables, Rabin-Karp algorithm uses rolling hash to search strings. We’ve talked so much about hash, yet not a single time was the definition of the hash function itself provided.

When did hashing first appear?

In 1956 the idea of hashing was for the first time defined by Arnold Dumey. His passion for cryptography since 14 and mathematics degree from Columbia University brought him first to the US Army Signal Corps, then to Potter Instrument (printer producing company also interested in engineering of computer memory). Later in his interview for the Charles Babbage Institute Dumey said: “I wrote a paper, which was the first paper on hash coding, based on the work I did there [at Potter Instrument]”. In that paper, Dumey described the concept of using a mathematical transformation to map data to memory addresses for faster search and called it indexing.

How did indexing work?

In his paper, Dumey gave clear instructions.

1. To store a word in memory, first transform it into a number.

Let’s take the word “BOX” as an example. We can convert it, as Dumey suggests, into a base 37 number or just use ASCII. We will go with the first option. To do that, we map letters to their alphabetical positions

	A=1
	B=2
	...
	X=24
	Y=25
	Z=26

Then we multiply them by powers of 37.

In our example, the numeric value of the word “BOX” is

\[2 \times 37^2 + 15 \times 37^1 + 24 \times 37^0 = 2738 + 555 + 24 = 3317\]

Early computer systems and punched tapes primarily dealt with letters and numbers only: 26 letters of the English alphabet, plus 10 numbers, and a space add up to 37 available symbols.

2. Find a prime number slightly less than the number of addressable locations.

In our example, we will have 100 addressable locations. In this case, the nearest prime number is 97.

3. Divide the numeric value by the prime number…

…then throw away the quotient and use the remainder. In other words, perform our favorite modulo operation!

\[3317 \pmod{97} = 19\]

This means the word “BOX” will be stored at memory address 19.

What’s the difference between indexing and hashing?

There’s none. Later indexing will be called a polynomial hash, which we’ve already covered in Rabin-Karp algorithm a few weeks ago. The word hash itself originates in the French word hacher (“to chop”, “to mince”) which pretty much reflects what a hash function does with information before storing it in memory. For some time it was just a widespread jargon among cryptographers. In the late 1960s, the term hash appeared officially for the first time in Herbert Hellerman’s “Digital Computer System Principles”.

So that’s how cryptographic hashes work?

Not really. The idea is the same: use mathematics to transform data. However, the goal now is to also make that data protected from attackers.

The definition of the cryptographic hash that I'll give here is based on a very interesting paper, "New Directions in Cryptography"(1976) by Whitfield Diffie and Martin E. Hellman. It's engaging, easy to read, and covers not only cryptographic hashing but also RSA, so I can really recommend it for further readings.

So what is a cryptographic hash?

Hash is a value used to secure vulnerable data. For example, websites store your login information as a hash, not as plain text, to protect it from being stolen by an attacker. When you register, you enter the password (PW) for the first time. The website produces a hash by using a one-way function f(x), also called a hash function, and stores f(PW). For each of your further logins, the website will calculate f(x) with the data you enter. Only if f(x) equals f(PW) will the password be accepted.

Since your password can still be stolen on its way from your computer to the website, password hashing is usually used together with encryption protocols like TLS to protect your data in transit.

Why is storing f(PW) safer than storing PW?

The main property of the one-way function that creates a hash is its irreversibility. It’s computationally impossible to go back from hash to plain text even if you have the function used for the transformation.

Computationally means that even with the most powerful computer, it will take ages to find the original value.

Why would it be so hard to reverse a hash function?

At school we learned that every function has an inverse function: addition is reversed by subtraction, multiplication by division, an exponent by a logarithm. However, for some functions, there’s simply no known “undo button”. Such functions are called preimage resistant. Let’s look at some examples.

1. High-degree polynomial

The polynomial of degree n looks like this:

\[p(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0\]

A computer can easily calculate p(x) for any value of x by just doing linear additions and multiplications. However, to solve p(x)=y for x is much harder. We all have solved polynomials of first, second degree, maybe even third degree using quadratic and binomial formulas. But what about a fifth degree polynomial? Tenth? According to the Abel-Ruffini Theorem, “The general polynomial of degree n is not solvable by radicals for n ≥ 5”, meaning that there’re no shortcuts or special formulas for solving a high degree polynomial. Therefore, it leaves the attacker no choice but to use iterative algorithms, basically guessing the roots.

But even this is not enough. These high-degree polynomials have to be over a finite field Fp, which means after each operation a modulo of a large prime number is taken from the intermediary result. (Why “a big prime number” is explained here)

2. Discrete Exponentiation.

In other words, once again our favorite modulo operation performed on each step of bringing a to the power of x:

\[y = a^x \pmod{q}\]

This function is also easily calculated for each x. However, its reverse, logarithm over a finite field is again a hard task for an attacker.

Why do these functions have to be over a finite field, and what is a finite field?

Finite field Fp (or Galois field GF(p)) is a field with a finite amount of elements. This amount must equal the power of a prime number. For example, a finite field of 7 can only have values 0, 1, 2, 3, 4, 5, 6. No values in between like 5.99, only integers. In comparison, field R (Real Numbers) includes infinite number of elements.

In school algebra, we got used to working with infinite fields, where the graph is a curve that can be analyzed, traced, and predicted. Finite fields can be imagined more like chaotic dots, where x for y=101 and x for y=100 aren’t even close values. So instead of a “hot and cold” game, the attacker is left with no choice but to brute force the function. It is a very, very, very long process to check, let’s say, 2^256 inputs.

Are these functions always secure?

The fewer collisions there’re, the better. Imagine there’s another X that gives f(X)=f(PW). In this case it doesn’t matter which value the attacker finds, X or PW. Since f(X)=f(PW), he will still successfully pass the login.

Another problem has emerged recently. Remember how we said that it was computationally impossible to reverse the one-way function? Not anymore. Quantum computers exceed the limits of modern computers, making brute force much faster. This creates a need for post-quantum cryptography.

What else can hash be used for?

Other applications of a cryptographically secure hash function include digital signatures for document verification and blockchain. (We’ll cover them in more detail next time.)

My sources and further readings:

Arnold Dumey “Computers and Automation”
Arnold Dumey’s Interview for Charles Babbage Institute
Diffie and Hellman “New Directions in Cryptography”
Wikipedia about hash function
Chicago University about Abel-Ruffini Theorem