Why are neural networks and cryptographic ciphers so similar? (2025)

Original link: https://reiner.org/neural-net-ciphers

## An unexpected parallel: neural networks and cryptography

Despite appearing very different, training language models and encrypting data share surprising algorithmic similarities. Both fields lean heavily on **sequential and parallel processing** of information: they began with sequential designs resembling recurrent neural networks and the SHA-3 hash function, and evolved toward parallel processing with position encodings, as in the Transformer architecture and fast message authentication codes.

At their core, both use repeated **linear and nonlinear transformations** to "mix" data, achieving complexity without bespoke designs. This extends to organizing data as a grid and alternating mixing patterns: row/column mixing appears both in neural networks' attention and feed-forward layers and in ciphers such as AES and ChaCha20, favoring efficient, parallelizable operations.

These parallels are not the result of direct copying. They stem from shared underlying properties: **weak correctness requirements** (differentiability for networks, invertibility for ciphers), a focus on **complexity and thorough mixing** of information, and a **strong emphasis on performance**. These constraints naturally lead to similar solutions, deeply parallel repeated-layer mixers, a form of "algorithmic convergent evolution", and they suggest that further cross-pollination between the two fields is possible and could yield novel advances on both sides.

## Neural networks and cryptography: a surprising similarity

A recent discussion explores the surprising parallels between neural networks (NNs) and cryptographic ciphers. While not identical, both fields deal with "mixing information": making sure that changes in the input significantly affect the output. Ciphers aim for *perfect* mixing, so that neither the key nor the plaintext can be inferred, while neural networks *tolerate* imperfect mixing, since inference remains possible even through complex layers.

This tolerance matters because the ideal data flow is not known in advance, so a flexible system is needed. Both fields are also heavily shaped by hardware constraints and optimized for efficient computation. Some argue the similarity arises because both operate over enormous state spaces.

The discussion highlights how both fields benefit from maximizing entropy and share structural features such as iterated transformations and nonlinear functions. Ultimately, cryptography and machine learning are both attempts to decompose complexity into manageable units, converging on similar solutions under the same underlying pressures. Commenters also caution that surface-level similarities should not obscure a fundamental difference in goals: predictability versus approximation.

Original article

At first glance, training language models and encrypting data seem like completely different problems: one learns patterns from examples to generate text, the other scrambles information to hide it. Yet their underlying algorithms share a curious resemblance, and it’s not for lack of creativity.

Sequence processing: the sequential version

Consider the venerable recurrent neural network, feeding text token by token into a recurrent state before generating the output text:

[Figure: a recurrent function f absorbs in_0, in_1, …, in_n into a state (encoder), then generates out_0, …, out_m (decoder), delimited by <S> and <E> tokens]

This is structurally identical to the Sponge construction in SHA-3, absorbing bytes into a state before squeezing out the hash:

[Figure: the sponge construction — f absorbs in_0, …, in_n into the state (absorbing phase), then squeezes out out_0, …, out_m, with the state split into a rate and a capacity portion]

Perhaps this similarity isn’t surprising: to process variable-length input into a fixed-size state, absorbing sequentially is a natural choice.
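
A minimal sketch of that shared shape, assuming NumPy; the update functions below are toy stand-ins, not a real RNN cell or SHA-3's Keccak-f permutation:

```python
import numpy as np

def absorb(chunks, state, f):
    """The shared skeleton: fold a variable-length input into a fixed-size state."""
    for chunk in chunks:
        state = f(state, chunk)
    return state

# RNN-flavoured update: a toy recurrent cell with random, untrained weights.
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(16, 16)) * 0.1, rng.normal(size=(16, 8)) * 0.1
rnn_step = lambda h, x: np.tanh(W_h @ h + W_x @ x)

# Sponge-flavoured update: XOR the block into the "rate" part, then apply a toy permutation.
def sponge_step(state, block):
    s = state.copy()
    s[: len(block)] ^= block               # absorb into the rate
    return np.roll(s, 3) ^ np.roll(s, 7)   # stand-in mixing step, not Keccak-f

h = absorb([np.ones(8)] * 5, np.zeros(16), rnn_step)
d = absorb([np.arange(8, dtype=np.uint64)] * 5, np.zeros(16, dtype=np.uint64), sponge_step)
```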

Sequence processing: the parallel version

Modern hardware is parallel all the way down, so sequential absorbing wastes performance. Both fields found the same solution: run the expensive function f on all chunks in parallel rather than sequentially, then combine with simple addition:

[Figure: f is applied to each chunk in_0, in_1, …, in_n together with its position 0, 1, …, n, in parallel; the results are summed into out]

Addition loses ordering information, so both approaches recover ordering by adding position encodings to each chunk.

In neural networks, this construction drives the Transformer architecture, which improved upon sequential recurrent networks. In cryptography, this construction powers the fastest Message Authentication Codes.
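
A rough sketch of the shared shape, assuming NumPy; the per-chunk function and sinusoid-style encodings are illustrative stand-ins, not either field's actual construction:

```python
import numpy as np

DIM = 16
rng = np.random.default_rng(1)
W = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)

def f(x):
    # Stand-in for the expensive per-chunk function (a network block or a cipher's block function).
    return np.tanh(W @ x)

def process_parallel(chunks):
    # Position encodings restore the ordering that the final sum would otherwise erase.
    encodings = [np.sin(np.arange(DIM) * (i + 1)) for i in range(len(chunks))]
    # Each f(chunk + encoding) is independent of the others, so these calls can run in parallel.
    return sum(f(c + e) for c, e in zip(chunks, encodings))

out = process_parallel([rng.normal(size=DIM) for _ in range(10)])
```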

The basic primitive: alternating linear and nonlinear layers, repeated identically

Strip away the variable-length processing. What’s inside the core function? The same pattern in both fields: linear transform, nonlinear transform, repeat:

[Figure: a block consisting of a Linear layer followed by a Nonlinear layer, repeated]

Linear transforms provide “mixing” between different vector positions, allowing many vector elements to influence many other vector elements. Nonlinear transforms provide complexity: without them, the whole stack of layers would degenerate to a single linear transform.

Both fields repeat this identical layer many times rather than crafting bespoke structures. This focuses research and engineering effort: one layer type to analyze, and to optimize in software or in silicon.
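
A minimal sketch of that shared skeleton, assuming NumPy; both instances below use toy, untrained and unkeyed components rather than real network weights or cipher rounds:

```python
import numpy as np

def repeat_layers(state, linear, nonlinear, rounds):
    """The shared skeleton: linear mix, nonlinear step, repeated identically."""
    for _ in range(rounds):
        state = nonlinear(linear(state))
    return state

rng = np.random.default_rng(2)

# Neural-net flavour: matrix multiply (mixing) then ReLU (nonlinearity).
W = rng.normal(size=(32, 32)) / np.sqrt(32)
h = repeat_layers(rng.normal(size=32), lambda v: W @ v, lambda v: np.maximum(v, 0.0), rounds=8)

# Cipher flavour: a fixed byte permutation (linear diffusion) then an S-box lookup (nonlinearity).
perm = rng.permutation(32)
sbox = rng.permutation(256).astype(np.uint8)
c = repeat_layers(np.arange(32, dtype=np.uint8), lambda v: v[perm], lambda v: sbox[v], rounds=8)
```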

Efficient mixing: alternating rows and columns

Zoom in further. Both fields organize their state as a grid and alternate between mixing rows and mixing columns:

[Figure: the state laid out as a grid, alternating MixRow and MixColumn steps]

In neural nets: attention mixes across sequence positions (rows), while feed-forward layers mix within each position (columns). In the AES cipher: ShiftRows permutes across columns while MixColumns combines within them. The ChaCha20 cipher alternates row-wise and diagonal mixing.

This factored approach often beats mixing the entire state at once. It's often asymptotically faster when the mixing step is slower than linear: e.g. under quadratic mixing, mixing n rows of size m costs O(nm^2), versus O(n^2m^2) for mixing the entire nm-element state in one step.
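
A sketch of the factored mixing on a toy grid, assuming NumPy; the random mixing matrices stand in for AES's MixColumns or a real attention/feed-forward pair:

```python
import numpy as np

n, m = 8, 16                          # state laid out as an n x m grid
rng = np.random.default_rng(3)
M_row = rng.normal(size=(m, m)) / m   # mixes within each row
M_col = rng.normal(size=(n, n)) / n   # mixes within each column
state = rng.normal(size=(n, m))

for _ in range(4):
    state = state @ M_row.T           # mix each row:    n rows, O(m^2) each -> O(n m^2)
    state = M_col @ state             # mix each column: m cols, O(n^2) each -> O(m n^2)

# Mixing the whole n*m state in one step would need an (n*m) x (n*m) matrix: O(n^2 m^2).
```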

What’s causing the similarities?

The similarities do not appear to be due to shallow copying of ideas: the research papers and histories of the fields do not reveal much copying between the fields. Instead, there are some underlying similarities between the problem statements.

What distinguishes neural networks and symmetric cryptography from other fields of algorithm design are the following three properties.

1. The correctness property demanded of the algorithm is remarkably weak

Most algorithms face strong correctness requirements. Compilers must preserve program meaning. Databases must return exactly what was stored. Network routers must deliver the packet.

In comparison, cryptography just needs invertibility, to avoid information loss. Neural networks need just differentiability, for gradient descent. You can build a wide range of invertible or differentiable functions simply by composing smaller invertible or differentiable functions.

This freedom enables radical simplicity. Both fields build from two or three simple primitives repeated in a loop: simple enough to implement in 20 lines of code. This freedom also enables rapid experimentation: 50+ SHA-3 submissions, hundreds of attention variants. When almost any function could “work”, you can optimize your other goals more aggressively.

2. Quality requirements focus on complexity and mixing

More than the basic correctness requirement, both fields share a similar notion of quality. Cryptography needs every output bit to depend on every input bit in complicated ways. Neural networks need the outputs to make the best use of all input information. Both of these reward designs that allow every part of the state to interact with every other part of the state, over and over again. Hence the repeated mixing layers: information must flow between positions not once but many times, creating rich interdependencies.
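
A toy illustration of why the mixing has to be repeated, assuming NumPy; the round function here is an arbitrary stand-in, not a real cipher or network layer. Flip one input bit and watch the difference spread further with each round:

```python
import numpy as np

def toy_round(state):
    # One cheap round: spread information between neighbouring positions, then a per-word nonlinearity.
    state = state ^ np.roll(state, 1)
    return ((state.astype(np.uint16) * 5 + 113) % 256).astype(np.uint8)

def differing_bits(a, b):
    return int(np.unpackbits(a ^ b).sum())

x = np.arange(16, dtype=np.uint8)
y = x.copy()
y[0] ^= 1                                   # flip a single input bit
a, b = x, y
for r in range(1, 9):
    a, b = toy_round(a), toy_round(b)
    print(f"after round {r}: {differing_bits(a, b)} of 128 output bits differ")
```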

Other fields value mixing but not complexity: sorting requires every output to be compared to every input; network topologies such as Clos networks require every output to be reachable from every input. These fields tend to produce algorithms that interact all inputs with each other exactly once and then finish, whereas cryptography and neural networks repeat the interaction many times.

3. Unusually large emphasis on performance

These fields are rare among algorithmic fields in the emphasis placed on low-level hardware performance, routinely including assembly implementations and custom hardware. This emphasis arises from economic pressures such as the ubiquity of encryption and the massive scale of neural networks.

Emphasizing performance rewards simple algorithms: it makes assembly implementations or custom hardware tractable. Emphasizing performance also rewards the parallelism we saw at every level of the design: parallel sequence processing at the top level, parallel mixers like alternating rows and columns at the middle level, and linear algebra, which is easily parallelizable, at the lowest level.

Convergent evolution in algorithms

These parallels suggest something fundamental: when we demand algorithms that mix thoroughly and in a complex way, have few other correctness requirements, and perform extremely well on hardware, the best solutions may look very similar. Just as biological evolution independently invented eyes multiple times, human research seems to have invented the “deeply parallel repeated-layer mixers” structure multiple times.

We’ve already seen ideas jump between fields. RevNets brought cryptography’s Feistel networks to neural networks, enabling reversible layers that save memory. What’s next? Are there neural network analogs of Column Parity Mixers or “unaligned mixers”?
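
A minimal sketch of that borrowed idea, assuming NumPy: one additive coupling step in the RevNet style (classic Feistel networks XOR bit halves instead). The inner function f can be anything, and the step is still exactly reversible:

```python
import numpy as np

def couple_forward(x1, x2, f):
    # Update one half using only the other half; the step can always be undone.
    return x2 + f(x1), x1

def couple_inverse(y1, y2, f):
    # Recover the inputs without ever inverting f itself.
    return y2, y1 - f(y2)

f = lambda v: np.tanh(1.7 * v + 0.3)        # any function works here, invertible or not
x1, x2 = np.random.default_rng(4).normal(size=(2, 8))
y1, y2 = couple_forward(x1, x2, f)
r1, r2 = couple_inverse(y1, y2, f)
assert np.allclose([r1, r2], [x1, x2])      # the original halves come back exactly
```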
