(comments)

Original link: https://news.ycombinator.com/item?id=41286203

In theory, a large language model (LLM) can store the same data as a Hidden Markov Model (HMM). In practice, however, an LLM does not behave the way an HMM does. The difference comes down mostly to structure. An HMM operates only on individual tokens, while an LLM can learn everything from simple token-sequence rules to complex linguistic patterns. The limitation lies in the LLM's underlying structure, specifically the Transformer architecture, in which each layer's input is re-expressed as a new representation before being passed to the next layer. By the time the signal reaches the deeper layers, the low-level rules learned in the earlier layers have largely been paraphrased away. Two possible fixes: 1. During inference, remove certain "middle layers" and connect an early layer directly to a later one. This may not give the best results, because the LLM still depends on its tokenization scheme, which limits how well it works. 2. Design the architecture so that low-level information persists across the layers even though the middle layers have no need for it; for example, a LIFO-style buffer could preserve low-level information from the earlier layers and make it available to the later layers. This approach would require significant changes to existing models and would need to be tested, but it could allow an LLM to apply its learned rules consistently across the whole model.

Original text

An LLM trained on a given dataset should — at least in theory — "contain" (in a lossless-data-compression sense) a full superset of the knowledge of a Hidden Markov Model trained on the same dataset. I.e. that information is there, in the weights, in some form, and could in theory be used to reconstruct an equivalent HMM from the LLM.

Why can't we get LLMs to do what HMMs do, then?

Mostly, it comes down to the structure.

Markov models are "funny" because they just have one level of abstraction: tokens. Markov "inference" is predicting the next token, given the last N tokens and a model that knows weights for what tokens follow what N-tuples of previous tokens. And due to that limitation, the only rules that HMMs can learn are low-level rules that don't require any additional abstraction: they can't optimize for syntactically-valid English, let alone semiotically logical statements; but they can make the text "feel" good in your head [i.e. the visual equivalent of song vocals having nice phonotactics] — and so that's what training the model leads it to learn to do. And it turns out that that combination — text that "feels" good in its phrasing, but which is syntactically invalid — happens to read as "funny"!
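(For concreteness, here's a minimal sketch of that kind of model in plain Python. The corpus and the order-2 context are made up for illustration, and strictly speaking this is a visible-state Markov chain rather than a true hidden-state HMM, but it captures the "next token given the last N tokens" behavior described above.)

    import random
    from collections import defaultdict, Counter

    def train_markov(tokens, n=2):
        """Count which token follows each n-gram of preceding tokens."""
        counts = defaultdict(Counter)
        for i in range(len(tokens) - n):
            context = tuple(tokens[i:i + n])
            counts[context][tokens[i + n]] += 1
        return counts

    def sample(counts, seed, length=30):
        """Pick successors at random, weighted by how often they were seen."""
        out = list(seed)
        for _ in range(length):
            successors = counts.get(tuple(out[-len(seed):]))
            if not successors:
                break  # dead end: this context never appeared in training
            choices, weights = zip(*successors.items())
            out.append(random.choices(choices, weights=weights)[0])
        return " ".join(out)

    # toy usage with a made-up corpus
    corpus = "the cat sat on the mat and the cat saw the dog".split()
    print(sample(train_markov(corpus, n=2), seed=("the", "cat")))

Everything such a model will ever "know" lives in those context-to-successor counts; there is nowhere for a higher-level rule to go.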

LLMs aren't under the same constraint. They can learn low-level and high-level rules. Which means that they usually do learn both low-level and high-level rules.

The only thing stopping LLMs from using those low-level rules, AFAICT, is the architecture most LLMs are built on: the (multi-layer) Transformer architecture. Transformer LLMs are always a single-pass straight shot ("feed forward") through a bunch of discrete layers (individual neural networks), where at each step, the latent space (vocabulary) of the layer's inputs is getting paraphrased into a different latent space/vocabulary at the layer's outputs.
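(In code, that single-pass structure looks roughly like the PyTorch sketch below. The class and dimension names are invented for illustration, it skips embeddings and causal masking, and it isn't any particular model's code; the point is just that each block rewrites the hidden state, and nothing forces later blocks to preserve what earlier blocks encoded.)

    import torch
    import torch.nn as nn

    class StraightShotStack(nn.Module):
        """Single feed-forward pass: layer k only ever sees layer k-1's output."""
        def __init__(self, d_model=64, n_layers=8):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                for _ in range(n_layers))

        def forward(self, x):
            for layer in self.layers:
                # the representation is re-expressed at every step
                x = layer(x)
            return x

    h = StraightShotStack()(torch.randn(1, 16, 64))  # (batch, seq, d_model)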

This means that, once you get into the middle of a Transformer's layer sandwich, where all the rules about abstract concepts and semiotics reside, all the low-level stuff has been effectively paraphrased away. (Yes, LLMs can learn to "pass through" weights from previous layers, but there's almost always a training hyperparameter that punishes "wasteful" latent-space size at each layer — so models will usually only learn to pass through the most important things, e.g. proper names. And even then, quality on these "low-level" inferences is also the sort of thing that current LLM test datasets ignore, leading training frameworks to feel free to prune away these passthrough nodes as "useless.")

This problem with LLMs could be fixed in one of two ways:

1. the "now it's stupid but at least it rhymes" approach

Allow inference frameworks to simply bypass a configurable-per-inference-call number of "middle layers" of a feed-forward multi-layer network. I.e., if there are layers 1..N, take out layers K..(N-K) and directly connect layer K-1 to layer N-K+1.
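(A sketch of what that bypass might look like at inference time. It assumes a stack of same-width residual blocks, which is true of standard Transformers, so the shapes line up even though the "vocabularies" at the graft point don't; the function and variable names are made up.)

    import torch
    import torch.nn as nn

    def skip_middle_forward(layers, x, k):
        """Run only the first k and last k blocks of the stack, bypassing the middle.

        Shapes work out because every block maps d_model -> d_model, but the
        *meaning* of the activations at the graft point doesn't match, so
        without re-training this mostly degrades the model toward low-level
        behavior (which is the point of the experiment)."""
        kept = list(layers[:k]) + list(layers[len(layers) - k:])
        for layer in kept:
            x = layer(x)
        return x

    # toy usage with an untrained stand-in stack
    blocks = nn.ModuleList(
        nn.TransformerEncoderLayer(64, nhead=4, batch_first=True) for _ in range(12))
    out = skip_middle_forward(blocks, torch.randn(1, 16, 64), k=2)  # 4 of 12 layers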

At its most extreme, with layer 1 connected to layer N, this could very well approximate the behavior of an HMM. Though not very well, as — given the relatively-meaningless tokenization approach most LLMs use (Byte Pair Encoding) — LLMs need at least a few transforms to get even to the point of having those tokens paraphrased into "words" to start to learn "interesting" rules. (AFAIK in most Transformer models layers 1 and N just contain rules for mapping between tokens and words.)

Meanwhile, this would likely work a lot better with the "cut and graft" happening at a higher layer, but getting the "graft" to work would likely require re-training (since layers K-1 and N-K+1 don't share a vocabulary.)

...except if the LLM is an auto-encoder. Auto-encoder LLMs could just run an inference up their layerwise "abstraction hierarchy" to any arbitrary point, and then back down, without a problem!
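(For the auto-encoder case, the "up to an arbitrary point and back down" trick might look something like this; the layer stacks here are untrained stand-ins, and a real auto-encoder LLM would supply its own mirrored encoder/decoder layers.)

    import torch
    import torch.nn as nn

    def partial_roundtrip(enc_layers, dec_layers, x, depth):
        """Go `depth` layers up the abstraction hierarchy, then mirror back down.
        depth=1 stays close to token-level rules; full depth is a normal pass."""
        for layer in enc_layers[:depth]:
            x = layer(x)
        for layer in dec_layers[len(dec_layers) - depth:]:
            x = layer(x)
        return x

    enc = nn.ModuleList(nn.Linear(64, 64) for _ in range(6))
    dec = nn.ModuleList(nn.Linear(64, 64) for _ in range(6))
    shallow = partial_roundtrip(enc, dec, torch.randn(1, 16, 64), depth=2)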

(I'd really love to see someone try this. It's an easy hack!)

2. the "it can write poetry while being smart" approach

Figure out a way, architecturally, to force more of the low-level information from the early layers to be passed through to the late layers, despite the middle layers not having any reason to care about it. (I.e. do something to allow the LLM to predict a word Y at layer N-3 such that it rhymes with a word X known at layer 3, while not otherwise degrading its capabilities.)

Most simply, I think you could just wire up the model with a kind of LIFO-bridged layer chain — where every layer K is passing its output to the input of layer K+1; but, for any given layer K in the first half of the layers, it's also buffering its output so that it can become an additional input for its "matching" layer N-K.

This means that all the layers in the "second half" of the model would receive longer inputs, these being the concatenation of the output of the previous layer, with the output of the matching "equal in abstraction depth" input layer. (Where this equal-in-abstraction-depth association between layers isn't inherently true [except in auto-encoder models], but could be made true in an arbitrary model by training said model with this architecture in place.)
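(A rough sketch of that wiring. Everything here is invented for illustration: the block type, the dimensions, and the choice to fold the wider concatenated input through a learned projection in front of each second-half block rather than widening the blocks themselves. As noted below, it only makes sense if the model is trained with this bridging in place.)

    import torch
    import torch.nn as nn

    class LIFOBridgedStack(nn.Module):
        """2*M blocks. The first M push their outputs onto a stack; each of the
        last M pops the output of its "matching" first-half layer and consumes
        it alongside the previous layer's output."""
        def __init__(self, d_model=64, half_depth=4, nhead=4):
            super().__init__()
            self.down = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead=nhead, batch_first=True)
                for _ in range(half_depth))
            self.up = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead=nhead, batch_first=True)
                for _ in range(half_depth))
            # each second-half block sees [previous output ; bridged output],
            # projected from 2*d_model back down to d_model
            self.merge = nn.ModuleList(
                nn.Linear(2 * d_model, d_model) for _ in range(half_depth))

        def forward(self, x):
            bridge = []                 # LIFO buffer of first-half outputs
            for layer in self.down:
                x = layer(x)
                bridge.append(x)        # push layer K's output
            for merge, layer in zip(self.merge, self.up):
                skipped = bridge.pop()  # pop the equal-abstraction-depth output
                x = layer(merge(torch.cat([x, skipped], dim=-1)))
            return x

    out = LIFOBridgedStack()(torch.randn(1, 16, 64))  # (batch, seq, d_model)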

(Again, I'd really love to see someone try this... but it'd have to be done while training a ground-up base model, so you'd need to be Google or Facebook to test this.)


