Dispersion loss counteracts embedding condensation in small language models

Original link: https://chenliu-1996.github.io/projects/LM-Dispersion/

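This paper introduces the concept of "embedding condensation": token embeddings in smaller language models collapse into a narrow cone-like subspace, which limits their expressivity. The observations show that the phenomenon is more pronounced in small models than in large ones, appears across datasets, and originates at model initialization. Crucially, the authors show that knowledge distillation from a larger model does not mitigate the collapse. To address it, they introduce "dispersion loss", a training objective that counteracts condensation by encouraging embeddings to spread out over the unit hypersphere. By promoting uniform angular dispersion, the technique lets small models obtain latent representations closer in quality to those of large models. Experiments show that adding dispersion loss during training effectively mitigates embedding condensation, offering a way to narrow the performance gap between small and large language models without increasing parameter count. The authors conclude that a model's strong performance stems not only from its scale but also from how it organizes information in its latent space.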

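A recent Hacker News discussion focused on the research paper "Dispersion loss counteracts embedding condensation in small language models". The discussion centered on the embedding condensation phenomenon and on dispersion loss as a mitigation strategy, particularly for small language models. Commenters explored the broader implications of how large language models (LLMs) store information and cited the "Physics of Language Models" framework, which estimates that a model retains roughly two bits of factual knowledge per parameter. Participants noted that the research is most likely applicable to small-scale models, in contrast with the massive compute that large language models typically require. The discussion also touched on the relationship between parameter distribution and a model's quantizability, suggesting that a broader distribution allows a model to be compressed more effectively.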

Original text

This paper presents an observation-driven improvement on language model training.

We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models. We then design a training objective called dispersion loss to counteract the effect.

Figure 1. Illustration of the embedding condensation phenomenon. In pre-trained language models, embeddings of all tokens from the same input sequence condense into a narrow cone after being processed by many Transformer layers. This phenomenon is substantially more pronounced in smaller models than in larger models within the same family, which motivates our hypothesis in Section 3.3.

Feature 1: Larger model, less condensation.
Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.

Figure 2. Qualitative and quantitative observations of the embedding condensation phenomenon. a. The cosine similarity heatmaps demonstrate that smaller models (e.g., `GPT2`, `Qwen3-0.6B`) are susceptible to condensation, since token cosine similarities become increasingly positive as the embeddings proceed to deeper layers. In contrast, larger models (e.g., `GPT2-xl`, `Qwen3-32B`) are more resistant to embedding condensation. b. Quantifications using Spearman correlation and Kendall’s Tau demonstrate a consistent trend of “larger model, less condensation” across multiple families of language models. Additional results can be found in Figure S1.
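
One simple way to put a number on the heatmaps described above is the mean pairwise cosine similarity of the token embeddings at a given layer. The sketch below is an illustration only: the function names and the toy 2-D vectors are assumptions, not the paper's code or data, and plain Python lists stand in for real hidden states.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct token pairs in one layer."""
    n = len(embeddings)
    sims = [cosine_similarity(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)

# Toy 2-D "token embeddings" (hypothetical values, not model outputs).
condensed = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]  # vectors inside a narrow cone
dispersed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # vectors spread apart

condensed_score = mean_pairwise_cosine(condensed)  # close to 1
dispersed_score = mean_pairwise_cosine(dispersed)  # below 0
```

A score near 1 means the tokens point in almost the same direction, which is the condensed regime; tracking this score layer by layer would give a curve comparable in spirit to the heatmaps.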

This effect is also quite robust to the choice of input datasets.

Figure S2. The embedding condensation effect is consistent regardless of the input text dataset. Results are shown for four datasets, namely **(a)** `wikitext`, **(b)** `pubmed_qa`, **(c)** `imdb`, and **(d)** `squad`.

Feature 2: Reproducible when controlling for confounders.
To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.
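
To see how varying only the MLP dimension changes model size, the sketch below counts the MLP parameters of a GPT2-like stack. The embedding dimension and depth are taken from GPT2-small as an assumption, and the four MLP widths are hypothetical; the widths used in the study are not stated here.

```python
def mlp_params_per_layer(d_model, d_mlp):
    """Weights and biases of a two-layer Transformer MLP block."""
    up = d_model * d_mlp + d_mlp      # projection d_model -> d_mlp, plus bias
    down = d_mlp * d_model + d_model  # projection d_mlp -> d_model, plus bias
    return up + down

D_MODEL = 768   # assumed: GPT2-small embedding dimension, held fixed
N_LAYERS = 12   # assumed: GPT2-small depth, held fixed
MLP_DIMS = [768, 1536, 3072, 6144]  # hypothetical widths, the only varied factor

# Total MLP parameters across all layers for each width.
mlp_totals = {d_mlp: N_LAYERS * mlp_params_per_layer(D_MODEL, d_mlp)
              for d_mlp in MLP_DIMS}
```

Because attention and embedding parameters do not depend on `d_mlp`, any difference in total size between these variants comes from the MLP blocks alone.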

Figure 3. In a highly controlled experiment, we reproduced the observation of “larger model, less condensation”. We pre-trained four `GPT2`-like models of varying sizes that differ only in MLP dimension, while keeping all other factors fixed, including the number of layers, embedding dimension, dataset, and training configuration. The resulting models exhibit consistent trends in embedding condensation, shown qualitatively (panel a) and quantitatively (panel b). Horizontal dashed lines are added to panel a for easier visual comparison.

Feature 3: Condensation occurs early on.
The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.

Figure 4. Embedding condensation is observed immediately after model initialization. We analyze checkpoints of `Olmo-3-1025-7B` spanning initialization, intermediate pre-training stages, and the final base model. Each checkpoint is annotated by its training stage and the number of training tokens.

Feature 4: Distillation is not a solution.
Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.

Figure 5. Knowledge distillation is not a remedy for embedding condensation, shown qualitatively (panel a) and quantitatively (panel b).

Dispersion loss
Embedding condensation reduces the expressivity of Transformers by collapsing token embedding vectors into narrow cones, under-utilizing the representation space. We hypothesize that by dispersing embeddings during training, smaller models can achieve representational qualities more similar to larger models, thus narrowing the performance gap without increasing the number of parameters.

Figure 6. Illustration of how dispersion loss and its alternative formulations promote embedding dispersion. a. Dispersion loss enforces uniform angular dispersion by spreading out all pairs along the unit hypersphere. b. Decorrelation loss encourages different feature dimensions to remain uncorrelated. c. ℓ₂-repel loss increases pairwise Euclidean distance, while the norm regularization prevents unbounded expansion. d. Orthogonalization loss spreads out vectors forming acute angles while leaving obtuse ones unchanged.
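
The behavior described for panel d can be sketched as a hinge on angular distance, where the distance is arccos of the cosine similarity divided by π, so that ½ means orthogonal. The hinge form `max(0, margin - distance)` and the function names are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math

def angular_distance(u, v):
    """arccos(cosine) / pi: 0 = parallel, 1/2 = orthogonal, 1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi

def orthogonalization_loss(embeddings, margin=0.5):
    """Mean hinge penalty over distinct pairs; zero once a pair is at least orthogonal."""
    n = len(embeddings)
    penalties = [max(0.0, margin - angular_distance(embeddings[i], embeddings[j]))
                 for i in range(n) for j in range(i + 1, n)]
    return sum(penalties) / len(penalties)

acute_pair = [[1.0, 0.0], [1.0, 1.0]]    # 45 degrees apart, distance 0.25
obtuse_pair = [[1.0, 0.0], [-1.0, 1.0]]  # 135 degrees apart, distance 0.75

acute_loss = orthogonalization_loss(acute_pair)    # penalized
obtuse_loss = orthogonalization_loss(obtuse_pair)  # not penalized
```

Only the acute pair contributes to the loss, which matches the caption's description that obtuse pairs are left unchanged.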

Our dispersion loss is inspired by the "Diffuse and Disperse" paper with practical modifications.

Table 1. Our dispersion loss and its alternative formulations. Main implementation differences from Diffuse and Disperse are highlighted in teal and magenta. Including or excluding diagonal terms yields identical gradients and is therefore cosmetic. For dispersion loss and ℓ₂-repel, we adopt the `log-sum-exp` trick for numerical stability, which differs from `log(mean(exp(·)))` only by an additive constant. For ℓ₂-repel, we include a norm regularization term to prevent unbounded expansion of embeddings. For Orthogonalization, the distance margin is fixed to ¹⁄₂ since we use angular distance, where ¹⁄₂ corresponds to orthogonality and thus serves as the ideal margin.
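
The note on `log-sum-exp` versus `log(mean(exp(·)))` can be checked numerically. The sketch below assumes a dispersion loss of the form log-sum-exp over all off-diagonal pairs of negative angular distance divided by a temperature; the `temperature` parameter, the function names, and the toy vectors are illustrative assumptions rather than the paper's implementation.

```python
import math

def angular_distance(u, v):
    """arccos(cosine) / pi: 0 = parallel, 1/2 = orthogonal, 1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm))) / math.pi

def pair_logits(embeddings, temperature=1.0):
    """-distance / temperature for every ordered pair with i != j."""
    n = len(embeddings)
    return [-angular_distance(embeddings[i], embeddings[j]) / temperature
            for i in range(n) for j in range(n) if i != j]

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) using the max-shift trick."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def dispersion_loss(embeddings, temperature=1.0):
    return log_sum_exp(pair_logits(embeddings, temperature))

def dispersion_loss_log_mean(embeddings, temperature=1.0):
    logits = pair_logits(embeddings, temperature)
    return math.log(sum(math.exp(x) for x in logits) / len(logits))

condensed = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]  # hypothetical toy vectors
dispersed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# The two variants differ by log(number of pairs), a constant with zero gradient.
offset = dispersion_loss(condensed) - dispersion_loss_log_mean(condensed)
```

With three vectors there are six ordered pairs, so `offset` equals `log(6)`; the loss is also lower for the dispersed set than for the condensed one, which is the direction a minimizer would push the embeddings.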

Dispersion loss counteracts the embedding condensation effect during mid-training and pre-training. A qualitative result is shown below, while more quantitative results can be found in the paper.

Figure 7. Dispersion loss counteracts the embedding condensation phenomenon. a. Starting from condensed embeddings (gray dashed box), mid-training with the default loss has a limited impact (green box). b. In contrast, mid-training with our dispersion loss as a regularizer substantially mitigates embedding condensation (blue box).

Conclusion
Larger language models are better than smaller ones, but perhaps not merely because they have more parameters. Their advantage can be partially attributed to how they organize information in their latent representations. We hope to see future efforts along this interesting direction.
