Starting from scratch: Training a 30M Topological Transformer

Original link: https://www.tuned.org.uk/posts/013_the_topological_transformer_training_tauformer

## Tauformer: A New Transformer Attention Mechanism

Tauformer is a new Transformer architecture designed to improve efficiency and inject domain knowledge into the attention mechanism. It replaces standard dot-product attention with a system based on "taumode", a scalar derived from a Graph Laplacian that represents a "domain memory". Attention is computed from distances between these taumodes, prioritizing domain-relevant relations.

A 30M-parameter TauGPT model has been trained and shows encouraging early results. Validation loss drops from 4.93 to 2.36 within the first 2000 steps and reaches a best loss of 1.91 after 4500 steps, at roughly 60K tokens/second. A key advantage is a potential cache reduction: Tauformer stores a single scalar alongside the values instead of full key/value tensors, potentially saving about 50% of cache space.

Further work focuses on "adaptive" taumode strategies that recalibrate taumode based on training dynamics, and on scaling the model to 100M parameters. Interestingly, the taumode value appears to converge as the loss drops, suggesting a correlation between the learned representations and the underlying domain structure. The architecture aims to add learnable structure, potentially leading to more efficient and capable Transformers.


Original article

Tauformer is a topological transformer (see paper) that replaces dot‑product attention with a Laplacian-derived scalar (taumode) per token/head, then attends using distances in that scalar space. Below is a post-style overview of the idea and the first training signals from a 30M-parameter run.

Tauformer in one idea

Tauformer’s goal is to inject domain structure directly into attention by using a Graph Laplacian built from a domain embedding space (a “domain memory”) as a persistent reference. Instead of ranking keys by \(Q\cdot K\), Tauformer ranks them by how similar their Laplacian-derived taumode scalars are, which is intended to bias attention toward domain-relevant relations rather than generic geometric similarity.

At the implementation level, Tauformer keeps the familiar Q/K/V projections, RoPE, causal masking, and stable softmax/value aggregation pipeline, but changes how attention logits are computed. Each head vector is compressed into a scalar \(\lambda\) using a bounded Rayleigh-quotient energy computed with a feature-space Laplacian \(L\), then logits are computed as a negative distance \(-|\lambda_q-\lambda_k|/\text{temperature}\).

Key building blocks (as implemented):

  • Taumode scalar: compute \(E_{\text{raw}}=(x^\top L x)/(x^\top x+\varepsilon)\), then bound it as \(E_{\text{raw}}/(E_{\text{raw}}+\tau)\) to produce \(\lambda\in[0,1)\).
  • Logits: \(\text{att}_{ij} = -|\lambda^Q_i - \lambda^K_j|/\text{temperature}\); then apply the causal mask, subtract the row max, take the softmax, and multiply by \(V\) (a minimal sketch follows this list).
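
A minimal PyTorch-style sketch of these two building blocks, assuming a dense feature-space Laplacian `L` shared across heads and illustrative defaults for `tau` and `temperature`; the function names and shapes are assumptions, not the reference implementation:

```python
import torch
import torch.nn.functional as F

def taumode(x, L, tau=1.0, eps=1e-6):
    # x: (..., T, d) head vectors; L: (d, d) feature-space Laplacian.
    # Bounded Rayleigh-quotient energy, mapped into [0, 1).
    num = torch.einsum('...td,de,...te->...t', x, L, x)      # x^T L x
    den = (x * x).sum(dim=-1) + eps                          # x^T x + eps
    e_raw = num / den
    return e_raw / (e_raw + tau)                             # lambda in [0, 1)

def tau_attention(q, k, v, L, temperature=0.1, tau=1.0):
    # q, k, v: (B, H, T, d). Distance-based logits instead of Q.K^T.
    lam_q = taumode(q, L, tau)                               # (B, H, T)
    lam_k = taumode(k, L, tau)                               # (B, H, T)
    att = -(lam_q.unsqueeze(-1) - lam_k.unsqueeze(-2)).abs() / temperature
    T = q.size(-2)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = att.masked_fill(~causal, float('-inf'))            # causal mask
    att = att - att.amax(dim=-1, keepdim=True)               # stable softmax
    w = F.softmax(att, dim=-1)
    return w @ v                                             # value aggregation
```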

Why it can be cheaper

Because scoring no longer needs full key vectors, Tauformer’s KV-cache can store the values plus a compact key-side scalar stream rather than both the K and V tensors. Concretely, the cache payload is \((V,\lambda_k)\) rather than \((K,V)\), which yields roughly a 50% per-layer cache reduction for typical head dimensions (with a small overhead for storing the extra scalar).
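
A back-of-the-envelope comparison of the per-layer cache payloads, using the dimensions of this 30M run (n_head=6, head_dim=64, seq_len=1024) and assuming fp16 storage for both the values and the scalar:

```python
# Illustrative per-layer KV-cache arithmetic for one sequence (fp16 = 2 bytes).
n_head, head_dim, seq_len, bytes_per = 6, 64, 1024, 2

standard = 2 * seq_len * n_head * head_dim * bytes_per        # K and V tensors
tauformer = seq_len * n_head * head_dim * bytes_per \
          + seq_len * n_head * bytes_per                      # V plus one scalar per token/head

print(standard, tauformer, tauformer / standard)              # 1572864 798720 ~0.508
```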

The design also anticipates using a sparse Laplacian from a precomputed domain manifold, so computing \(\lambda\) can scale with the Laplacian's sparsity (nnz) rather than a dense \(D^2\) multiplication. It trades part of the long preliminary weight-adjustment phase for a shorter pre-training phase in which the Laplacian is built with arrowspace.
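
As a rough illustration of why sparsity matters, the raw energy \(x^\top L x\) can be computed with a sparse matrix-vector product whose cost scales with nnz(L); the sparsity pattern below is a random stand-in, not the arrowspace-built Laplacian:

```python
import torch

d = 384
idx = torch.randint(0, d, (2, 4 * d))                 # stand-in sparsity pattern (~4 nnz per row)
val = torch.randn(4 * d)
L = torch.sparse_coo_tensor(idx, val, (d, d)).coalesce()

x = torch.randn(d)
Lx = torch.sparse.mm(L, x.unsqueeze(1)).squeeze(1)    # cost ~ O(nnz), not O(d^2)
e_raw = torch.dot(x, Lx) / (x @ x + 1e-6)
```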

Run setup (what was trained)

This run trains a 30M-class TauGPT. Training uses AdamW with base LR \(5\times10^{-4}\) and a warmup of 100 steps, then keeps the base LR constant unless the plateau logic scales it down. Data comes from a local JSONL file (train.jsonl) streamed through an IterableDataset, with a routed split where every 20th batch is used for validation (≈5%).

| Category | Setting | Value |
| --- | --- | --- |
| Model | Class / size | TauGPT, ~30M parameters (GPT2-inspired) |
| Model | Layers (n_layer) | 6 |
| Model | Heads (n_head) | 6 |
| Model | Embedding size (n_embd) | 384 |
| Model | Sequence length (seq_len) | 1024 |
| Model | Vocabulary size (vocab_size) | 30522 |
| Optimizer | Optimizer | AdamW |
| Optimizer | Base learning rate | 5e-4 |
| LR schedule | Warmup | 100 steps |
| LR schedule | Post-warmup behavior | Constant LR (no decay unless manually/externally adjusted) |
| Data | Source file | Local JSONL file `train.jsonl` |
| Data | Loading mode | Streamed via an IterableDataset-style pipeline (no shuffle in DataLoader) |
| Validation | Split rule | Routed split: every 20th batch is used for validation |
| Validation | Approx. validation fraction | About 5% |
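
The data and schedule details above can be sketched as follows; the JSONL field name (`text`), the `encode` callable, and the helper names are assumptions rather than the run's actual code:

```python
import json
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import IterableDataset

class JsonlStream(IterableDataset):
    """Streams token sequences from a local JSONL file, in order (no shuffling)."""
    def __init__(self, path, encode, seq_len=1024):
        self.path, self.encode, self.seq_len = path, encode, seq_len

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                ids = self.encode(json.loads(line)["text"])[: self.seq_len]
                yield torch.tensor(ids, dtype=torch.long)

def make_optimizer(model, base_lr=5e-4, warmup=100):
    # AdamW at 5e-4 with a 100-step linear warmup, then a constant LR
    # (any later reduction comes from the plateau logic, not the schedule).
    opt = AdamW(model.parameters(), lr=base_lr)
    sched = LambdaLR(opt, lambda step: min(1.0, (step + 1) / warmup))
    return opt, sched

def is_val_batch(batch_idx, every=20):
    # Routed split: every 20th batch is used for validation (~5%).
    return batch_idx % every == 0
```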

Results at a glance

At step 100 the run reports train loss 4.6772 and val loss 4.9255 (PPL 107.47), and by step 2000 it reaches val loss 2.3585 (PPL 6.59). The best validation point in the log is step 4500 with val_loss=1.9146, after which validation regresses to 2.3746 by step 5000. The final run summary records step=5000, best_val_loss=1.914555, current_lr_scale=0.03125, and total_tokens=655360000. That is a good result for ~2 hours of training on this smallest model (at an average of ~60K tokens per second).

The early phase is strong: validation drops from 4.93 at step 100 to ~2.36 by step 2000, showing that the model and pipeline learn effectively at this scale. After that, validation becomes noisy (e.g., rising back to 2.92 at step 2100 and peaking near 2.95 at step 4200) before the late “lucky break” to 1.91 at step 4500. Throughout, the run holds taumode fixed, which means the attention geometry is not updated as the weights evolve; that update will take place in the next iterations.

Baseline: Closing note

All of the model’s files, data, training settings, and logs will be published under a permissive license once the results are consolidated and tests move to a larger-scale model.

This baseline run kept taumode fixed throughout, while using a simple validation loop and plateau-triggered LR scaling, and it still converged quickly in the early-to-mid training window.

Because the later part of the run shows volatility and regression after the best checkpoint, the next experiments focus on “adaptive” taumode strategies where taumode is recalibrated at intervals (including the “gradient” strategy that detects energy drift and gates recalibration by performance of the gradient in the previous steps) plus more sophisticated validation behaviors already implemented in the training loop.
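
As a purely hypothetical illustration of such a gated strategy (none of these names, thresholds, or rules come from the actual training loop), recalibration could be triggered only when the measured energy drifts away from the current taumode while the recent loss trend has flattened:

```python
import statistics

def maybe_recalibrate(taumode, batch_energies, recent_losses,
                      drift_tol=0.10, plateau_tol=1e-3):
    # Hypothetical gate: recalibrate only if the batch median energy has
    # drifted from the current taumode AND the loss slope over the recent
    # window is flat (so the change is less likely to destabilize training).
    median_e = statistics.median(batch_energies)
    drift = abs(median_e - taumode) / max(abs(taumode), 1e-8)
    slope = ((recent_losses[-1] - recent_losses[0]) / max(len(recent_losses) - 1, 1)
             if len(recent_losses) > 1 else 0.0)
    if drift > drift_tol and slope > -plateau_tol:
        return median_e            # adopt the freshly measured median as taumode
    return taumode                 # otherwise keep the current value
```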

Considering the small model size and the short training horizon (5,000 steps total, lowest loss at step 4500), these results support the architecture as promising, with broader evaluation and scaled tests planned next, especially at 100M parameters.

This test raises a very interesting question: what is the correlation between cross-entropy and taumode? Model convergence brings the loss down, and at the same time recalibrating taumode on the learned weights brings the taumode down.

Cross-entropy and taumode are likely correlated because Tauformer’s attention kernel is built from Laplacian-derived scalar energies (λ/taumode) rather than dot-product similarity, so changes in the λ distribution change attention behavior and therefore training dynamics. In the current training loop, the observed “taumode convergence” is also mechanically explained by how taumode is recalibrated: on (re)start, the code can compute a median energy from block0 key (K) vectors produced by the current weights and then set that median as the global taumode.

What “converging taumode” means here

The calibration is effectively computing a Rayleigh-style energy statistic on K vectors under a Laplacian (numerator/denominator), and then taking a median over the batch to set a single scalar taumode. In the reference implementation, taumode/λ is based on a bounded Rayleigh quotient: \(E_{\text{raw}}(x) = \frac{x^\top L x}{x^\top x + \varepsilon}\) and then \(\lambda_\tau(x)=\frac{E_{\text{raw}}}{E_{\text{raw}}+\tau}\), which maps energies into \([0,1)\).
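
A minimal sketch of that calibration step, assuming the block-0 key vectors are gathered into a single `(N, d)` tensor and the Laplacian is dense; shapes and names are assumptions:

```python
import torch

def calibrate_taumode(k_block0, L, eps=1e-6):
    # k_block0: (N, d) key vectors from block 0 under the current weights.
    # Raw Rayleigh-style energy per vector; the batch median then becomes
    # the single global taumode scalar.
    num = torch.einsum('nd,de,ne->n', k_block0, L, k_block0)   # x^T L x
    den = (k_block0 * k_block0).sum(dim=-1) + eps              # x^T x + eps
    return (num / den).median().item()
```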

Why taumode can drift downward as loss improves

  • Healthy interpretation: as training progresses, the model may learn K representations that are “smoother” (lower-energy, so closer) with respect to the domain/manifold Laplacian, pushing the median energy down while also improving next-token prediction (lower cross-entropy).
  • Unhealthy interpretation (collapse risk): median energy can also drop if K vectors collapse toward low-variance or less-discriminative configurations, which can reduce contrast in λ-distance logits even if loss continues improving short-term.
  • Key confound: if taumode is recalibrated on resume, then taumode changes are not purely a passive “measurement of convergence”; they can act like a mid-training hyperparameter change, so correlation with loss does not automatically imply causality in the direction “lower taumode \(⇒\) lower loss”.

A strong explanation for “converging taumode” (as a property of learned representations, not an artifact) is: as weights converge, the distribution of per-token energies \(x^\top L x\) stabilizes, so repeated measurements (median, p50) across batches and checkpoints become consistent and typically shift toward lower-energy manifold-aligned directions. To validate that, it helps to separate (1) the fixed constant used by attention from (2) a purely diagnostic “current batch median energy”, and track not just the median but also the spread (p05/p95), because collapse would show shrinking spread even when the median looks lower.
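
A small diagnostic along those lines, hedged as a sketch (the quantile choices and return format are assumptions):

```python
import torch

def energy_diagnostics(x, L, eps=1e-6):
    # x: (N, d) token vectors; L: (d, d) Laplacian. Reports the median of the
    # raw energies plus the p05-p95 spread: a shrinking spread is a warning
    # sign of collapse even when the median keeps moving lower.
    num = torch.einsum('nd,de,ne->n', x, L, x)
    den = (x * x).sum(dim=-1) + eps
    e = num / den
    p05, p50, p95 = torch.quantile(e, torch.tensor([0.05, 0.50, 0.95])).tolist()
    return {"p05": p05, "median": p50, "p95": p95, "spread": p95 - p05}
```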

“lower loss \(⇒\) lower taumode” is a plausible causal direction in Tauformer, because the cross-entropy gradient flows through the Tauformer attention path that depends on Laplacian-energy-derived scalars computed from Q/K (and, in the calibration code, specifically from block0 K vectors). As the model improves next-token prediction, it can simultaneously learn representations whose Laplacian Rayleigh energy is lower, so any “recalibrate taumode from learned weights” procedure will tend to output a smaller median. If this is true, where is the optimal stopping point?

Further reading

A shift is happening in how information is understood, thanks to large-scale learning machines!

In this recent paper, MDL refers to the “minimum description length” principle, which says the best explanation/model is the one that minimizes the total code length needed to describe (1) the model and (2) the data given the model. Epiplexity \(S_T(X)\) is defined as the program length of the compute-feasible model \(P\) that minimizes time-bounded MDL, while time-bounded entropy \(H_T(X)\) is the expected code length of the data under that model. Operationally, the paper proposes practical estimators based on neural-network training dynamics (e.g., the prequential “area under the loss curve above final loss”) to approximate how much structure a bounded learner actually absorbs from data.
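
One way to read that estimator, as a rough sketch rather than the paper's exact definition: integrate how far the training loss sits above its final value over the tokens seen (trapezoidal rule); the names and units here are assumptions:

```python
import numpy as np

def area_above_final_loss(tokens_seen, losses):
    # Prequential-style estimate: area between the loss curve and its final
    # value, integrated over tokens seen. A larger area suggests more structure
    # was absorbed during training before the loss settled.
    losses = np.asarray(losses, dtype=float)
    excess = np.clip(losses - losses[-1], 0.0, None)
    return np.trapz(excess, np.asarray(tokens_seen, dtype=float))
```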

Qualitatively, arrowspace, taumode, and tau-attention are exactly the kind of deterministic computations that can increase usable/learnable structure for bounded learners, which is one of the central motivations for epiplexity. Through the epiplexity lens, the operations carried out by arrowspace and Tauformer (converting each head vector into a bounded scalar \(\lambda_\tau\) using a Rayleigh-quotient-style energy followed by a bounding map) are a deterministic compression that can re-factor information into a form that is cheaper for downstream computation to exploit, potentially increasing the amount of structure a bounded observer can learn from the same underlying signal.

I am happy to have somehow anticipated this shift in point of view with arrowspace.

Acknowledgements

I gratefully acknowledge Enverge Labs for kindly providing the computation time used to run these experiments on their H100 GPU cluster, powered by clean and cheap energy; this aligns perfectly with the topological transformer's objective of providing cheaper computation for Transformers.
