Llama-70B: 224× Compression with Higher Accuracy (Paper and Code)
Post-transformer inference: 224× compression of Llama-70B with improved accuracy

Original link: https://zenodo.org/records/17873275

This work proposes a new method that **removes the transformer from inference entirely** without sacrificing, and in some cases *improving*, accuracy. The key is to extract a low-dimensional (256D) "meaning field" from a frozen large language model (Llama-3.3-70B) that captures task-relevant semantics. A lightweight compressor (AN1) further shrinks this field while improving classification accuracy by an average of 1.81 percentage points. A small 30M-parameter model then learns to *reconstruct* the field directly from raw text, enabling **60× faster inference** with minimal accuracy loss (0.35 pp). The core finding is that transformer models encode task-relevant information in a surprisingly low-rank structure. This makes it possible to build **Field Processing Units (FPUs)**, a new compute primitive that replaces deep matrix multiplication with simpler field operations. The work ships a fully reproducible baseline implementation, paving the way for future research on efficient, transformer-free inference.
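The release does not spell out the extraction step. As a minimal sketch, one could imagine mean-pooling a few anchor layers and projecting the result down to 256 dimensions with a PCA-style map; everything below (the pooling choice, the projection, the toy sizes) is an assumption standing in for the actual AN1 pipeline:

```python
import numpy as np

# Toy sizes: Llama-3.3-70B uses hidden size 8192; shrunk here so the sketch
# runs instantly on random data.
HIDDEN = 512       # stand-in for 8192
N_LAYERS = 7       # the paper's seven anchor layers
FIELD_DIM = 256    # target field dimensionality

def pool_layers(hidden_states):
    """hidden_states: list of N_LAYERS arrays, each (seq_len, HIDDEN).
    Mean-pool over tokens and concatenate layers -> (N_LAYERS * HIDDEN,)."""
    return np.concatenate([h.mean(axis=0) for h in hidden_states])

def fit_field_projection(pooled, dim=FIELD_DIM):
    """pooled: (n_examples, N_LAYERS * HIDDEN). Fit a PCA-style projection
    onto the top `dim` principal directions (an assumed stand-in for AN1)."""
    centered = pooled - pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # right singular vectors
    return vt[:dim]                                          # (dim, N_LAYERS * HIDDEN)

# Usage with random activations standing in for real Llama layers.
rng = np.random.default_rng(0)
pooled = np.stack([
    pool_layers([rng.normal(size=(16, HIDDEN)) for _ in range(N_LAYERS)])
    for _ in range(300)])            # 300 toy examples
proj = fit_field_projection(pooled)
fields = pooled @ proj.T             # (300, 256) meaning fields
```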

## Post-Transformer Inference with Extreme Compression

A researcher (anima-core) shared a new approach to drastically compressing large language models such as Llama-70B. The technique distills the model into a 256-dimensional "meaning field" representation, a **224× compression**, while *improving* accuracy on some benchmarks.

The core idea is to replace full transformer inference with a small "student" model that learns to generate these meaning fields directly from text, effectively removing the transformer from the inference path. A reference implementation is available on GitHub, but it does not include the production-grade optimizations.

While the paper was praised for its clear notation, commenters raised reproducibility concerns: the architectural details behind peak performance are proprietary, and some citations and figures in the paper are broken. Further work is needed to compare the method against standard knowledge distillation and against smaller models trained end to end. The author is seeking feedback and replication attempts from the community.

## Original Text

This paper introduces the first verified method to eliminate transformers from inference while preserving, and in many cases improving, downstream accuracy.

We show that a frozen 70-billion-parameter Llama-3.3-70B model can be replaced by a 256-dimensional meaning field extracted from seven internal activation layers. A lightweight compressor (AN1) reduces these fields by 224× with an average +1.81 percentage point gain across classification tasks, including +3.25 pp on low-resource RTE (R² = 0.98 inverse-scaling fit, p < 0.01). A 30M-parameter student then learns to regenerate these fields directly from raw text, enabling full transformer-free inference at 60× higher throughput with only 0.35 pp average accuracy loss.
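The student's training signal is not detailed in the abstract. A minimal sketch, assuming the student simply regresses the teacher-derived 256-D field with an MSE loss; the encoder below is a placeholder bag-of-embeddings model, not the paper's 30M-parameter architecture:

```python
import torch
import torch.nn as nn

FIELD_DIM, VOCAB, EMB = 256, 32000, 384

class FieldStudent(nn.Module):
    """Hypothetical stand-in for the 30M-parameter student: pools token
    embeddings and maps them to a 256-D field."""
    def __init__(self):
        super().__init__()
        self.emb = nn.EmbeddingBag(VOCAB, EMB)   # mean-pools token embeddings
        self.head = nn.Sequential(
            nn.Linear(EMB, 512), nn.GELU(), nn.Linear(512, FIELD_DIM))

    def forward(self, token_ids, offsets):
        return self.head(self.emb(token_ids, offsets))

student = FieldStudent()
opt = torch.optim.AdamW(student.parameters(), lr=3e-4)

# One training step on toy data. teacher_fields would come from the frozen
# Llama + AN1 pipeline; here they are random placeholders.
token_ids = torch.randint(0, VOCAB, (64,))
offsets = torch.tensor([0, 20, 45])              # 3 variable-length texts
teacher_fields = torch.randn(3, FIELD_DIM)

pred = student(token_ids, offsets)
loss = nn.functional.mse_loss(pred, teacher_fields)  # field regression loss
loss.backward()
opt.step()
opt.zero_grad()
```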

The core insight is that task-aligned semantics in modern transformers occupy a remarkably low-rank manifold. Across layers we observe 72–99 percent of variance in the top one to three dimensions. Once this structure is extracted and learned, the transformer becomes unnecessary. It serves as a one-time sculptor of meaning rather than the permanent home of inference.
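The low-rank claim is straightforward to test on any set of pooled activations. A sketch of the check, using random data in place of real Llama activations (on which the paper reports 72–99 percent of variance in the top one to three dimensions):

```python
import numpy as np

def variance_in_top_k(acts, k):
    """acts: (n_examples, hidden). Fraction of variance captured by the
    top-k principal directions."""
    centered = acts - acts.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values
    var = s ** 2
    return var[:k].sum() / var.sum()

acts = np.random.default_rng(0).normal(size=(200, 512))
for k in (1, 2, 3):
    print(k, variance_in_top_k(acts, k))
```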

This work establishes Field Processing Units (FPUs) as a post-transformer compute primitive that replaces deep matrix multiplication with shallow field operations.
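The release does not define FPU internals. As a purely hypothetical illustration of what a "shallow field operation" might look like, consider classifying a 256-D field by cosine similarity against per-class prototype fields: one small matrix-vector product instead of a deep transformer stack:

```python
import numpy as np

def classify_field(field, prototypes):
    """field: (256,); prototypes: (n_classes, 256). Returns class index by
    cosine similarity -- a shallow, cheap field operation."""
    f = field / np.linalg.norm(field)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(p @ f))

rng = np.random.default_rng(1)
protos = rng.normal(size=(4, 256))      # 4 hypothetical class prototypes
print(classify_field(rng.normal(size=256), protos))
```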

All results are averaged over five seeds with statistical significance reported. Ablations isolate the causal contributions of field supervision, geometric regularization, and anchor-layer selection.
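A sketch of that reporting protocol, assuming a paired t-test across seeds; the accuracy values below are placeholders, not the paper's numbers:

```python
from statistics import mean
from scipy import stats

baseline = [71.2, 70.8, 71.5, 70.9, 71.1]   # per-seed baseline accuracy (toy)
field    = [73.0, 72.6, 73.3, 72.8, 72.9]   # per-seed AN1-field accuracy (toy)

t, p = stats.ttest_rel(field, baseline)     # paired t-test across seeds
print(f"mean gain: {mean(field) - mean(baseline):+.2f} pp, p = {p:.4f}")
```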

This Zenodo release provides the complete scientific manuscript and the baseline reference implementation for the AN1 Core system. Proprietary optimizations (AN1-Turbo) have been removed to support independent verification and further research into post-transformer inference.
