Zebra-Llama: Towards Efficient Hybrid Models

Original link: https://arxiv.org/abs/2505.17272

## Zebra-Llama: Efficient Hybrid Language Models

This paper introduces Zebra-Llama, a new approach that builds efficient large language models (LLMs) by *composing* existing pre-trained models rather than retraining from scratch at great cost. Zebra-Llama combines State Space Model (SSM) and Multi-head Latent Attention (MLA) layers to create hybrid models at 1B, 3B, and 8B parameters, transferring knowledge from larger Transformer models with minimal training: only 7-11 billion tokens, versus the trillions required for pre-training.

The main advantages are sharp reductions in training cost and memory use. Zebra-Llama shrinks the KV cache dramatically (to 2-3.9% of its original size) while maintaining high accuracy, often *surpassing* existing efficient models such as MambaInLLaMA and Minitron. In particular, the 8B variant exceeds Minitron-8B in few-shot accuracy by 7% while using far less training data and memory, and it delivers 2.6-3.8x higher throughput. Code and model checkpoints will be released upon acceptance.
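To make the KV-cache claim concrete, here is a rough back-of-the-envelope sketch of why caching a small per-token latent (the MLA idea) instead of full per-head keys and values, and replacing most attention layers with constant-state SSM layers, brings decoding memory down to a few percent of a dense Transformer's. All layer counts and dimensions below are hypothetical illustrations, not Zebra-Llama's actual configuration; they are chosen only so the ratio lands in the same ballpark as the reported 2-3.9%.

```python
# Illustrative only: rough KV-cache accounting for a dense multi-head
# attention (MHA) stack vs. a hybrid stack where only a few layers remain
# attention-like and each of those caches a single low-rank latent per token.
# All dimensions are hypothetical, not taken from the Zebra-Llama paper.

def mha_kv_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    """Full K and V cached for every head in every attention layer."""
    return n_layers * seq_len * 2 * n_heads * head_dim * bytes_per_elem

def mla_kv_bytes(n_layers, latent_dim, seq_len, bytes_per_elem=2):
    """MLA-style layers cache one compressed latent per token instead of per-head K/V."""
    return n_layers * seq_len * latent_dim * bytes_per_elem

if __name__ == "__main__":
    seq_len = 32_768  # long-context decoding
    dense = mha_kv_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=seq_len)
    # Suppose only a handful of layers stay attention-like (the rest are SSMs,
    # whose recurrent state is constant in sequence length and not counted here).
    hybrid = mla_kv_bytes(n_layers=8, latent_dim=512, seq_len=seq_len)
    print(f"dense KV cache : {dense / 2**30:.2f} GiB")
    print(f"hybrid KV cache: {hybrid / 2**30:.2f} GiB "
          f"({100 * hybrid / dense:.1f}% of dense)")
```

With these made-up numbers the hybrid cache comes out to roughly 1-2% of the dense one; the SSM layers contribute no per-token cache at all, which is why the savings compound with context length.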


Original article

[Submitted on 22 May 2025]

Zebra-Llama: Towards Extremely Efficient Hybrid Models, by Mingyu Yang and 4 other authors

Abstract: With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size (down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively) while preserving 100%, 100%, and >97% of average zero-shot performance on LM Harness tasks. Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B). It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length. We will release code and model checkpoints upon acceptance.
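As a reading aid for the composition idea the abstract describes, the following is a minimal, self-contained PyTorch sketch (not the authors' code): most layers are constant-state SSM-style blocks, and an occasional attention layer uses an MLA-style low-rank latent for K/V, so only that small latent would need to be cached at inference. The block internals, layer ratio, and dimensions are simplified assumptions; the paper's actual architecture, teacher-based initialization, and post-training pipeline are not reproduced here. Requires PyTorch 2.x for `scaled_dot_product_attention`.

```python
# Minimal sketch of interleaving SSM-style and MLA-style blocks.
# Everything here is a simplified stand-in, not the Zebra-Llama implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSSMBlock(nn.Module):
    """Toy gated linear recurrence standing in for a Mamba-style SSM layer."""
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.full((d_model,), -2.0))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)            # per-channel decay in (0, 1)
        state = torch.zeros_like(u[:, 0])        # O(1) state, no KV cache
        outs = []
        for t in range(u.size(1)):
            state = a * state + (1 - a) * u[:, t]
            outs.append(state)
        h = torch.stack(outs, dim=1) * F.silu(gate)
        return x + self.out_proj(h)

class SimpleMLABlock(nn.Module):
    """Toy multi-head latent attention: K/V are reconstructed from a small
    per-token latent, so only the latent would be cached at inference."""
    def __init__(self, d_model, n_heads, latent_dim):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, latent_dim)   # cache this latent only
        self.kv_up = nn.Linear(latent_dim, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        latent = self.kv_down(x)                        # (b, s, latent_dim)
        k, v = self.kv_up(latent).chunk(2, dim=-1)
        k = k.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return x + self.out_proj(attn.transpose(1, 2).reshape(b, s, -1))

def build_hybrid_stack(d_model=512, n_heads=8, latent_dim=64,
                       n_layers=12, attn_every=4):
    """Interleave SSM blocks with an occasional MLA block (illustrative ratio)."""
    layers = []
    for i in range(n_layers):
        if (i + 1) % attn_every == 0:
            layers.append(SimpleMLABlock(d_model, n_heads, latent_dim))
        else:
            layers.append(SimpleSSMBlock(d_model))
    return nn.Sequential(*layers)

if __name__ == "__main__":
    model = build_hybrid_stack()
    x = torch.randn(2, 16, 512)
    print(model(x).shape)                               # torch.Size([2, 16, 512])
```

The sketch only shows the structural interleaving; per the abstract, the hybrid layers are additionally initialized from pre-trained Transformer weights and refined with a post-training pipeline against an 8B teacher, which is what keeps the training budget down to 7-11B tokens.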
From: Mehdi Rezagholizadeh
[v1] Thu, 22 May 2025 20:39:57 UTC (12,646 KB)