Accelerating Gemma 4: faster inference with multi-token prediction drafters

Original link: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/

## Speculative decoding: faster LLM inference

Standard LLM inference is bottlenecked by moving data between memory and the compute units. **Speculative decoding** addresses this by separating *generation* from *verification*: a fast, lightweight "drafter" model (such as an MTP model) predicts several tokens at once, while the stronger "target" model (e.g., Gemma 4) verifies them in parallel. This lets the system emit multiple tokens, *plus* one token from the target model, in the time it would traditionally take to generate a single token. The technique can noticeably improve responsiveness for applications such as chatbots and agents, makes it feasible to run larger models on consumer hardware, and boosts on-device performance by cutting processing time and preserving battery life, all **without sacrificing the accuracy or reasoning ability of the primary target model**. In short, it unlocks faster AI across platforms, from edge devices to workstations.

## Gemma 4 inference speed improvements

A recent Google blog post details progress on Gemma 4, focusing on faster inference through "multi-token prediction." The technique delivers a substantial speedup, potentially exceeding 100 tokens per second. Discussion on Hacker News shows users eager to try it with tools like LM Studio, but support is currently limited: it is not yet available in popular backends such as `mlx` or `llama.cpp`. Users note that the speedup is especially striking, with some reporting a clearly perceptible difference when comparing Gemma against Qwen models. There is also debate over why Google does not promote cloud inference for Gemma more aggressively, given that it is available on Vertex and Gemini; some speculate that hosting a commercial stack for such small models may not be cost-effective and could even affect pricing relative to other offerings. Finally, the thread includes a Y Combinator Summer 2026 application announcement.

Original article

Why speculative decoding?

The technical reality is that standard LLM inference is memory-bandwidth bound, creating a significant latency bottleneck. The processor spends the majority of its time moving billions of parameters from VRAM to the compute units just to generate a single token. This leads to under-utilized compute and high latency, especially on consumer-grade hardware.
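As a rough back-of-the-envelope illustration (the figures below are assumptions for the sketch, not numbers from the post), the bandwidth ceiling can be estimated by dividing the bytes of weights streamed per decode step by the available memory bandwidth:

```python
# Rough roofline estimate of decode speed for a memory-bandwidth-bound LLM.
# All numbers are illustrative assumptions, not measurements from the article.

params = 31e9          # a hypothetical 31B-parameter dense model
bytes_per_param = 2    # bf16 weights
bandwidth = 1.0e12     # assumed ~1 TB/s of GPU memory bandwidth

weight_bytes = params * bytes_per_param        # ~62 GB streamed per decode step
seconds_per_token = weight_bytes / bandwidth   # every token re-reads all weights
print(f"lower bound: {seconds_per_token * 1e3:.0f} ms/token, "
      f"ceiling: {1 / seconds_per_token:.0f} tokens/s")
# -> roughly 62 ms/token, i.e. about 16 tokens/s, no matter how fast the ALUs are
```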

Speculative decoding decouples token generation from verification. By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to “predict” several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel.
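For concreteness, here is a minimal sketch of this target/drafter pairing using Hugging Face transformers' assisted-generation path, which implements speculative decoding with a separate small model as the drafter (rather than the MTP head described in the post). The Gemma 2 checkpoint IDs are stand-ins, since the post does not name the Gemma 4 or MTP drafter checkpoints:

```python
# Sketch: speculative decoding via transformers' assisted generation.
# The checkpoint pairing below is an assumption used for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-2-9b-it"   # stand-in for the heavy target model
drafter_id = "google/gemma-2-2b-it"  # stand-in for the lightweight drafter

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(drafter_id, device_map="auto")

inputs = tokenizer("Actions speak louder than", return_tensors="pt").to(target.device)
# assistant_model enables speculative decoding: the drafter proposes tokens,
# the target verifies them in parallel, so the output matches target-only decoding.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```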

How speculative decoding works

Standard large language models generate text autoregressively, producing exactly one token at a time. While effective, this process dedicates the same amount of computation to predicting an obvious continuation (like predicting “words” after “Actions speak louder than…”) as it does to solving a complex logic puzzle.

MTP mitigates this inefficiency through speculative decoding, a technique introduced by Google researchers in Fast Inference from Transformers via Speculative Decoding. The lightweight drafter speculates several tokens ahead, and the target model then checks the whole draft at once. If the target model agrees with the draft, it accepts the entire sequence in a single forward pass, and even generates an additional token of its own in the process. This means your application can output the full drafted sequence plus one token in the time it usually takes to generate a single one.
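To make the accept-plus-bonus rule concrete, here is a minimal, model-free sketch of one greedy speculative decoding step. The helpers `drafter_next` and `target_next` are hypothetical stand-ins for the two models; each maps a token sequence to its greedy next-token prediction:

```python
# Minimal sketch of the greedy accept/verify rule in speculative decoding.
def speculative_step(prefix, drafter_next, target_next, k=4):
    # 1) The drafter cheaply proposes k future tokens, one after another.
    draft = []
    for _ in range(k):
        draft.append(drafter_next(prefix + draft))

    # 2) The target scores every draft position in a single (conceptually
    #    parallel) pass; simulated here with per-position greedy predictions.
    target_preds = [target_next(prefix + draft[:i]) for i in range(k + 1)]

    # 3) Accept the longest prefix of the draft the target agrees with,
    #    then append the target's own next token as a "bonus".
    accepted = []
    for i in range(k):
        if draft[i] == target_preds[i]:
            accepted.append(draft[i])
        else:
            break
    accepted.append(target_preds[len(accepted)])  # bonus token from the target
    return prefix + accepted                      # 1 to k+1 new tokens per step

# Toy usage: the drafter always predicts "a"; the target agrees twice, then says "b".
drafter = lambda seq: "a"
target = lambda seq: "a" if len(seq) < 5 else "b"
print(speculative_step(list("abc"), drafter, target))  # ['a','b','c','a','a','b']
```

Because the target re-checks every drafted position, the accepted output is exactly what target-only greedy decoding would have produced, which is why the approach loses no quality.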

Unlocking faster AI from the edge to the workstation

For developers, inference speed is often the primary bottleneck for production deployment. Whether you are building coding assistants, autonomous agents that require rapid multi-step planning, or responsive mobile applications running entirely on-device, every millisecond matters.

By pairing a Gemma 4 model with its corresponding drafter, developers can achieve:

  • Improved responsiveness: Drastically reduce latency for near real-time chat, immersive voice applications and agentic workflows.
  • Supercharged local development: Run our 26B MoE and 31B Dense models on personal computers and consumer GPUs with unprecedented speed, powering seamless, complex offline coding and agentic workflows.
  • Enhanced on-device performance: Maximize the utility of our E2B and E4B models on edge devices by generating outputs faster, which in turn preserves valuable battery life.
  • Zero quality degradation: Because the primary Gemma 4 model retains the final verification, you get identical frontier-class reasoning and accuracy, just delivered significantly faster.