Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

Original link: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

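Two months after the release of Gemma 4, recent updates include Multi-Token Prediction (MTP) for faster inference and a new 12B model. To improve accessibility further, the team has now released new checkpoints optimized with Quantization-Aware Training (QAT). Unlike standard Post-Training Quantization (PTQ), which can degrade model quality, QAT integrates quantization directly into the training process. This significantly reduces the memory footprint while preserving performance, allowing Gemma 4 models to run efficiently on consumer GPUs and edge devices. The release includes support for the standard Q4_0 format and a new mobile-optimized scheme that reduces the memory footprint of the Gemma 4 E2B model to just 1GB. These improvements let developers use Gemma 4's quality and capabilities on hardware with limited VRAM and storage, making high-performance AI models more portable than ever.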

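Google recently released Quantization-Aware Training (QAT) models for the Gemma 4 family, aimed at improving efficiency on mobile and laptop hardware. Although the release provides useful specifications (for example, the 12B model requires 6.7GB of VRAM), it has drawn criticism in the developer community for being overly fragmented. Users on Hacker News pointed to a chaotic release cadence, with numerous model variants (such as base, assistant, MTP, and QAT versions) arriving in quick succession. This complexity creates significant friction for downstream maintainers, particularly around *llama.cpp* compatibility and unclear naming conventions (such as the "E" designation). Developers reported difficulty tracking which model versions work with which hardware or software, and noted that the official blog sometimes posts misleading information about the availability of GGUF files. Despite the models' technical advantages for on-device performance, the prevailing view is that Google's irregular release strategy and missing documentation are placing an unnecessary burden on the local LLM ecosystem.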

Original text

Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to accelerate inference, and just a couple of days ago, we released a 12B model to bridge the gap between our E4B and 26B MOE models.

Today, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, so you can run models locally on everyday edge devices and consumer GPUs.

By simulating quantization during training, QAT minimizes quality loss when the model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format as well as a novel quantization format specialized for mobile use cases. Using this mobile format, we’ve reduced the memory footprint of Gemma 4 E2B to 1GB. Together, these dramatically reduce memory requirements while preserving the capabilities and quality you expect from Gemma 4.

Keeping model quality while making them smaller

Quantization is a key technique for running models on consumer hardware: it reduces their memory footprint and also accelerates decode speed. However, standard Post-Training Quantization (PTQ) often leads to some quality degradation. Instead of simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ is already effective at preserving most of a model's quality, our QAT results yield even higher overall quality compared to standard PTQ baselines.
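
To make "integrating quantization into training" concrete, here is a minimal sketch of one common way QAT is implemented: fake quantization in the forward pass combined with a straight-through estimator (STE) in the backward pass. The post does not describe the actual Gemma 4 recipe, so everything here is an assumption for illustration: the helper names `fake_quantize`, `ste_grad`, and `qat_step` are hypothetical, the grid is a plain symmetric 4-bit one, and the "model" is a single scalar weight.

```python
def fake_quantize(w: float, scale: float, bits: int = 4) -> float:
    """Round w onto a signed integer grid, then map it back to float.

    The forward pass therefore sees the same rounding error the deployed,
    quantized model will see. (Illustrative symmetric grid, not Q4_0.)
    """
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def ste_grad(w: float, scale: float, upstream: float, bits: int = 4) -> float:
    """Straight-through estimator for the rounding step.

    Rounding has zero gradient almost everywhere, so the gradient is passed
    through unchanged inside the representable range and zeroed outside it.
    """
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    inside = qmin * scale <= w <= qmax * scale
    return upstream if inside else 0.0


def qat_step(w: float, x: float, y: float, scale: float, lr: float = 0.1) -> float:
    """One SGD step on squared error for the toy model y_hat = quant(w) * x."""
    w_q = fake_quantize(w, scale)         # forward pass uses the quantized weight
    err = w_q * x - y
    grad_wq = 2.0 * err * x               # d(loss)/d(w_q)
    grad_w = ste_grad(w, scale, grad_wq)  # d(loss)/d(w) via the STE
    return w - lr * grad_w                # update the full-precision "shadow" weight
```

The point of the design is that the optimizer keeps updating a full-precision weight, while the loss is always computed on its quantized version. PTQ, by contrast, applies only the `fake_quantize` step once after training has finished.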

We applied this QAT recipe to the popular Q4_0 format to maximize performance for all the models. For the edge models (E2B and E4B), we rethought how we approach quantization with a special mobile-specialized quantization schema.
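
For readers unfamiliar with Q4_0: in the llama.cpp/GGUF ecosystem it is generally described as a block format that stores weights in groups of 32, each group holding one 16-bit scale and 32 four-bit values. The sketch below follows that description. It is an illustration, not the ggml source: the function names are hypothetical, the scale is kept as a Python float rather than rounded to fp16, and the 4-bit values are left unpacked rather than stored two per byte.

```python
BLOCK = 32  # weights per block

# One 16-bit scale plus 32 four-bit values, averaged over the block.
BITS_PER_WEIGHT = (16 + BLOCK * 4) / BLOCK  # 4.5


def quantize_q4_0_block(xs):
    """Quantize 32 floats to (scale, 32 integers in 0..15)."""
    assert len(xs) == BLOCK
    m = max(xs, key=abs)             # largest-magnitude value, sign kept
    d = m / -8.0                     # scale chosen so that m maps to code 0
    inv = 1.0 / d if d != 0.0 else 0.0
    # Shift by 8 to get unsigned codes; +0.5 turns truncation into rounding.
    qs = [min(15, int(x * inv + 8.5)) for x in xs]
    return d, qs


def dequantize_q4_0_block(d, qs):
    """Map codes back to floats: value = (code - 8) * scale."""
    return [(q - 8) * d for q in qs]
```

Usage, on a simple ramp of values:

```python
xs = [(i - 16) / 16 for i in range(32)]
d, qs = quantize_q4_0_block(xs)
ys = dequantize_q4_0_block(d, qs)
max_err = max(abs(a - b) for a, b in zip(xs, ys))
```

Because the scale is shared by the whole block, a single outlier weight coarsens the grid for its 31 neighbours. That is one intuition for why training with the quantizer in the loop can help: the weights can adapt to the grid they will eventually be stored on.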

Saving on VRAM and Storage

Below are the approximate memory requirements indicating how much VRAM is required to load the models:
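
The arithmetic behind figures of this kind can be sketched as a back-of-the-envelope estimate of weight memory: parameter count times bits per weight. The numbers below are assumptions, not values from the release: the parameter count is the nominal "12B" rather than an exact count, 4.5 bits per weight is the commonly cited average for Q4_0 (4-bit values plus a 16-bit scale per block of 32), and the helper name `weight_memory_gb` is hypothetical. The estimate covers weights only; KV cache, activations, and any tensors kept at higher precision are not included.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal 12B parameters (assumed) at roughly Q4_0 density vs. 16-bit weights.
q4_gb = weight_memory_gb(12e9, 4.5)   # 6.75
bf16_gb = weight_memory_gb(12e9, 16)  # 24.0
```

Under these assumptions the 4-bit weights take a bit over a quarter of the space of 16-bit weights. For comparison, the summary above quotes 6.7GB of VRAM for the 12B model.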
