Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

Original link: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

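Two months after the release of Gemma 4, recent updates include the addition of Multi-Token Prediction (MTP) for faster inference and a new 12B model. To further improve accessibility, the team has now released new checkpoints optimized with Quantization-Aware Training (QAT). Unlike standard Post-Training Quantization (PTQ), which can degrade model quality, QAT integrates quantization directly into the training process. This significantly reduces the memory footprint while preserving performance, allowing Gemma 4 models to run efficiently on consumer GPUs and edge devices. The release includes support for the standard Q4_0 format, as well as a new mobile-optimized scheme that reduces the memory footprint of the Gemma 4 E2B model to just 1GB. These improvements let developers use Gemma 4's quality and capabilities on hardware with limited VRAM and storage, making high-performance AI models more portable than ever.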

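Google has released Quantization-Aware Training (QAT) models for the Gemma 4 family, aiming to deliver high performance and accuracy on mobile and local devices while significantly reducing memory and compute requirements. The Hacker News community is actively testing these models and notes that they perform well on ordinary consumer hardware, including laptops and phones, with some variants only 1–3GB in size. Discussion highlights include:

* **Efficiency:** Developers have successfully applied QAT models to real-time on-device audio, image, and text processing.
* **Optimization:** Although Google's official QAT benchmarks are strong, community members such as Unsloth are creating finer-grained, high-performance quantized versions that outperform the standard ones, which has prompted a healthy discussion about quantization methods.
* **The "local AI" debate:** Users are divided on the value of local models. Supporters value privacy, cost-effectiveness, and independence from cloud AI providers; critics argue that the specialized hardware/software setup is too cumbersome compared with established cloud services such as Claude or GPT.

Overall, the release marks an important milestone in enabling frontier AI inference on personal devices and is set to drive growth in the ecosystem of local automation and self-hosted tools.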

Original text

Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to accelerate inference, and just a couple of days ago, we released a 12B model to bridge the gap between our E4B and 26B MOE models.

Today, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, so you can run models locally on everyday edge devices and consumer GPUs.

By simulating quantization during training, QAT minimizes quality loss when the model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format as well as a novel quantization format specialized for mobile use cases. Using this mobile format, we’ve reduced the memory footprint of Gemma 4 E2B to 1GB. Together, these dramatically reduce memory requirements while preserving the capabilities and quality you expect from Gemma 4.

Keeping model quality while making models smaller

Quantization is a key technology to run models on consumer hardware by reducing their memory footprint while also accelerating decode speed. However, standard Post-Training Quantization (PTQ) often leads to performance degradation. Instead of simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ is already effective at preserving quality, our QAT results yield even higher overall quality compared to standard PTQ baselines.
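
The idea of integrating quantization into training can be illustrated with a toy sketch. The post does not describe the actual Gemma 4 QAT recipe, so everything below is an assumption for illustration: the function names `fake_quantize` and `qat_step`, the symmetric 4-bit grid with a single per-tensor scale, and the straight-through estimator (a common QAT choice that treats rounding as the identity when computing gradients).

```python
def fake_quantize(weights, bits=4):
    """Snap float weights to a symmetric signed integer grid, then map back to floats.

    This simulates the rounding error of low-bit storage while the
    caller keeps working with ordinary floats.
    """
    qmax = 2 ** (bits - 1) - 1        # 7 for 4 bits
    qmin = -(2 ** (bits - 1))         # -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def qat_step(weights, x, target, lr=0.05):
    """One SGD step on squared error for the toy model y = w . x.

    The forward pass uses the fake-quantized weights, so the loss "sees"
    the quantization error. The update is applied to the float weights.
    """
    wq = fake_quantize(weights)
    pred = sum(w * xi for w, xi in zip(wq, x))
    err = pred - target
    # Straight-through estimator (assumed): d(wq)/d(w) is treated as 1,
    # so the gradient w.r.t. the quantized weights updates the float weights.
    return [w - lr * 2 * err * xi for w, xi in zip(weights, x)]


if __name__ == "__main__":
    w = [0.9, -0.35, 0.12, 0.5]
    print("float weights:   ", w)
    print("fake-quantized:  ", fake_quantize(w))
    print("after one step:  ", qat_step(w, x=[1.0, 0.5, -1.0, 2.0], target=1.0))
```

In a sketch like this, post-training quantization would correspond to calling `fake_quantize` once on weights trained without it, whereas QAT calls it in every forward pass so the float weights can adapt to the rounding.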

We applied this QAT recipe to the popular Q4_0 format to maximize performance for all the models. For the edge models (E2B and E4B), we rethought how we approach quantization with a special mobile-specialized quantization schema.
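
As a rough feel for what a 4-bit format saves, the arithmetic below assumes the Q4_0 block layout used by GGML/llama.cpp: 32 weights per block, stored as 32 four-bit values plus one 16-bit scale, for 18 bytes per block (4.5 bits per weight). The helper names and the parameter count are hypothetical; the count is a round number, not an official Gemma 4 figure. Real files may be larger because some tensors can be kept at higher precision, and runtime memory also includes the KV cache and activations.

```python
Q4_0_BLOCK_WEIGHTS = 32       # weights per block (assumed GGML Q4_0 layout)
Q4_0_BLOCK_BYTES = 2 + 16     # one 16-bit scale + 32 four-bit values


def q4_0_bytes(num_weights):
    """Bytes needed to store num_weights values in whole Q4_0 blocks."""
    blocks = -(-num_weights // Q4_0_BLOCK_WEIGHTS)   # ceiling division
    return blocks * Q4_0_BLOCK_BYTES


def bf16_bytes(num_weights):
    """Bytes needed to store num_weights values at 16 bits each."""
    return num_weights * 2


if __name__ == "__main__":
    n = 12_000_000_000  # hypothetical round parameter count
    print(f"bf16 weights: {bf16_bytes(n) / 1e9:.2f} GB")
    print(f"Q4_0 weights: {q4_0_bytes(n) / 1e9:.2f} GB")
    print(f"bits per weight: {q4_0_bytes(n) * 8 / n}")
```

Under these assumptions, 12 billion weights take 6.75 GB in Q4_0 versus 24 GB at 16 bits, a reduction of roughly 3.6x for the weights alone.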

Saving on VRAM and Storage

Below are the approximate memory requirements indicating how much VRAM is required to load the models: