Gemma 4 12B: A unified, encoder-free multimodal model


Original link: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/

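Google has introduced **Gemma 4 12B**, a new mid-sized model designed to bring high-performance, agentic, multimodal AI to local laptops. It fills the gap between the E4B and 26B models, keeping advanced reasoning capabilities while substantially reducing the memory footprint to just 16GB of VRAM. Key features include:

* **Unified architecture**: Vision and audio inputs are integrated directly into the LLM backbone, with no separate encoders.
* **Native modalities**: The first mid-sized Gemma model to support native audio input.
* **Efficiency**: Ships with Multi-Token Prediction (MTP) to reduce latency.
* **Accessibility**: Released under the Apache 2.0 license, keeping it open and compatible with the broader developer ecosystem.

Gemma 4 12B stays lightweight while delivering reasoning performance comparable to larger models, letting developers run state-of-the-art AI agents and complex multi-step workflows directly on consumer hardware. The release follows the Gemma family reaching the milestone of 150 million downloads and continues Google's commitment to making powerful AI tools widely available.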

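Google has released **Gemma 4 12B**, a new multimodal model that needs no encoder and can process text, audio, and images. It is designed to be "laptop ready" and runs on hardware with 16GB of VRAM or unified memory.

**Key technical points:**

* **Architecture:** The model replaces the traditional, compute-intensive vision encoder with a "linear projection" approach (a simple matrix multiplication). This reduces memory overhead and simplifies the input pipeline, but some users find its multimodal performance, particularly image recognition, less reliable than that of large specialized models or closed-source alternatives such as Gemini.
* **Performance:** Early community benchmark results are mixed. Some users consider it an efficient tool for local coding assistance and structured-data processing, while others report syntax errors, "hallucinations" during reasoning, and difficulty with image recognition compared with larger models or the Qwen family.
* **Community feedback:** Discussion on Hacker News centers on hardware accessibility and the "closed-loop" nature of big-tech model releases. Many welcome the trend toward efficient, local AI, but others voice dissatisfaction with the memory requirements and the "black box" nature of Google's safety guardrails, and say they prefer other models such as Qwen 3.6 for coding and data tasks.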
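The "linear projection" idea mentioned in the architecture point can be illustrated with a minimal sketch: cut the image into patches, flatten each patch, and multiply it by a single weight matrix to obtain one token embedding per patch. Everything specific here is an assumption for illustration only (the grayscale input, the patch size `PATCH`, the embedding width `D_MODEL`, the random weights, and the helper names `patchify` and `linear_project`); the announcement does not publish Gemma 4 12B's actual input pipeline.

```python
import random
from typing import List

Matrix = List[List[float]]


def patchify(image: Matrix, patch: int) -> Matrix:
    """Split an H x W image into non-overlapping patch x patch tiles,
    each flattened row by row into a single vector."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch)
                            for dx in range(patch)])
    return patches


def linear_project(patches: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """One matrix multiplication plus bias: (n, patch*patch) @ (patch*patch, d) -> (n, d)."""
    return [[sum(x * weight[i][j] for i, x in enumerate(p)) + bias[j]
             for j in range(len(bias))]
            for p in patches]


# Hypothetical sizes, chosen only to keep the example small.
PATCH, D_MODEL = 4, 8
rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH)]
bias = [0.0] * D_MODEL

# A toy 8 x 8 grayscale "image" with values in [0, 1].
image = [[(row * 8 + col) / 63 for col in range(8)] for row in range(8)]

tokens = linear_project(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))  # 4 8
```

The resulting vectors would sit alongside text token embeddings in the language model's input sequence. The contrast with an encoder-based design is that no separate stack of vision layers runs before the backbone sees the image.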

Original text

Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs.

Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads. You’ve built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We're excited to see what you build with this latest addition.

Here’s an overview of what makes Gemma 4 12B unique:

* **Novel unified architecture:** No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.
* **Advanced reasoning:** Benchmark performance nearing our 26B model, unlocking powerful multi-step reasoning and agentic workflows.
* **Laptop ready:** Small enough to run locally with just 16GB of VRAM or unified memory.
* **Open and accessible:** Released under an Apache 2.0 license with support across the developer ecosystem.
* **Drafter-ready:** Gemma 4 12B comes equipped with Multi-Token Prediction (MTP) drafters to reduce latency.

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let's now take a closer look at how Gemma 4 12B achieves this.
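The announcement does not explain how the MTP drafters work internally. A common way a drafter cuts latency is draft-and-verify (speculative) decoding: a cheap drafter proposes several tokens, and the main model checks them, keeping the longest correct prefix. The sketch below shows that general idea under greedy decoding, assuming this is how the drafters are used. The two toy "models" (`target_next`, `draft_next`) and the draft length `k` are hypothetical stand-ins, not Gemma's API, and the bonus token that real schemes gain after a fully accepted draft is omitted for brevity.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large model's greedy next-token choice (hypothetical)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy drafter: matches the target except when len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens, verify them against the target, keep the agreeing prefix.

    Returns the generated tokens and the number of verification rounds. In a
    real system each round is a single batched forward pass of the target.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token, discard the rest
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], rounds


prompt = [1, 2, 3]
tokens, rounds = speculative_decode(target_next, draft_next, prompt, n=12)
assert tokens == greedy_decode(target_next, prompt, 12)  # identical output
print(f"generated {len(tokens)} tokens in {rounds} verification rounds")
```

Because every kept token is the one the target model would have chosen anyway, the output is identical to plain greedy decoding. The saving comes from needing fewer sequential passes of the large model when the drafter guesses well.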

Run state-of-the-art agents locally

Gemma 4 12B delivers performance nearing our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
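The post does not say which numeric precision the 16GB figure assumes. As a rough back-of-the-envelope check, the sketch below estimates the memory needed for the weights alone at a few common precisions. The parameter counts (12 billion and 26 billion, read off the model names) and the bit widths are assumptions; KV cache, activations, and runtime overhead are ignored, so real usage would be higher.

```python
GIB = 2 ** 30


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# Assumptions: "12B" and "26B" taken as exact parameter counts,
# and the 16GB budget treated as 16 GiB.
PARAMS_12B = 12e9
PARAMS_26B = 26e9
BUDGET_GIB = 16

for bits in (16, 8, 4):
    size = weight_memory_gib(PARAMS_12B, bits)
    verdict = "fits" if size < BUDGET_GIB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GiB ({verdict} in {BUDGET_GIB} GiB)")

# At equal precision the weight footprint scales with the parameter count.
ratio = weight_memory_gib(PARAMS_12B, 16) / weight_memory_gib(PARAMS_26B, 16)
print(f"12B / 26B weight ratio at equal precision: {ratio:.2f}")
```

Under these assumptions, 16-bit weights come to about 22 GiB and would not fit, while 8-bit weights (about 11 GiB) and 4-bit weights (under 6 GiB) would. That suggests the laptop figure relies on some form of quantization. The parameter ratio of roughly 0.46 is consistent with the "less than half" claim.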
