Gemma 4 12B: A unified, encoder-free multimodal model

Original link: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/

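Google has introduced **Gemma 4 12B**, a new mid-sized model designed to bring high-performance, agentic, multimodal AI to local laptops. It fills the gap between the E4B and 26B models, keeping advanced reasoning while substantially reducing the memory footprint: it needs only 16GB of VRAM or unified memory. Key features include:

* **Unified architecture**: Vision and audio inputs feed directly into the LLM backbone, with no separate encoders.
* **Native modalities**: The first mid-sized Gemma model to support native audio input.
* **Efficiency**: Ships with Multi-Token Prediction (MTP) drafters to reduce latency.
* **Accessibility**: Released under the Apache 2.0 license, keeping it open and supported across the wider developer ecosystem.

While staying lightweight, Gemma 4 12B delivers reasoning performance approaching that of the larger 26B model, letting developers run state-of-the-art AI agents and complex multi-step workflows directly on consumer hardware. The release follows Gemma 4 models passing the milestone of 150 million downloads and continues Google's commitment to making powerful AI tools widely available.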

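Google has released **Gemma 4 12B**, a new open-weight, encoder-free multimodal model. Its most notable feature is that it drops the dedicated vision encoder (such as SigLIP) in favor of a lightweight single-layer projection module. This simplifies the architecture, but some users questioned the robustness of this 35-million-parameter approach. The model is intended to run on local consumer hardware (16GB of memory), though developers noted that reaching that footprint typically requires quantization. 12B models are generally considered to hold up well at 4-bit quantization, but the community is still waiting for further benchmarks to determine whether the base model's performance justifies the architectural change. Overall, the release was welcomed by the Hacker News community, with many comparing Google's current investment in open-weight models favorably to Meta's early Llama strategy.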
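To make the "single-layer projection instead of a vision encoder" idea concrete, here is a minimal sketch in pure Python. It is not Gemma's implementation: the patch size, embedding width, grayscale image, and the helper names `patchify` and `project` are all hypothetical and chosen only for illustration. It shows image patches being mapped by one linear layer straight into the same embedding sequence that text tokens occupy.

```python
import random


def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def project(vectors, weights, bias):
    """Apply one linear layer: each patch vector becomes one embedding.

    weights[j] holds the input weights for output dimension j.
    """
    return [[sum(x * w for x, w in zip(vec, row)) + b
             for row, b in zip(weights, bias)]
            for vec in vectors]


PATCH = 4    # hypothetical patch size (pixels per side)
HIDDEN = 8   # hypothetical embedding width of the language model

rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 grayscale image
weights = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH)] for _ in range(HIDDEN)]
bias = [0.0] * HIDDEN

# No separate encoder: patches go through a single projection and then
# sit in the same sequence as the (stand-in) text token embeddings.
image_tokens = project(patchify(image, PATCH), weights, bias)
text_tokens = [[0.0] * HIDDEN for _ in range(3)]
sequence = image_tokens + text_tokens

print(len(image_tokens), "image tokens,", len(sequence), "tokens in total")
```

In a design like this, all visual understanding has to be learned by the language model's own layers, which is the trade-off the commenters were questioning.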

Original text

Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs.

Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads. You’ve built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We're excited to see what you build with this latest addition.

Here’s an overview of what makes Gemma 4 12B unique:

* **Novel unified architecture**: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.
* **Advanced reasoning**: Benchmark performance nearing our 26B model, unlocking powerful multi-step reasoning and agentic workflows.
* **Laptop ready**: Small enough to run locally with just 16GB of VRAM or unified memory.
* **Open and accessible**: Released under an Apache 2.0 license with support across the developer ecosystem.
* **Drafter-ready**: Gemma 4 12B comes equipped with Multi-Token Prediction (MTP) drafters to reduce latency.

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let's now take a closer look at how Gemma 4 12B achieves this.
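The post does not spell out how the MTP drafters cut latency. A common way drafters are used is speculative decoding, and the toy sketch below assumes that mechanism: a cheap drafter proposes several tokens, the large model checks them, and the longest correct prefix is kept. The two "models" here are hypothetical arithmetic stand-ins, not real networks, and the function names are invented for illustration.

```python
def target_next(context):
    """Toy stand-in for the large model: deterministic next token."""
    return (sum(context) * 31 + len(context)) % 10


def draft_next(context):
    """Toy stand-in for the drafter: wrong whenever len(context) % 4 == 0."""
    guess = target_next(context)
    return (guess + 1) % 10 if len(context) % 4 == 0 else guess


def greedy_decode(prompt, num_new):
    """Baseline: one large-model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new, k=3):
    """Greedy speculative decoding; returns (tokens, verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    passes = 0
    while len(tokens) < goal:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The large model verifies every drafted position. A real system
        #    does this in a single batched forward pass; here it is a loop.
        passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # fix the first wrong token, then stop
                break
        else:
            # Every draft was right, so the verification pass yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:goal], passes


prompt = [1, 2, 3]
spec_tokens, passes = speculative_decode(prompt, num_new=12)
assert spec_tokens == greedy_decode(prompt, 12)  # same output as plain greedy decoding
print(f"12 tokens generated with {passes} large-model verification passes")
```

The output matches plain greedy decoding exactly; the saving comes from needing fewer large-model passes whenever the drafter guesses correctly.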

Run state-of-the-art agents locally

Gemma 4 12B delivers performance nearing our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
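Whether a 12B-parameter model fits in 16GB depends largely on the precision at which its weights are stored. The back-of-the-envelope estimate below counts weights only; it ignores the KV cache, activations, and runtime overhead, and it assumes a nominal 12 billion parameters taken from the model name rather than an official figure.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_12B = 12e9  # assumption: nominal count implied by the "12B" name
BUDGET_GB = 16

for bits in (16, 8, 4):
    gb = weight_memory_gb(PARAMS_12B, bits)
    verdict = "fits" if gb < BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB ({verdict} in {BUDGET_GB} GB, weights only)")
```

By this estimate, 16-bit weights alone come to about 24 GB, while 8-bit and 4-bit weights come to about 12 GB and 6 GB, leaving headroom for the rest of the runtime. At equal precision, 12B parameters is also about 46% of 26B, which is consistent with the "less than half" footprint claim.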
