T5Gemma 2: The next generation of encoder-decoder models

Original link: https://blog.google/technology/developers/t5gemma-2/

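## T5Gemma 2: A new generation of efficient multimodal models

Building on the success of T5Gemma, T5Gemma 2 incorporates innovations from the Gemma 3 family to advance the encoder-decoder architecture. This generation adds **multimodal** capability (understanding both images and text) and substantially **extends the context window** to up to 128K tokens.

The key to T5Gemma 2's efficiency is a pair of architectural changes that reduce the parameter count: **tied word embeddings**, shared between the encoder and decoder, and **merged attention** in the decoder. The resulting models are compact, at 270M-270M, 1B-1B, and 4B-4B parameters, which suits on-device applications and rapid experimentation.

T5Gemma 2 is trained on a large multilingual dataset and supports more than 140 languages. It performs strongly across a range of tasks, inheriting Gemma 3's capabilities while offering a more accessible and more general-purpose encoder-decoder option.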

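## T5Gemma 2: A new type of encoder-decoder model

Google has released T5Gemma 2, a new generation of encoder-decoder models with a 128K-token context window. The models (ranging from 270M-270M to 4B-4B parameters) show promise, but the discussion focused on the absence of released post-trained checkpoints, which limits broader accessibility.

Key points from the discussion: T5Gemma 2 uses tied embeddings to improve efficiency, fitting more capability into a smaller model size. Compared with decoder-only models such as Gemma, encoder-decoder models are more efficient on tasks such as summarization and translation, especially for fine-tuning and inference. They are good at taking in the entire input at once. The architecture differs from a decoder-only model in that the encoder "understands" the input while the decoder generates the output. This helps on tasks that need a clear separation between understanding and completion, although decoder-only models are usually preferred for generative tasks because of their parameter efficiency.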

Original text

T5Gemma 2 is the next evolution of our encoder-decoder family based on Gemma 3, featuring the first multi-modal and long-context encoder-decoder models.

Unlike T5Gemma, T5Gemma 2 adopts tied word embeddings (over encoder and decoder) and merged decoder self- and cross-attention to save model parameters. It offers compact pre-trained models at sizes of 270M-270M (~370M total, excluding vision encoder), 1B-1B (~1.7B) and 4B-4B (~7B) parameters, making them ideal for rapid experimentation and deployment in on-device applications.
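
The gap between "270M-270M" and "~370M total" can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the roughly 170M embedding / 100M transformer-block split that Google reported for Gemma 3 270M, and that each stack (encoder and decoder) keeps about that many block parameters. The function names are hypothetical and the vision encoder is left out, as in the quoted totals.

```python
# Assumed figures (not from this post): approximate Gemma 3 270M split.
EMBEDDING_PARAMS = 170_000_000   # one token-embedding table
BLOCK_PARAMS = 100_000_000       # transformer blocks in one stack


def untied_total(embedding: int, encoder_blocks: int, decoder_blocks: int) -> int:
    """Total when encoder and decoder each keep their own embedding table."""
    return 2 * embedding + encoder_blocks + decoder_blocks


def tied_total(embedding: int, encoder_blocks: int, decoder_blocks: int) -> int:
    """Total when encoder and decoder share a single embedding table."""
    return embedding + encoder_blocks + decoder_blocks


untied = untied_total(EMBEDDING_PARAMS, BLOCK_PARAMS, BLOCK_PARAMS)
tied = tied_total(EMBEDDING_PARAMS, BLOCK_PARAMS, BLOCK_PARAMS)
saved = untied - tied  # exactly one embedding table
```

Under these assumptions the tied total comes to 370M rather than 540M, which is consistent with the "~370M" figure above. At this scale the embedding table is the largest single component, so sharing it matters most for the smallest model.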

Background

With the original T5Gemma, we demonstrated that we could successfully adapt modern, pre-trained decoder-only models into an encoder-decoder architecture, unlocking new versatility. By initializing with weights from a powerful decoder-only model and then applying continued pre-training, we created high-quality, inference-efficient models while bypassing the computational cost of training from scratch.

T5Gemma 2 extends this into the realm of vision-language models by incorporating key innovations from Gemma 3.

What’s new

T5Gemma 2 is more than a re-training. It incorporates significant architectural changes while inheriting many of the powerful, next-generation features of the Gemma 3 family.

Architectural innovations for efficiency

To maximize efficiency at smaller scales, we have introduced key structural refinements:

Tied embeddings: We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint — crucial for our new compact 270M-270M model.
Merged attention: In the decoder, we adopt a merged attention mechanism, combining self- and cross-attention into a single, unified attention layer. This reduces model parameters and architectural complexity, improving model parallelization and benefiting inference.
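
One plausible reading of merged attention can be sketched in a few lines of plain Python: each decoder position runs a single softmax over the encoder outputs (all visible) concatenated with the decoder positions up to and including itself (causal). This is a simplified illustration, not the released implementation. The function names are hypothetical, there is a single head, and the learned query/key/value projections are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def merged_attention(dec_q, dec_k, dec_v, enc_k, enc_v):
    """Single-head merged attention (illustrative sketch).

    dec_q, dec_k, dec_v: per-position vectors for the decoder sequence.
    enc_k, enc_v: per-position vectors for the encoder outputs.
    Decoder position t attends to every encoder position and to decoder
    positions 0..t, using one softmax over the combined set.
    """
    scale = 1.0 / math.sqrt(len(dec_q[0]))
    outputs = []
    for t, query in enumerate(dec_q):
        keys = enc_k + dec_k[: t + 1]      # encoder keys, then causal decoder keys
        values = enc_v + dec_v[: t + 1]
        weights = softmax([dot(query, k) * scale for k in keys])
        width = len(values[0])
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
        )
    return outputs
```

With separate self- and cross-attention, a decoder layer carries two sets of attention weights and two softmaxes. In this sketch there is only one of each, which is where the parameter and complexity savings described above would come from.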

Next-generation capabilities

Drawing from Gemma 3, T5Gemma 2 also represents a significant upgrade in model capabilities:

Multimodality: T5Gemma 2 models can understand and process images alongside text. By utilizing a highly efficient vision encoder, the models can seamlessly perform visual question answering and multimodal reasoning tasks.
Extended long context: We've dramatically expanded the context window. Leveraging Gemma 3's alternating local and global attention mechanism, T5Gemma 2 can handle context windows of up to 128K tokens.
Massively multilingual: Trained on a larger, more diverse dataset, these models now support over 140 languages out of the box.
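
To see why alternating local and global attention helps at 128K tokens, the sketch below counts how many key/value positions a stack of layers has to keep for a long input. The 5:1 local-to-global ratio and the 1024-token window are assumptions taken from the Gemma 3 design rather than from this post, and the function names are hypothetical.

```python
def layer_pattern(num_layers: int, local_per_global: int = 5) -> list:
    """Label each layer 'local' or 'global'; every (local_per_global + 1)-th is global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def cached_positions(pattern: list, context_len: int, window: int) -> int:
    """Key/value positions kept across all layers for one sequence.

    Global layers keep the full context; local layers keep at most `window`
    positions because they only look at a sliding window.
    """
    return sum(
        context_len if kind == "global" else min(window, context_len)
        for kind in pattern
    )


CONTEXT = 128 * 1024   # 128K tokens
WINDOW = 1024          # assumed sliding-window size

pattern = layer_pattern(6)
mixed = cached_positions(pattern, CONTEXT, WINDOW)
all_global = cached_positions(["global"] * 6, CONTEXT, WINDOW)
```

For a six-layer group under these assumptions, the mixed pattern keeps 136,192 positions where six global layers would keep 786,432, which is roughly a 5.8x reduction in attention state for the same context length.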

Performance

T5Gemma 2 sets a new standard for what compact encoder-decoder models can achieve. Our new models demonstrate strong performance across key capability areas, inheriting the powerful multimodal and long-context features from the Gemma 3 architecture.