Trinity Large: An open 400B sparse MoE model

Original link: https://www.arcee.ai/blog/trinity-large

## Trinity-Large: A New Frontier for Open-Source AI

After two months of intensive development, the team has released Trinity-Large, a 400B-parameter sparse mixture-of-experts (MoE) model, along with two additional variants: Trinity-Large-Base (a true base model) and Trinity-Large-Preview (a chat-ready model). The project cost roughly $20 million and represents a significant step toward accessible, high-performance AI.

Trinity-Large has a distinctive architecture with high sparsity (1.56% active parameters), enabling faster, more efficient training and inference, roughly 2-3x faster than comparable models. It was trained on 17T tokens of curated data and reaches frontier-level performance in areas such as math, coding, and reasoning, matching or exceeding existing open models.

The released *Preview* version prioritizes usefulness in creative tasks and agentic applications, while the *Base* model gives researchers a clean checkpoint for studying the effects of pretraining. The team used techniques such as momentum-based expert load balancing and z-loss to stabilize training. Trinity-Large-Preview is currently available for free on OpenRouter, with a full release and further improvements planned. The release aims to give the open-source community a powerful, ownable, frontier-level model.

## Trinity Large: A New Open-Source AI Model

Arcee.ai has released Trinity Large, a 400B-parameter sparse mixture-of-experts (MoE) model, trained in 33 days at a cost of roughly $20 million. The model performs close to Qwen and DeepSeek despite using only 13B active parameters per token.

Discussion centers on the trade-offs in the training approach. While larger models usually demand more resources, Trinity Large prioritizes per-parameter performance, potentially making it more efficient at inference. However, some argue it is undertrained compared to competitors such as GLM and Qwen, given its lower active-parameter count and smaller total token count (17T vs. 20-30T+).

Users are excited about having a "true base" model for research and about the potential to distill it into other architectures. The model's weights are available under the Apache-2.0 license, but the dataset is not open. A key question is how Arcee.ai plans to commercialize its open models. Quantized versions are already available for easier local use.

Original article

Two months ago I wrote about why we decided to stop treating pretraining like someone else's job.

At the time, Trinity Nano Preview and Trinity Mini had just been released, and Trinity Large had started training. We were in the middle of our first run so big that you either laughed or got nauseous. Frankly, I felt we'd either end up with a really great base model or fall flat on our faces with a tired wallet.

Little did I know, we’d get both.

Here’s what we’re shipping, what surprised us, what broke, and what it took to make a 400B sparse MoE behave.

We're putting out three variants: Trinity-Large-Preview is lightly post-trained and chat-ready, Trinity-Large-Base is our best pretraining checkpoint after the full 17T recipe, and TrueBase is an early checkpoint from the same run at 10T tokens, without any instruct data or LR anneals, which is what many would consider a true base model.

Trinity-Large is a 400B parameter sparse MoE with 13B active parameters per token. It uses 256 experts with 4 experts active per token. That sparsity ratio is pretty high compared to our peers, save for Llama-4-Maverick:

| Model | Routing (k-of-N) | Routing fraction |
| --- | --- | --- |
| Trinity Large | 4-of-256 | 1.56% |
| DeepSeek-V3 | 8-of-256 | 3.13% |
| MiniMax-M2 | 8-of-256 | 3.13% |
| GLM-4.5 | 8-of-160 | 5.0% |
| Qwen3-235B-A22B | 8-of-128 | 6.25% |
| Llama 4 Maverick | 1-of-128 | 0.78% |

We originally aimed for a slightly different total size (420B), but we ended up increasing the number of dense layers (from 3 to 6) to help keep routing stable at this sparsity.
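To make the 4-of-256 routing concrete, here is a minimal top-k router sketch. This is illustrative PyTorch, not Trinity's actual code; the hidden size and tensor names are placeholders.

```python
import torch
import torch.nn.functional as F

NUM_EXPERTS = 256  # total routed experts
TOP_K = 4          # experts active per token: 4 / 256 = 1.56% routing fraction

def route(hidden: torch.Tensor, router_weight: torch.Tensor):
    """Select the top-k experts per token and return normalized gate weights."""
    logits = hidden @ router_weight                        # [tokens, NUM_EXPERTS]
    gate_probs = F.softmax(logits, dim=-1)
    topk_probs, topk_ids = gate_probs.topk(TOP_K, dim=-1)  # [tokens, TOP_K]
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
    return topk_ids, topk_probs

# Toy usage: 8 tokens with a placeholder hidden size of 1024.
ids, gates = route(torch.randn(8, 1024), torch.randn(1024, NUM_EXPERTS))
print(ids.shape, gates.shape)  # torch.Size([8, 4]) torch.Size([8, 4])
```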

Trinity-Large-Base is a true frontier-class foundation model. We match or exceed our peers among open base models across a wide range of benchmarks, including math, coding, scientific reasoning, and raw knowledge absorption.

Inference efficiency

We trained on 2048 Nvidia B300 GPUs. As far as we can tell, it’s the largest (publicly stated, at least) pretraining run done on these machines. That means two things:

  1. They’re wicked fast.
  2. They’re not cheap.

Therefore, we had to make the most of the money we allotted to these machines, which bought us just over 30 days. That's ridiculously fast for a run of this scale, so efficient training was the name of the game; hence the level of sparsity described above. Combined with the efficient attention outlined in our technical report, this enabled us to train and, by extension, run inference much faster than our peers in the same weight class, roughly 2-3x faster on the same hardware, all while not sacrificing performance.

Momentum-based expert load balancing

We keep MoE routing under control by nudging each expert’s router bias up or down depending on whether that expert is being over- or under-used. The update is capped with a tanh clip so it stays bounded, and we add momentum to smooth it across steps and avoid step-to-step ping-pong. On top of that, we include a small per-sequence balance loss so load is not only balanced in expectation across the batch, but also within individual sequences.
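The exact formulation is in the technical report; as a rough sketch of the idea, with the learning rate, momentum coefficient, cap, and loss shape as placeholder choices rather than Trinity's actual settings:

```python
import torch

def update_router_bias(bias, momentum_buf, expert_load, lr=0.01, beta=0.9, cap=1.0):
    """Nudge each expert's router bias up or down toward balanced load (illustrative only).

    expert_load: fraction of tokens routed to each expert this step, shape [num_experts].
    """
    target = 1.0 / expert_load.numel()                        # ideal uniform load per expert
    raw = cap * torch.tanh((target - expert_load) / target)   # bounded (tanh-clipped) correction
    momentum_buf = beta * momentum_buf + (1.0 - beta) * raw   # momentum smooths step-to-step ping-pong
    return bias + lr * momentum_buf, momentum_buf

def per_sequence_balance_loss(gate_probs):
    """Small auxiliary loss so load is balanced within each sequence, not just across the batch.

    gate_probs: router probabilities, shape [batch, seq_len, num_experts].
    """
    per_seq_load = gate_probs.mean(dim=1)                     # average expert load per sequence
    uniform = 1.0 / gate_probs.shape[-1]
    return ((per_seq_load - uniform) ** 2).sum(dim=-1).mean()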

z-loss

We use z-loss to stop the LM-head logits from drifting upward during training. It is a lightweight regularizer that keeps logit scale from creeping up without bound. We also log basic logit stats (for example max and mean) as a simple early warning for instability.
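The standard form of this regularizer penalizes the squared log-partition of the LM-head logits. A minimal sketch, where the coefficient is a placeholder and not necessarily the value used in the run:

```python
import torch

def z_loss(lm_logits: torch.Tensor, coef: float = 1e-4) -> torch.Tensor:
    """Penalize the squared log-normalizer so logit scale can't creep upward without bound."""
    z = torch.logsumexp(lm_logits.float(), dim=-1)  # log of the softmax partition, per token
    return coef * (z ** 2).mean()

def logit_stats(lm_logits: torch.Tensor) -> dict:
    """Basic logit statistics to log as a simple early warning for instability."""
    return {"logit_max": lm_logits.max().item(), "logit_mean": lm_logits.mean().item()}
```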

The exact equations are in the technical report.

Our fastest configuration was HSDP with expert parallelism set to 8, which gave us 2048 data-parallel ranks. In that setup, we pushed throughput further by increasing batch size after 5T tokens of training. We were comfortable doing that because the model is highly sparse, Muon supports a larger critical batch size than AdamW, and the MiniMax-01 paper suggests batch-size scaling remains workable in this regime.
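As a toy illustration of that kind of token-based batch-size ramp, here is a sketch where the batch sizes and switch point are made up, not the run's actual values:

```python
def global_batch_size(tokens_seen: float, base: int = 2048, scaled: int = 4096,
                      switch_at: float = 5e12) -> int:
    """Return the global batch size for the current token count (illustrative numbers only)."""
    return scaled if tokens_seen >= switch_at else base

assert global_batch_size(1e12) == 2048  # before the 5T-token mark
assert global_batch_size(6e12) == 4096  # after the ramp
```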

In the main run, once we had stability dialed in, the loss curve stayed smooth the whole way through. You can see clear phase transitions, no spikes, and a steady march to the end.

The full pretraining run finished in 33 days. That's pre-training only; it doesn't include context extension or post-training.

Data

Trinity Large was trained on 17T tokens of data curated by DatologyAI, split across three phases of 10T, 4T, and 3T tokens. This mix uses state-of-the-art programming, STEM, reasoning, and multilingual data curation, targeting 14 non-English languages. Notably, over 8 trillion tokens of synthetic data were generated for this dataset across web, code, math, reasoning, and multilingual domains, using a breadth of state-of-the-art rephrasing approaches.

A less advanced version of this curation approach worked well for smaller models like Trinity Nano and Trinity Mini, but we wanted to shoot for the moon with Trinity Large. So DatologyAI delivered a number of curation advancements specifically for inclusion in Trinity Large. There's always a leap of faith when doing something for the first time in public, but the effectiveness of the curation approach as a whole is reflected in the downstream evaluations, where Trinity-Large-Base demonstrates frontier-level performance across the targeted capability domains.

Trinity-Large-Preview

The preview we’re releasing today is not a reasoning model. A benefit of the sparsity beyond production inference efficiency is that it also carries over into RL, enabling quicker rollouts for a given total parameter size. But that post-training is just as sensitive as pretraining was. A full-fledged reasoning variant is currently undergoing further post-training and will ship at release; we believe there’s a fine line between intelligence and usefulness, and while the reasoning variant is very intelligent, it needs longer in training before it becomes maximally useful, given the extra tokens per output.

As such, to maximize usefulness and provide an early checkpoint, Preview is a non-reasoning, or “instruct,” model. It received a particularly light post-training, as most of our compute went towards pre-training and is continuing with the reasoning version. It is an extremely capable model for its size and gave us the opportunity to flex some old-school post-training muscles we haven’t had the chance to use in quite some time. It excels at creative writing, storytelling, role-play, chat scenarios, and real-time voice assistance, handling them better than your average reasoning model usually can. But we’re also introducing some of our newer agentic work: it was trained to navigate well in agent harnesses like OpenCode, Cline, and Kilo Code, and to handle complex toolchains and long, constraint-filled prompts. It certainly isn’t perfect, but we cannot wait to see what you do with it. It’s free on OpenRouter until Large (non-preview) fully releases.

| Benchmark | Llama 4 Maverick | Trinity-Large-Preview |
| --- | --- | --- |
| MMLU | 85.5 | 87.2 |
| MMLU-Pro | 80.5 | 75.2 |
| GPQA-Diamond | 69.8 | 63.3 |
| AIME 2025 | 19.3 | 24.0 |

It’s currently roughly in line with Llama-4-Maverick’s Instruct model across standard academic benchmarks, and we’ll update this blog over time with more evaluations.

But we like to tease, so I’ll leave you with some early evaluations of the reasoning Trinity-Large. These are not exhaustive, and by no means representative of the full capabilities of any of these models, but they are a fun look at where we plan to take this model.

Cost

When we started this run, we had never pretrained anything remotely like this before.

There was no guarantee this would work. Not the modeling, not the data, not the training itself, not the operational part where you wake up and a job that costs real money is in a bad state, and you have to decide whether to restart or try to rescue it.

All in—compute, salaries, data, storage, ops—we pulled off this entire effort for $20 million. Four models got us here in six months.

That number is big for us. It's also small compared to what frontier labs spend just to keep the lights on. We don't have infinite retries.

What is TrueBase?

One more thing about Trinity-Large-TrueBase.

Most "base" releases have some instruction data baked in. TrueBase doesn't. It's 10T tokens of pretraining on a 400B sparse MoE, with no instruct data and no LR annealing.

If you're a researcher who wants to study what high-quality pretraining produces at this scale—before any RLHF, before any chat formatting—this is one of the few checkpoints where you can do that. We think there's value in having a real baseline to probe, ablate, or just observe. What did the model learn from the data alone? TrueBase is where you answer that question.

Where to use it

OpenRouter has Trinity-Large-Preview available now, free during the preview period (through at least February 2026). If you want to kick the tires without spinning up infrastructure, that's the fastest path.
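If you want to hit it programmatically, OpenRouter exposes an OpenAI-compatible endpoint; a quick sketch is below. The model slug is a guess on my part, so check OpenRouter's model list for the exact id.

```python
from openai import OpenAI

# OpenRouter speaks the OpenAI chat-completions protocol at this base URL.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="arcee-ai/trinity-large-preview",  # assumed slug; verify on openrouter.ai
    messages=[{"role": "user", "content": "Write a two-sentence story about a lighthouse keeper."}],
)
print(response.choices[0].message.content)
```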

We also worked with Kilo Code, Cline, and OpenCode to have integrations ready at launch. If you're already using one of those for coding, Trinity Large should show up as an option. This is an extremely young post-train compared to the RL runs our peers conduct, which go on for months at a time. We’ll get there soon, but we’re oh-so-proud to have our own model to do it with. Expect rough edges, specifically in coding agents. For everyday agents, though, it’s outstanding.

Context and hosting

Trinity Large natively supports 512k context.

The preview API is running at 128k context with 8-bit quantization as we tune our inference infrastructure. This release is as much a preview of our hosting platform as it is a model launch.

Try Trinity Large

If you put this model into something real and it breaks, tell us. The fastest way for open models to get better is for people to actually use them, hard, in places that don't look like benchmarks.

We like to say that we built Trinity so you can own it. Being able to say that about a frontier-level model is something we’re immeasurably proud of.
