Olmo 3: Charting a path through the model flow to lead open-source AI

Original link: https://allenai.org/blog/olmo3

## Olmo 3: A fully open and traceable family of language models

Ai2 has released Olmo 3, a family of open-source language models (7B and 32B parameters) designed to advance open AI research through full transparency. Rather than releasing only model weights, Olmo 3 provides access to the *entire* model development flow, including datasets, code, checkpoints, and the training process, enabling customization and deeper understanding. The family comprises Olmo 3-Base (the strongest fully open base model), Olmo 3-Think (a leading open reasoning model with traceable reasoning steps), Olmo 3-Instruct (a competitive chat/tool-use model), and Olmo 3-RL Zero (for reinforcement learning research). Olmo 3-Think (32B) significantly narrows the performance gap with larger, closed models while using markedly less training data. Key features include enhanced data curation (the Dolma 3 and Dolci datasets), an efficient training stack, and integration with OlmoTrace, a tool that traces model outputs back to their training data. This lets researchers examine *why* a model exhibits a given behavior and intervene at any stage of development. All components are released under permissive open-source licenses, fostering collaboration and responsible AI development.

## Olmo 3: A new open-source AI model - summary

Allen AI has released Olmo 3, a new family of open-source AI models (7B and 32B parameters) with a focus on transparency. Unlike many "open" models, Olmo provides access to its training data and code, allowing closer scrutiny and understanding of its behavior. A key feature is OlmoTrace, which attempts to show connections between model outputs and specific text fragments in the training data, though its usefulness is debated; some argue it is closer to a dictionary lookup than a true trace of reasoning. Discussion has highlighted concerns about training data quality (including content scraped from assorted sources), the models' tendency to hallucinate or confidently state incorrect information, and integration issues with tools like LM Studio. Despite these challenges, many remain enthusiastic about the potential of a genuinely open-source model for research and development, and believe it offers greater user control and understanding than closed-source alternatives. The release is viewed as a step forward, but further development and community contributions will be needed to realize its full potential.

Original article

Language models are often treated as snapshots—brief captures of a long and carefully curated development process. But sharing only the end result obscures the rich context needed to modify, adapt, and extend a model's capabilities. Many meaningful adjustments require integrating domain-specific knowledge deep within the development pipeline, not merely at the final stage. To truly advance open AI development and research, the entire model flow – not just its endpoint – should be accessible and customizable. The model flow is the full lifecycle of an LM: every stage, checkpoint, dataset, and dependency required to create and modify it. By exposing this complete process, the goal is to engender greater trust and enable more effective adaptation, collaboration, and innovation.

With today's release of Olmo 3, we're empowering the open source community with not only state-of-the-art open models, but the entire model flow and full traceability back to training data.

At its center is Olmo 3-Think (32B), the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them. Olmo 3 is a family of compact, dense models at 7 billion and 32 billion parameters that can run on everything from laptops to research clusters.

  • Olmo 3-Base (7B, 32B) is our most powerful base model yet. When evaluated on our expanded, diverse evaluation suite, Olmo 3-Base delivers the strongest performance among fully open base models – where training data, code, and weights are all publicly available, like Stanford's Marin and Swiss AI's Apertus – and achieves competitive performance with some of the best open-weights base models of comparable size and architecture, including Qwen 2.5 and Gemma 3. Achieving strong results in programming, reading comprehension, and math problem solving, Olmo 3-Base maintains performance at extended context lengths (up to ~65K tokens)—providing a versatile foundation for continued pretraining, targeted fine-tuning, and reinforcement learning and making it easy to build in specialized capabilities like reasoning, tool use (function calling), and instruction following through post-training (see the loading sketch just after this list).
  • Olmo 3-Think (7B, 32B) is our flagship post-trained reasoning set built on Olmo 3-Base. At a time when few organizations are releasing truly open models at this scale, Olmo 3-Think (32B) serves as a workhorse for RL research, long-horizon reasoning, and other advanced experiments that require substantial compute. On our suite of reasoning benchmarks (discussed below), it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens. Olmo 3-Think (7B) brings the same design and training approach to an even more efficient form factor, surfacing intermediate thinking steps for complex prompts while making open, inspectable reasoning accessible on more modest hardware.
  • Olmo 3-Instruct (7B) is a chat and quick-response focused post-train of Olmo 3-Base that handles multi-turn, instruction-following, tool use, and more. In our evaluations, it matches or outperforms open-weight models including Qwen 2.5, Gemma 3, and Llama 3.1, and narrows the gap with Qwen 3 model families at a similar scale—delivering a strong, fully open alternative for high-quality conversational and tool-using agents.
  • Olmo 3-RL Zero (7B) is a fully open reinforcement learning pathway built on Olmo 3-Base, designed to bootstrap complex reasoning behaviors and enable clear benchmarking of RL algorithms. We release four series of checkpoints from domain-focused training on math, code, instruction following, and general chat, enabling careful study of reinforcement learning with verifiable rewards (RLVR).
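
As a quick on-ramp, here's a minimal sketch of loading one of these checkpoints with Hugging Face transformers. The repo ID below is a placeholder we've assumed for illustration; check the release pages for the exact model names.

```python
# Minimal loading sketch using Hugging Face transformers.
# NOTE: "allenai/Olmo-3-7B" is a hypothetical repo ID used for illustration;
# substitute the actual ID from the official release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Olmo-3-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer("The three stages of Olmo 3 pretraining are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```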

Instead of a single set of frozen weights, Olmo 3 offers multiple, fully documented paths through development: the Instruct path for everyday chat and tool use, the RL Zero path for RL experimentation from base models, and the Think/reasoning path for models that leverage inference-time scaling to unlock complex reasoning and agentic behaviors. Each path is a concrete example of how to shape behavior from the same base model, and you’re free to fork or remix them—start with Olmo 3-Base, explore your own supervised fine-tuning (SFT) or direct preference optimization (DPO) recipe for instruct-style use cases, or plug in a new RL objective to probe different tradeoffs. The flow itself becomes a rich, reusable object—not just a record of how we built Olmo 3, but a scaffold for how you can build your own systems.
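
To make "fork or remix" concrete, here is a rough sketch of what a custom DPO pass over the base model might look like using the open-source trl library. This is not Ai2's Open Instruct recipe; the model and dataset IDs are placeholders, and exact trl argument names vary across versions.

```python
# Hedged sketch: a custom DPO pass forked from the base checkpoint, using
# the open-source `trl` library (NOT Ai2's Open Instruct pipeline).
# Model and dataset IDs are placeholders for illustration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "allenai/Olmo-3-7B"  # hypothetical repo ID
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A public preference dataset with chosen/rejected pairs, used here as a
# stand-in for the Dolci DPO mix described later in this post.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="olmo3-dpo-fork", beta=0.1),  # beta: KL penalty strength
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```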

[Diagram: Olmo 3 Model Flow. Pretraining → Midtraining → Long context → Olmo 3 Base, which branches into three post-training paths: Instruct SFT → Instruct DPO → Instruct RL → Olmo 3 Instruct; Thinking SFT → Thinking DPO → Thinking RL → Olmo 3 Think; and RL Zero → Olmo 3 RL Zero.]


The Olmo 3 checkpoints we're releasing represent our initial paths targeting our goals around reasoning, tool use, and general capabilities – we have exciting plans for other ways to leverage Olmo 3-Base 32B. But because we're releasing the entire flow, you can intervene at any point: swap in domain-specific data during mid-training, adjust post-training for your use case, or build on an earlier checkpoint that better suits your needs. 

As with Olmo and Olmo 2, we’re releasing all components of the Olmo 3 flow – data, code, model weights, and checkpoints – under permissive open source licenses.  

Try Olmo 3 | Download the models & data | Read the report

Strong performance across the board

We run the Olmo 3 checkpoints through a broad, updated benchmark suite, grouping dozens of industry-standard tasks (plus a few new ones we introduce) into several capability clusters. Together, the clustered suite and these held-out tasks give us a capability profile of Olmo 3—a clear picture of how well it solves math problems, codes, uses tools, answers general-knowledge questions, and more. 

At a high level, the Olmo 3 family delivers the strongest fully open base and thinking models we’re aware of. Olmo 3-Base 32B outperforms other fully open base models, and Olmo 3-Think 32B emerges as the strongest fully open thinking model.

Our results were made possible by rigorous data curation at every stage of training, a carefully designed training recipe for each model, and a set of new algorithmic and infrastructure advances across data processing, training, and reinforcement learning. We also introduce an enhanced reinforcement learning framework that guides the development of our models and is particularly essential for our thinking models. To design the training recipe and coordinate targeted improvements across a wide range of capabilities at each stage of the model training pipeline, our development framework balances distributed innovation with centralized evaluation.

Olmo 3-Base, with a training pipeline that first focuses on broad coverage over diverse text, code, and math, then concentrates on harder distributions to sharpen programming, quantitative reasoning, and reading comprehension, is clearly the strongest set of fully open base models in our evaluations. It’s also arguably the best 32B model in the entire ecosystem of models with open weights, performing impressively in programming, reading comprehension, math problem solving, and long-context benchmarks like RULER, which tests information retrieval from lengthy texts. Olmo 3-Base (7B) and Olmo 3-Base (32B) maintain quality at extended context lengths and integrate cleanly with RL workflows, providing a robust foundation for continued pretraining and post-training.

Olmo 3-Think, which turns the Base into a reasoning model by training on multi-step problems spanning math, code, and general problem solving, then running the thinking SFT → thinking DPO → RLVR model flow to elicit high-quality reasoning traces, competes with or exceeds several open-weight reasoning models of similar sizes. On math benchmarks, Olmo 3-Think (7B) matches Qwen 3 8B on MATH and comes within a few points on AIME 2024 and 2025, and also leads all comparison models on HumanEvalPlus for coding—performing strongly on MBPP and LiveCodeBench to demonstrate particular strength in code-intensive reasoning. On broader reasoning tasks like BigBench Hard and AGI Eval English, Olmo 3-Think (7B) remains competitive with Qwen 3 8B reasoning and Qwen 3 VL 8B Thinker while staying fully open and slightly smaller. 

For the 32B model, Olmo 3-Think scales these trends up and becomes one of the strongest fully open reasoning models in its class. Olmo 3-Think (32B) either wins or sits within roughly two points of the best open-weight model on MATH, OMEGA, BigBenchHard, HumanEvalPlus, PopQA, and IFEval. It ties Qwen 3 VL 32B Thinking for the top score on the OMEGA suite while staying clearly ahead of Gemma 3 27B Instruct and competitive with DeepSeek R1 Distill 32B on math and reasoning. On broader knowledge and QA, Olmo 3-Think (32B) is effectively neck-and-neck with the Qwen 3 models on PopQA. And in instruction following, Olmo 3-Think (32B) tops this subset on IFEval and remains solid on IFBench and AlpacaEval 2 LC—offering a strong default for reasoning workloads at the 32B scale.

Olmo 3-Instruct, which produces shorter sequences than the corresponding Olmo 3-Think models to improve inference efficiency and is designed to focus on general chat, tool use, and synthetic data generation, outperforms comparably-sized open-weight models. Olmo 3-Instruct ties or surpasses Qwen 2.5, Gemma 3, and Llama 3.1 in our evaluations, and competes with the Qwen 3 family at similar scale, delivering strong function calling performance and instruction-following capabilities in a fully open 7B model.

The Olmo 3 architecture and training stages

Olmo 3 uses a decoder-only transformer architecture and multi-stage training pipeline. Pretraining runs in three stages—an initial large-scale training run that builds broad capabilities; a mid-training phase that focuses on harder material like math, code, and reading comprehension; and a final long-context extension stage that trains the model on very long documents. Together with architectural enhancements, this yields a more capable, efficient base for the Olmo 3 family.
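
For orientation, the three pretraining stages and their token budgets (quoted in the data section below) can be summarized as a simple schedule. The schema here is our own illustration, not Ai2's actual configuration format.

```python
# Illustrative summary of the three pretraining stages; the token budgets
# come from this post, but the schema is ours, not Ai2's config format.
PRETRAINING_STAGES = [
    {"stage": "pretraining",  "data": "Dolma 3 Mix",      "tokens": 5.9e12},
    {"stage": "midtraining",  "data": "Dolma 3 Dolmino",  "tokens": 100e9},
    {"stage": "long_context", "data": "Dolma 3 Longmino", "tokens": 50e9},
]
```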

Post-training then specializes the pretrained model for different use cases. Building on Olmo 2, each pathway follows a three-stage recipe – SFT, preference tuning with DPO, and RLVR – but in Olmo 3, we expose this as a fully documented model flow with complete customization over each training stage and dataset mix.
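
The "verifiable rewards" in RLVR come from programmatic checks rather than a learned reward model. As a toy example (our own, assuming a common \boxed{} answer convention rather than Olmo 3's actual verifier), a math reward might look like this:

```python
# Toy verifiable reward for math-style RLVR: score 1.0 only if the final
# \boxed{...} answer matches the reference. This assumes a common answer
# convention; Olmo 3's actual verifiers may differ.
import re

def math_reward(completion: str, gold_answer: str) -> float:
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no checkable answer -> no reward
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(math_reward(r"... I believe it's \boxed{41}", "42"))    # 0.0
```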

Instead of releasing only the final weights, we provide checkpoints from each major training milestone: the base pretrained model, the mid-trained model after targeted skill enhancement, the long-context-extended version, plus post-training checkpoints for the Olmo 3-Think, Olmo 3-Instruct, and Olmo 3-RL Zero flows. You can study how capabilities emerge over time, run ablations on specific stages, and fork the model at whatever point best fits your data, compute, and goals.

Expanded training data

Compared to Olmo 2, we scaled data collection and significantly strengthened our dataset curation methods. Continuing our commitment to full transparency, we’re releasing several new, higher-quality datasets that cover every stage of base model training and post-training—from initial learning to specialized skills like complex reasoning and long-context understanding. This means anyone can see exactly what data shaped the model’s capabilities, reproduce our results, and reuse these datasets to train their own AI systems.

Olmo 3 is pretrained on Dolma 3, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with olmOCR, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don’t collect from sites that explicitly disallow it, including paywalled content.
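
The decontamination step is conceptually simple even if the production tooling (the decon and duplodocus tools listed later in this post) is far more scalable. A toy version of n-gram decontamination, our own illustration rather than Ai2's implementation, looks like this:

```python
# Toy n-gram decontamination: drop any training document that shares a long
# n-gram with an evaluation example. Illustrative only; Ai2's released
# `decon` and `duplodocus` tools implement the scalable versions.
def ngrams(text: str, n: int = 13) -> set[str]:
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_docs: list[str], eval_docs: list[str], n: int = 13) -> list[str]:
    eval_grams: set[str] = set()
    for doc in eval_docs:
        eval_grams |= ngrams(doc, n)
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(eval_grams)]
```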

On top of this, we introduce two Dolma 3-based mixes for later stages of base model training. Dolma 3 Dolmino is our mid-training mix: 100B training tokens sampled from a ~2.2T-token pool of high-quality math, science, code, instruction-following, and reading-comprehension data, including reasoning traces that also enable RL directly on the base model. Dolma 3 Longmino is our long-context mix: ~50B training tokens drawn from a 639B-token pool of long documents combined with mid-training data to teach Olmo 3 to track information over very long inputs (like reports, logs, and multi-chapter documents).
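
For a sense of scale, the quoted budgets imply that both later-stage mixes sample only a small fraction of their pools (rough arithmetic from the numbers above, not official figures):

```python
# Back-of-envelope sampling fractions implied by the quoted token budgets.
dolmino_pool, dolmino_budget = 2.2e12, 100e9    # mid-training
longmino_pool, longmino_budget = 639e9, 50e9    # long-context

print(f"Dolmino:  {dolmino_budget / dolmino_pool:.1%} of its pool")   # ~4.5%
print(f"Longmino: {longmino_budget / longmino_pool:.1%} of its pool") # ~7.8%
```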

We also introduce Dolci, a new post-training data suite tailored specifically for reasoning, tool use, and instruction following. Dolci provides separate mixes for each stage of post-training: SFT, DPO, and RLVR. For SFT, Dolci aggregates state-of-the-art datasets that advance step-by-step reasoning, tool use, and high-quality conversational behavior; for DPO, it supplies high-quality contrastive preference data; and for RL, it includes hard, diverse prompts across math, coding, instruction following, and general chat. 

Together, Dolma 3 and Dolci give Olmo 3 a fully open data curriculum from first token to final post-trained checkpoint.

Efficient training stack

We pretrained Olmo 3 on a cluster of up to 1,024 H100 GPUs; we achieved training throughput of 7.7K tokens per device per second for Olmo 3-Base (7B). We mid-trained on 128 H100 GPUs, and post-trained on a set of 256 H100s.
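
Assuming ideal linear scaling across the cluster (an idealization; real runs include restarts, evaluation pauses, and stragglers), those figures imply a wall-clock estimate on the order of days for the ~5.9T-token pretraining mix:

```python
# Back-of-envelope pretraining time from the quoted throughput, assuming
# perfectly linear scaling across devices (an idealization).
tokens_per_device_per_s = 7.7e3
devices = 1024
total_tokens = 5.9e12  # Dolma 3 Mix budget

cluster_rate = tokens_per_device_per_s * devices   # ~7.9M tokens/s
print(f"{total_tokens / cluster_rate / 86400:.1f} days at ideal throughput")  # ~8.7
```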

For Olmo 3, building on the work we did for Olmo 2, we were able to significantly improve the efficiency of our post-training code. By moving SFT from Open Instruct (our post-training codebase, prioritizing flexibility) to Olmo Core (our pretraining codebase, designed to maximize efficiency), we increased throughput (tokens/second) by 8x. Similarly, by incorporating in-flight weight updates, continuous batching, and a lot of threading improvements, we made our RL training 4x more efficient—resulting in training runs that are significantly cheaper and faster. 

A note on our 32B models: We believe 32B sits in a sweet spot for research and tinkering. 32B models are big enough to support strong, competitive performance, but still small enough that a wide audience can fine-tune and deploy them on accessible hardware.

For more details, including ablations, please read our technical report.

Transparency at the core

A core goal of Olmo 3 is not just to open the model flow, but to make it actionable for people who want to understand and improve model behavior. Olmo 3 integrates with OlmoTrace, our tool for tracing model outputs back to training data in real time.

For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why—and adjust data or training decisions accordingly.

To further promote transparency and explainability, we’re making every training and fine-tuning dataset available for download, all under a permissive license that allows for custom deployment and reuse. The datasets come in a range of mixes to accommodate different storage and hardware constraints, from several billion tokens all the way up to 6 trillion.

Our new tooling for data processing allows you to de-contaminate, tokenize, and de-duplicate data in the same way we did for Olmo 3’s corpora. All the tooling is open source, enabling you to replicate our training curves or run controlled ablations across data mixes and objectives. 

Our Olmo utilities and software cover the whole development cycle:

  • Olmo-core is a state-of-the-art framework for distributed model training.
  • Open Instruct is our post-training pipeline. 
  • datamap-rs is a pure-Rust toolkit for large-scale cleaning.
  • duplodocus is a tool for ultra-efficient fuzzy de-duplication.
  • OLMES is a toolkit for reproducible evals. It includes our brand-new eval collection OlmoBaseEval, which we used for Olmo 3 base model development.
  • decon removes test sets from training data.

Importantly, our tooling allows you to instrument complex tasks and analyze intermediate traces to understand where the models succeed—or struggle. Because the Olmo 3 data recipes, training pipeline, and checkpoints are open, independent teams can connect model behavior back to measurable properties. 

Ready to deploy and use

Together, the Olmo 3 family makes it easier to build trustworthy features quickly, whether for research, education, or applications. By making every development step available and inspectable, we're enabling entirely new categories of research. You can run experiments on any training phase, understand exactly how different techniques contribute to model capabilities, and build on our work at whatever stage makes sense for your project.

For scientists, the fully open flow exposes the model’s inner workings, so you can instrument experiments across coding, reasoning, RL, and tool use. 

If you care about AI you can study, audit, and improve, Olmo 3 is for you. Try the demos in the Ai2 Playground, explore the documentation, and build on the released weights and checkpoints. Then tell us what you discover—we invite the community to validate, critique, and extend our findings.

True openness in AI isn't just about access—it's about trust, accountability, and shared progress. We believe the models shaping our future should be fully inspectable, not black boxes. Olmo 3 represents a different path: one where anyone can understand, verify, and build upon the AI systems that increasingly influence our world. This is what open-first means—not just releasing weights, but sharing the complete knowledge needed to advance AI responsibly: the flow.

Try Olmo 3 | Download the models & data | Read the report
