RynnBrain

Original link: https://github.com/alibaba-damo-academy/RynnBrain

## RynnBrain: An Embodied Foundation Model

RynnBrain is a new embodied foundation model designed to understand and interact with the physical world. Code and model checkpoints (2B, 8B, and 30B-A3B variants) have been released, and the model excels at tasks that demand detailed video understanding, spatial reasoning, and precise planning.

Key capabilities include strong egocentric understanding (e.g., embodied QA and object counting), accurate spatio-temporal localization, and a distinctive interleaved reasoning approach that grounds language in physical reality. Specialized post-trained models are also provided for robot task planning (RynnBrain-Plan), vision-language navigation (RynnBrain-Nav), and chain-of-point reasoning (RynnBrain-CoP).

RynnBrain is built on Qwen3-VL with a unified encoder-decoder architecture and is trained on large amounts of spatio-temporal and physical data. It is evaluated with the new RynnBrain-Bench benchmark, which focuses on object and spatial cognition, grounding, and pointing. Demos, cookbooks, and pretraining/evaluation details (via RynnScale) are available.


💫 Project Page | Models & Bench 🤗 🤖 | 🚀 Demo | 📚 Cookbooks

  • [2026.02.15] 🔥🔥 Release our Technical Report!!
  • [2026.02.09] 🔥🔥 Release our code and model checkpoints!!

We present RynnBrain, an embodied foundation model grounded in physical reality. RynnBrain is available in two dense variants (2B and 8B) and one mixture-of-experts (MoE) model (30B-A3B). In addition, we release three post‑trained models: RynnBrain‑Plan (robot task planning), RynnBrain‑Nav (vision-language navigation), and RynnBrain‑CoP (chain-of-point reasoning).

  • Comprehensive egocentric understanding: Excels in fine-grained video understanding and egocentric cognition, covering tasks such as embodied QA, counting, and OCR.
  • Diverse spatio-temporal localization: Possesses powerful localization capabilities across episodic memory, enabling precise identification of objects, target areas, and motion trajectories.
  • Physical-space reasoning: Employs an interleaved reasoning strategy that alternates between textual and spatial grounding, ensuring that its reasoning processes are firmly rooted in the physical environment.
  • Physics-aware precise planning: Integrates located affordances and object information into planning, enabling downstream VLA models to execute intricate tasks with fine-grained instructions.

RynnBrain employs a unified encoder-decoder architecture (supporting both Dense and MoE variants) to transform omni-vision inputs and textual instructions into multi-modal outputs, including spatial trajectories, physical pointing, and action planning. Through massive training on rich spatio-temporal, physical-space, and general knowledge data, RynnBrain maintains robust general-purpose capabilities while specializing in diverse, fine-grained embodied reasoning and complex planning tasks.
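The outputs named above (spatial trajectories, physical pointing, action plans) are presumably decoded from the model's generated text rather than returned as library objects; the dataclasses below are only an illustrative sketch of the kind of information each output carries, not types defined by this repository (see the cookbooks for the actual output formats).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative containers only; RynnBrain's real outputs are defined by its
# text/token schema, which is documented in the cookbooks.

@dataclass
class PointedObject:
    """An object reference grounded to a 2D point in a specific video frame."""
    label: str
    frame_index: int
    point: Tuple[float, float]  # pixel coordinates (x, y)

@dataclass
class Trajectory:
    """A spatial trajectory, e.g. a predicted motion path across frames."""
    frame_indices: List[int]
    points: List[Tuple[float, float]]

@dataclass
class PlanStep:
    """One step of an action plan, optionally grounded to located objects."""
    instruction: str
    targets: List[PointedObject] = field(default_factory=list)
```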

  • General Embodied Understanding

  • Vision-Language Navigation

| Model | Base Model | HuggingFace | ModelScope |
| --- | --- | --- | --- |
| RynnBrain-2B | Qwen3-VL-2B-Instruct | Link | Link |
| RynnBrain-8B | Qwen3-VL-8B-Instruct | Link | Link |
| RynnBrain-30B-A3B | Qwen3-VL-30B-A3B-Instruct | Link | Link |
| RynnBrain-CoP-8B | RynnBrain-8B | Link | Link |
| RynnBrain-Plan-8B | RynnBrain-8B | Link | Link |
| RynnBrain-Plan-30B-A3B | RynnBrain-30B-A3B | Link | Link |
| RynnBrain-Nav-8B | RynnBrain-8B | Link | Link |
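As a hedged sketch of fetching one of these checkpoints, assuming they are hosted on the HuggingFace Hub as the table indicates, huggingface_hub can download a snapshot locally; the repo id below is a placeholder, not a confirmed identifier, so substitute the actual link from the table.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the actual RynnBrain entry from the table above.
local_dir = snapshot_download(
    repo_id="<org>/RynnBrain-8B",
    local_dir="./checkpoints/RynnBrain-8B",
)
print(f"Checkpoint downloaded to {local_dir}")
```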

Minimal dependencies:

pip install transformers==4.57.1

Run text generation:

from transformers import AutoModelForImageTextToText

# Load a RynnBrain checkpoint; fill in the model path from the table above.
model = AutoModelForImageTextToText.from_pretrained("")
...
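For visual inputs, the snippet below is a minimal sketch of multimodal inference using the standard transformers image-text-to-text chat-template API. The checkpoint path, image URL, question, and generation settings are placeholders rather than values from this repository; the cookbooks document the exact prompting format RynnBrain expects.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder path: substitute a RynnBrain checkpoint from the table above.
model_path = "<path-to-RynnBrain-checkpoint>"

model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# A single image plus an embodied-QA style instruction (both placeholders).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/kitchen.jpg"},
            {"type": "text", "text": "How many mugs are on the counter, and where is the nearest one?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Decode only the newly generated tokens; raise max_new_tokens for long plans or reasoning traces.
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```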

Check out the cookbooks that showcase RynnBrain's capabilities in cognition, localization, reasoning, and planning.

Pretraining & Evaluation

Please refer to RynnScale for details of pretraining and evaluation.

Finetuning

  • Reasoning: RynnBrain introduces an interleaved reasoning approach that combines grounding with textual information directly within egocentric video streams. This paradigm effectively bridges the cognitive gap between language and the physical world, ensuring the reasoning process is robustly anchored in reality.

  • Navigation: We trained a vision-language navigation model based on the RynnBrain base model. Empirical evaluation demonstrates that fine-tuning the vision-language model on RynnBrain yields superior performance compared to fine-tuning on other foundational models.

  • Planning: RynnBrain integrates the locations of affordances, areas, and objects directly into its planning outputs. Consequently, even highly intricate and fine-grained tasks can be effectively addressed within our hierarchical RynnBrain-VLA system architecture (see the sketch below for how such grounded outputs could be consumed).
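The sketch below does not use the repository's actual output schema. It only illustrates, under an assumed `<point>(x, y)</point>` annotation format with coordinates normalized to a 0-1000 grid, how pointing annotations embedded in a reasoning or planning trace could be parsed into pixel coordinates for a downstream VLA controller; refer to the cookbooks for the real format.

```python
import re
from typing import List, Tuple

# Hypothetical annotation format: free-form text interleaved with point tags such
# as "<point>(412, 655)</point>", with coordinates normalized to a 0-1000 grid.
POINT_PATTERN = re.compile(r"<point>\(\s*(\d+)\s*,\s*(\d+)\s*\)</point>")

def extract_grounded_points(trace: str, width: int, height: int) -> List[Tuple[int, int]]:
    """Convert every point annotation in a reasoning/plan string to pixel coordinates."""
    points = []
    for x_str, y_str in POINT_PATTERN.findall(trace):
        # Rescale from the assumed normalized grid to the source frame resolution.
        points.append((int(x_str) * width // 1000, int(y_str) * height // 1000))
    return points

example = (
    "Pick up the red mug <point>(412, 655)</point> and place it on the tray "
    "<point>(180, 720)</point>."
)
print(extract_grounded_points(example, width=1280, height=720))
# -> [(527, 471), (230, 518)]
```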

We introduce RynnBrain-Bench, a high-dimensional benchmark for embodied understanding that evaluates models across four key dimensions: object cognition, spatial cognition, grounding, and pointing—highlighting fine-grained understanding and spatiotemporal localization across episodic video sequences.

For details, please refer to RynnBrain-Bench.
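RynnBrain-Bench defines its own scoring protocol; purely as an assumption for illustration, pointing-style predictions are often judged by whether the predicted point lands inside the ground-truth region, which could be computed as follows.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def point_in_box(point: Tuple[float, float], box: Box) -> bool:
    """True if a predicted point falls inside the ground-truth bounding box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def pointing_accuracy(preds: Sequence[Tuple[float, float]], gts: Sequence[Box]) -> float:
    """Fraction of predicted points that hit their ground-truth regions."""
    if not gts:
        return 0.0
    return sum(point_in_box(p, b) for p, b in zip(preds, gts)) / len(gts)

# Toy example: the first prediction is inside its box, the second is not.
print(pointing_accuracy([(50, 60), (10, 10)], [(40, 40, 80, 80), (100, 100, 140, 140)]))  # 0.5
```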

💡 Some other multimodal-LLM projects from our team may interest you ✨.

RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang*, Yuqian Yuan*, Yunxuan Mao*, Kehan Li*, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li
GitHub | arXiv

RynnScale
RynnScale Team
GitHub

RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
GitHub | arXiv

RynnVLA-002: A Unified Vision-Language-Action and World Model
Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, Hao Chen
GitHub | arXiv

RynnRCP: Open Robotics Context Protocol and RobotMotion
RynnBot Team
GitHub

RynnMotion: All-In-One Toolkit for Fast Robot Prototyping and Heterogeneous Teleoperation
RynnBot Team
GitHub

Our RynnBrain is built on top of Qwen3-VL. We also learned a lot from the implementation of RynnEC and VideoRefer. If your work is used in RynnBrain but not mentioned in either this repo or the technical report, feel free to let us know ❤️.

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
