Step 3.5 Flash: Fast Enough to Think. Reliable Enough to Act

Original link: https://static.stepfun.com/blog/step-3.5-flash/

## Step 3.5 Flash: A Fast, Accessible Language Model

Step 3.5 Flash is a 196-billion-parameter language model built for speed and efficient inference, activating only 11 billion parameters per token. It achieves this through a **sparse Mixture-of-Experts (MoE)** architecture and a **hybrid attention mechanism** that combines Sliding-Window Attention (SWA) with full attention, optimized for **speculative decoding**, which predicts and verifies multiple tokens in parallel.

Key innovations include an increased query-head count in the SWA layers, which strengthens representations at no extra cost, and **head-wise gated attention** for numerical stability. Together these enable decoding throughput of up to 350 tokens per second on NVIDIA Hopper GPUs.

Importantly, Step 3.5 Flash is designed for **local deployment**, running on hardware such as the Apple M4 Max and NVIDIA DGX Spark. It ships in INT4/INT8 quantized formats, supporting a 256K context window even on edge devices.

A novel **reinforcement learning (RL) framework** based on **Metropolis Independence Sampling Filtered Policy Optimization (MIS-PO)** delivers stable, scalable training of reasoning capabilities by addressing training-inference mismatch and off-policy drift. This enables continuous self-improvement in domains such as mathematics, coding, and tool use.


Original Article

Architecture Optimized for Flash-Speed Decoding and Inference

The architecture of Step 3.5 Flash is defined by a model-system co-design that treats inference cost and speed as the core architectural constraint. We employ a Sparse Mixture-of-Experts (MoE) backbone to decouple global model capacity from per-token computation. While the total knowledge base spans 196B parameters, the system activates only 11B parameters per token during inference. To further reduce memory overhead, we strategically use dense layers for the first few layers of the network to maintain high intelligence density.
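
As a concrete illustration of sparse-active execution, the sketch below shows a minimal top-k MoE layer in PyTorch. Class and parameter names are hypothetical, not StepFun's implementation: the point is only that the router selects k experts per token, so active parameters stay a small fraction of the total.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal top-k MoE layer: only k of n_experts FFNs run per token,
    so active parameters are a small fraction of total parameters."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: [tokens, d_model]
        gate = F.softmax(self.router(x), dim=-1)               # routing probabilities
        weights, idx = gate.topk(self.k, dim=-1)               # keep only the top-k experts
        weights = weights / weights.sum(-1, keepdim=True)      # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                                  # tokens routed to expert e
            if mask.any():
                tok, slot = mask.nonzero(as_tuple=True)
                out[tok] += weights[tok, slot, None] * expert(x[tok])
        return out
```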

To navigate the quadratic bottleneck of long-context processing, we leverage a hybrid attention layout that interleaves Sliding-Window Attention (SWA) with Full Attention at a 3:1 ratio. We specifically opted for SWA over linear alternatives to maintain the architectural flexibility required for speculative decoding. SWA is inherently compatible with Multi-Token Prediction (MTP) heads. These heads predict additional future tokens in parallel with the primary output, enabling parallel verification. This allows the model to validate multiple token hypotheses in a single pass, effectively breaking the serial constraints of standard autoregressive decoding.
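
The sketch below illustrates the verification half of this scheme, assuming an HF-style model whose forward pass returns `.logits`; `verify_draft` is a hypothetical helper and is agnostic to how the MTP heads produced the draft. One forward pass scores every drafted token, and the longest prefix matching the model's own greedy choices is accepted.

```python
import torch

@torch.no_grad()
def verify_draft(model, prefix_ids: torch.Tensor, draft_ids: torch.Tensor) -> torch.Tensor:
    """Greedy speculative-decoding verification (hypothetical helper).
    prefix_ids: [1, T] committed tokens; draft_ids: [1, k] proposed tokens."""
    seq = torch.cat([prefix_ids, draft_ids], dim=-1)       # [1, T + k]
    logits = model(seq).logits                             # one pass scores all drafted tokens
    k = draft_ids.shape[-1]
    # Position T-1+i predicts the i-th drafted token, so compare greedy picks to the draft.
    preds = logits[0, -k - 1:-1].argmax(-1)                # [k] model's own choices
    match = (preds == draft_ids[0])
    n_accept = int(match.long().cumprod(0).sum())          # longest matching prefix
    if n_accept == k:
        nxt = logits[0, -1].argmax(-1, keepdim=True)       # whole draft accepted: bonus token
    else:
        nxt = preds[n_accept:n_accept + 1]                 # model's correction at first mismatch
    return torch.cat([draft_ids[0, :n_accept], nxt])       # tokens committed this step
```

Accepting n tokens plus one correction per pass is what breaks the one-token-per-step serial constraint, at the cost of a single wider forward pass.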

To ensure this lightweight hybrid structure retains peak performance, we implemented two critical enhancements. We utilized an augmented query-head count in the SWA layers—increasing from 64 to 96—to strengthen representational power without expanding the \(KV\) cache footprint. This modification is highly efficient: since the attention window is fixed, the computational cost of these additional heads remains constant regardless of total sequence length. This allows us to scale up model expressiveness without the "long-context penalty" where attention costs usually explode as the conversation grows. Complementing this is our Head-wise Gated Attention, which functions as an input-dependent attention sink. By dynamically modulating information flow, this mechanism preserves numerical stability while incurring negligible overhead.
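
A minimal sketch of head-wise gating is shown below (simplified and hypothetical; the production kernel differs). A sigmoid gate computed from the input scales each head's output, giving the model an input-dependent way to attenuate a head rather than parking attention mass on a sink token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseGatedAttention(nn.Module):
    """Sketch: a per-head, per-token sigmoid gate modulates each head's output,
    acting as an input-dependent attention sink (hypothetical, simplified)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, n_heads)   # one scalar gate per head, per token
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: [B, T, d_model]
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # [B, H, T, d_head]
        g = torch.sigmoid(self.gate(x)).transpose(1, 2).unsqueeze(-1)   # [B, H, T, 1]
        out = (g * attn).transpose(1, 2).reshape(B, T, -1)              # gate each head's output
        return self.proj(out)
```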

These strategic architectural refinements demonstrate that frontier-level reasoning can be decoupled from prohibitive latency. By integrating sparse-active execution with concurrent token verification, the model achieves a decoding throughput up to 350 tokens per second (TPS) on NVIDIA Hopper GPUs while running SWE-bench Verified.

Last but not least, the optimized total parameter scale of Step 3.5 Flash facilitates highly accessible, local inference. By consolidating its total capacity to a scale compatible with high-end personal hardware, the model supports high-fidelity private deployment on workstations such as the Apple M4 Max, NVIDIA DGX Spark, or AMD AI Max+ 395, providing a 100% trusted execution environment.

Architecture

Figure: The overall architecture of Step 3.5 Flash.

As local deployment of large language models (LLMs) becomes increasingly prevalent, we have adapted Step 3.5 Flash to the NVIDIA DGX Spark (128 GB) using the edge-side inference engine llama.cpp, and simultaneously released INT4-quantized model weights in GGUF format. On the NVIDIA DGX Spark, Step 3.5 Flash achieves a generation speed of 20 tokens per second; by applying INT8 quantization to the KV cache, it supports an extended context window of up to 256K tokens, delivering long-text processing on par with cloud-based inference. Developers can test the new model on NVIDIA accelerated infrastructure via build.nvidia.com.
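
For reference, a minimal local-inference sketch using the llama-cpp-python bindings for llama.cpp might look as follows. The GGUF filename is hypothetical, and whether the full 256K context fits depends on available memory and KV-cache quantization.

```python
# Minimal local-inference sketch via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="step-3.5-flash-INT4.gguf",  # hypothetical path to the released INT4 GGUF weights
    n_ctx=262144,                            # 256K-token context window, memory permitting
    n_gpu_layers=-1,                         # offload all layers to the accelerator
)

out = llm("Explain speculative decoding in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```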

Scalable RL Unleashes the Reasoning Potential

We introduce a scalable reinforcement learning framework designed to reliably train reasoning and agentic language models at scale.

Modern RL pipelines for LLMs rely on high-throughput inference engines to generate rollouts, while optimization happens asynchronously in a separate training system. At scale, this setup introduces two compounding challenges:

  1. Training–inference mismatch, caused by numerical and architectural differences between systems
  2. Off-policy drift, as policies evolve while rollouts lag behind

For long reasoning sequences, even minor token-level discrepancies can explode into extreme importance weights—leading to unstable updates, early convergence, or complete training collapse.

To address this, we propose Metropolis Independence Sampling Filtered Policy Optimization (MIS-PO), which replaces fragile importance weighting with strict sample filtering. Instead of scaling gradients with continuous importance-sampling ratios as in PPO, MIS-PO uses these ratios solely as a binary acceptance criterion. Trajectories whose likelihood deviates too far between the inference and training policies are simply excluded from optimization, while accepted samples are treated as effectively on-policy. Concretely, the policy update is driven by

\[\mathcal{L}_{actor} = - \mathbb{E}_{\tau \sim \pi_{\theta_\text{vllm}}} \left[ \mathbb{I}(\tau) \cdot \log \pi_\theta(a_t|s_t) \cdot \hat{A}_t \right],\]

where the binary indicator \(\mathbb{I}(\tau)\) filters out off-distribution samples. This design dramatically reduces gradient variance and enables stable, long-horizon optimization without aggressive clipping.
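
A minimal PyTorch sketch of this update follows, under stated assumptions: the acceptance test is applied per trajectory, and the threshold `eps` is a hypothetical hyperparameter; the exact acceptance rule in our training system may differ.

```python
import torch

def mispo_actor_loss(logp_train: torch.Tensor,    # [N, T] log-probs under the training policy
                     logp_rollout: torch.Tensor,  # [N, T] log-probs under the inference engine
                     advantages: torch.Tensor,    # [N, T] advantage estimates A_hat
                     mask: torch.Tensor,          # [N, T] 1 for valid tokens, 0 for padding
                     eps: float = 0.5) -> torch.Tensor:
    """Sketch of the MIS-PO update: the importance ratio is used only as a
    binary accept/reject test per trajectory, never as a gradient scale."""
    # Per-trajectory log importance ratio between training and rollout policies.
    log_ratio = ((logp_train - logp_rollout) * mask).sum(-1).detach()
    accept = (log_ratio.abs() <= eps).float()            # I(tau): keep near-on-policy samples
    # Accepted trajectories are treated as on-policy: plain log-prob times advantage.
    per_token = -logp_train * advantages * mask
    per_traj = per_token.sum(-1) / mask.sum(-1).clamp(min=1)
    return (accept * per_traj).sum() / accept.sum().clamp(min=1)
```

Because rejected trajectories contribute nothing, no extreme ratio can ever scale a gradient, which is the source of the variance reduction described above.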

Our framework also includes truncation-aware value bootstrapping, which prevents long reasoning trajectories from being incorrectly penalized when hitting context limits, and routing confidence monitoring for Mixture-of-Experts models, providing a practical signal for RL stability at scale.
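
As a sketch of the first component (interface and names are hypothetical): when a rollout is cut off by the context limit rather than genuinely terminating, the tail of the return is bootstrapped from the critic's value estimate instead of an implicit zero.

```python
import torch

def discounted_returns(rewards: torch.Tensor,   # [T] per-step rewards
                       bootstrap_value: float,  # critic's value estimate at the cutoff state
                       truncated: bool,         # True if the rollout hit the context limit
                       gamma: float = 1.0) -> torch.Tensor:
    """Sketch of truncation-aware value bootstrapping: truncated rollouts get the
    critic's estimate as the tail return, so long reasoning trajectories are not
    penalized merely for running out of context."""
    tail = bootstrap_value if truncated else 0.0
    returns = torch.empty_like(rewards)
    for t in reversed(range(rewards.shape[0])):
        tail = rewards[t].item() + gamma * tail
        returns[t] = tail
    return returns
```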

Together, these components turn reinforcement learning into a reliable engine for continuous self-improvement, enabling consistent gains across mathematics, coding, and tool use, while remaining stable under large-scale, off-policy training.

RL Algorithm Ablation

Figure: Training dynamics of different RL algorithms. Ablations are conducted on the Qwen model.
