Architecture Optimized for Flash-Speed Decoding and Inference
The architecture of Step 3.5 Flash is defined by a model-system co-design that treats inference cost and speed as the core architectural constraint. We employ a Sparse Mixture-of-Experts (MoE) backbone to decouple global model capacity from per-token computation: while the total knowledge base spans 196B parameters, only 11B parameters are activated per token during inference. To further reduce memory overhead, we use dense layers for the first few layers of the network while maintaining high intelligence density.
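As a rough illustration of how sparse activation decouples total capacity from per-token compute, the sketch below implements a generic top-k MoE layer in PyTorch; the expert count, top-k value, and dimensions are arbitrary placeholders rather than the Step 3.5 Flash configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k Mixture-of-Experts layer (illustrative sketch only).

    All parameters live in `experts`, but each token is processed by only
    `k` of them, so per-token FLOPs scale with k rather than num_experts.
    """

    def __init__(self, d_model=512, d_ff=2048, num_experts=32, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        gate_logits = self.router(x)             # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):   # send each token only to its selected experts
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out
```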
To navigate the quadratic bottleneck of long-context processing, we leverage a hybrid attention layout that interleaves Sliding-Window Attention (SWA) with Full Attention at a 3:1 ratio. We specifically opted for SWA over linear alternatives to preserve the architectural flexibility required for speculative decoding: SWA is inherently compatible with Multi-Token Prediction (MTP) heads, which predict additional future tokens in parallel with the primary output and enable parallel verification. This allows the model to validate multiple token hypotheses in a single pass, effectively breaking the serial constraints of standard autoregressive decoding.
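As a loose illustration of how drafted tokens are checked, the snippet below sketches the standard greedy accept-the-longest-matching-prefix rule used in speculative decoding; it assumes greedy (argmax) decoding and is not the model's actual verification logic.

```python
def verify_draft(target_argmax, draft_tokens):
    """Greedy speculative-decoding acceptance (illustrative sketch).

    target_argmax: the main model's greedy prediction at every draft position,
                   plus one more, all obtained from ONE forward pass over
                   [context + draft_tokens].
    draft_tokens:  tokens proposed in parallel (e.g. by MTP-style heads).
    Returns the accepted tokens; every call yields at least one new token.
    """
    accepted = []
    for i, t_draft in enumerate(draft_tokens):
        if target_argmax[i] == t_draft:
            accepted.append(t_draft)               # draft agrees with the main model
        else:
            accepted.append(target_argmax[i])      # first mismatch: take the main model's token, stop
            return accepted
    accepted.append(target_argmax[len(draft_tokens)])  # all drafts accepted: one bonus token
    return accepted

print(verify_draft([11, 12, 99, 7], [11, 12, 13]))  # -> [11, 12, 99]
print(verify_draft([11, 12, 13, 7], [11, 12, 13]))  # -> [11, 12, 13, 7]
```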
To ensure this lightweight hybrid structure retains peak performance, we implemented two critical enhancements. First, we increased the query-head count in the SWA layers from 64 to 96 to strengthen representational power without expanding the KV cache footprint. This modification is highly efficient: since the attention window is fixed, the computational cost of the additional heads remains constant regardless of total sequence length. This allows us to scale up model expressiveness without the "long-context penalty" in which attention costs usually explode as the conversation grows. Complementing this is our Head-wise Gated Attention, which functions as an input-dependent attention sink: by dynamically modulating information flow, it preserves numerical stability while incurring negligible overhead.
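To see why the extra query heads leave the KV cache untouched and avoid a long-context penalty, the back-of-the-envelope sketch below separates the two costs; the head dimension, window size, and KV-head count are hypothetical, and only the 64-to-96 query-head change is taken from the text.

```python
def kv_cache_bytes(num_kv_heads, head_dim, window_tokens, dtype_bytes=2):
    """KV-cache footprint of one SWA layer (keys + values): depends on KV heads,
    head dimension, and the fixed window, never on the query-head count."""
    return 2 * num_kv_heads * head_dim * window_tokens * dtype_bytes

def swa_attention_flops_per_token(num_q_heads, head_dim, window_tokens):
    """Approximate per-token attention FLOPs inside a sliding window:
    QK^T and attention-weighted V each cost ~2*window*head_dim per query head."""
    return 4 * num_q_heads * window_tokens * head_dim

# Hypothetical head_dim, window, and KV-head count purely for illustration:
base = swa_attention_flops_per_token(num_q_heads=64, head_dim=128, window_tokens=4096)
wide = swa_attention_flops_per_token(num_q_heads=96, head_dim=128, window_tokens=4096)
print(wide / base)  # 1.5x compute per token, constant regardless of total sequence length
print(kv_cache_bytes(num_kv_heads=8, head_dim=128, window_tokens=4096))  # unchanged by query heads
```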
These strategic architectural refinements demonstrate that frontier-level reasoning can be decoupled from prohibitive latency. By integrating sparse-active execution with concurrent token verification, the model achieves a decoding throughput of up to 350 tokens per second (TPS) on NVIDIA Hopper GPUs while running SWE-bench Verified.
Last but not least, the optimized total parameter scale of Step 3.5 Flash facilitates highly accessible, local inference. By consolidating its total capacity to a scale compatible with high-end personal hardware, the model supports high-fidelity private deployment on workstations such as the Apple M4 Max, NVIDIA DGX Spark, or AMD AI Max+ 395, keeping execution and data fully on-device.
As local deployment of large language models (LLMs) becomes increasingly prevalent, we have adapted Step 3.5 Flash to the NVIDIA DGX Spark (128 GB) using the edge-side inference engine llama.cpp, and have simultaneously released INT4-quantized model weights in GGUF format. On the NVIDIA DGX Spark, Step 3.5 Flash achieves a generation speed of 20 tokens per second; with INT8 quantization of the KV cache, it supports an extended context window of up to 256K tokens, delivering long-text processing capabilities on par with cloud-based inference. Developers can test the new model on NVIDIA accelerated infrastructure via build.nvidia.com.
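For reference, a minimal local-inference sketch using the llama-cpp-python bindings; the GGUF filename and context size are placeholders, and the INT8 KV-cache setting is omitted here (it would be enabled through llama.cpp's KV-cache type options).

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical GGUF filename; substitute the released INT4 weights.
llm = Llama(
    model_path="step-3.5-flash-int4.gguf",
    n_ctx=32768,        # raise toward the full context window as memory allows
    n_gpu_layers=-1,    # offload all layers to the GPU
)

out = llm("Explain sliding-window attention in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```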
Scalable RL Unleashes the Reasoning Potential
We introduce a scalable reinforcement learning framework designed to reliably train reasoning and agentic language models at scale.
Modern RL pipelines for LLMs rely on high-throughput inference engines to generate rollouts, while optimization happens asynchronously in a separate training system. At scale, this setup introduces two compounding challenges:
- Training–inference mismatch, caused by numerical and architectural differences between systems
- Off-policy drift, as policies evolve while rollouts lag behind
For long reasoning sequences, even minor token-level discrepancies compound into extreme importance weights, leading to unstable updates, premature convergence, or complete training collapse.
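To make the failure mode concrete, the toy calculation below shows how a 0.1% average per-token likelihood mismatch compounds multiplicatively over a long trajectory; the numbers are purely illustrative.

```python
import math

def sequence_importance_ratio(per_token_log_ratio, seq_len):
    """Sequence-level IS weight = product of per-token ratios = exp(sum of log-ratios)."""
    return math.exp(per_token_log_ratio * seq_len)

# Average per-token ratio of 1.001 between the inference and training policies:
for seq_len in (1_000, 10_000, 50_000):
    print(seq_len, sequence_importance_ratio(math.log(1.001), seq_len))
# 1,000  tokens -> ~2.7
# 10,000 tokens -> ~2.2e4
# 50,000 tokens -> ~5e21
```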
To address this, we propose Metropolis Independence Sampling Filtered Policy Optimization (MIS-PO), which replaces fragile importance weighting with strict sample filtering. Instead of scaling gradients with continuous importance-sampling ratios as in PPO, MIS-PO uses these ratios solely as a binary acceptance criterion. Trajectories whose likelihood deviates too far between the inference and training policies are simply excluded from optimization, while accepted samples are treated as effectively on-policy. Concretely, the policy update is driven by
\[\mathcal{L}_{\text{actor}} = - \mathbb{E}_{\tau \sim \pi_{\theta_\text{vllm}}} \left[ \mathbb{I}(\tau) \sum_{t} \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right],\]
where the binary indicator \(\mathbb{I}(\tau)\) filters out off-distribution samples. This design dramatically reduces gradient variance and enables stable, long-horizon optimization without aggressive clipping.
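A minimal PyTorch sketch of the filtering idea, assuming per-token log-probs recorded from both the training policy and the inference engine; the acceptance threshold and the trajectory-level averaging are placeholder choices, not the exact recipe.

```python
import torch

def mis_po_loss(train_logprobs, infer_logprobs, advantages, mask, log_ratio_threshold=1.0):
    """Filtered policy-gradient loss in the spirit of MIS-PO (illustrative sketch).

    train_logprobs / infer_logprobs: (batch, seq) log-probs of the sampled actions
        under the training policy and the inference engine's policy.
    advantages: (batch, seq) advantage estimates; mask: (batch, seq) float valid-token mask.
    The importance ratio is used only as a binary accept/reject test per trajectory,
    never as a gradient scaling factor; log_ratio_threshold is a placeholder hyperparameter.
    """
    with torch.no_grad():
        # Sequence-level log importance ratio, computed without gradients.
        log_ratio = ((train_logprobs - infer_logprobs) * mask).sum(dim=-1)
        accept = (log_ratio.abs() <= log_ratio_threshold).float()   # binary indicator I(tau)

    # Accepted trajectories are treated as effectively on-policy: plain log-prob * advantage.
    per_token = -train_logprobs * advantages * mask
    per_traj = per_token.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return (accept * per_traj).sum() / accept.sum().clamp(min=1)
```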
Our framework also includes truncation-aware value bootstrapping, which prevents long reasoning trajectories from being incorrectly penalized when they hit context limits, and routing-confidence monitoring for Mixture-of-Experts models, which provides a practical signal for RL stability at scale.
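A minimal sketch of the truncation-aware bootstrapping idea under standard discounted-return assumptions: a rollout that merely hits the context limit bootstraps from the critic's value at its final state instead of being scored as if it had genuinely terminated. Function and variable names are illustrative.

```python
def discounted_returns(rewards, values, truncated, gamma=1.0):
    """Compute returns for one trajectory (illustrative sketch).

    rewards:   per-step rewards r_0 .. r_{T-1}
    values:    value estimates V(s_0) .. V(s_T), one extra for the state after the last step
    truncated: True if the rollout ended by hitting the context limit rather than finishing
    """
    # If the rollout was merely truncated, bootstrap from V(s_T) so the tail of the
    # reasoning chain is not scored as if the episode had ended with zero value.
    tail = values[len(rewards)] if truncated else 0.0
    returns, g = [], tail
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```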
Together, these components turn reinforcement learning into a reliable engine for continuous self-improvement, enabling consistent gains across mathematics, coding, and tool use, while remaining stable under large-scale, off-policy training.