StepFun launches open-source Step 3.5 Flash model

Step 3.5 Flash, a new large language model from StepFun, is now available for deployment. The release targets developers and organizations that need fast, cost-efficient inference, particularly those that require local, private execution. The model is publicly available and can be tested on NVIDIA’s accelerated infrastructure or run locally on compatible hardware, including Apple M4 Max, NVIDIA DGX Spark, and AMD AI Max+ 395 workstations.

February is already starting crazy.

StepFun just dropped Step 3.5 Flash and it’s genuinely a serious open-source model.

Sparse MoE (196B total / 11B activated) + built for fast reasoning & agents.
256K context, 100-300 tok/s throughput, and the benchmark chart looks… pic.twitter.com/emyWDf0lns
— CHOI (@arrakis_ai) February 2, 2026

The architecture uses a sparse Mixture-of-Experts (MoE) backbone that activates only 11B of its 196B total parameters per token, which cuts per-token compute relative to a dense model of the same size, although all of the weights still have to be held in memory. Step 3.5 Flash combines a hybrid sliding-window/full attention scheme with multi-token prediction heads, which let several output tokens be verified in parallel and support up to 350 tokens per second on NVIDIA Hopper GPUs. INT4-quantized weights in GGUF format and an INT8 KV cache allow local inference with context windows of up to 256K tokens, a length more commonly associated with cloud-hosted models.
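To put the parameter counts in perspective, here is a rough back-of-envelope sketch. The 196B and 11B figures come from the article; the bytes-per-weight values are idealized assumptions that ignore quantization scales, file metadata, and the KV cache, and the helper names are purely illustrative.

```python
# Back-of-envelope figures for a sparse MoE model.
# Parameter counts are taken from the article; bytes-per-weight values are
# idealized assumptions (no quantization scales, metadata, or KV cache).

TOTAL_PARAMS = 196e9   # total parameters (from the article)
ACTIVE_PARAMS = 11e9   # parameters activated per token (from the article)

# Assumed storage cost per weight at each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def weight_footprint_gb(params: float, bytes_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_weight / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active fraction per token: {frac:.1%}")
    for name, nbytes in BYTES_PER_WEIGHT.items():
        gb = weight_footprint_gb(TOTAL_PARAMS, nbytes)
        print(f"{name}: ~{gb:.0f} GB of weights")
```

Under these assumptions, roughly 5.6% of the parameters are active per token, while the full set of weights takes about 98 GB at INT4. Sparse activation lowers the compute per token, but it does not shrink the stored model.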

For an artistic weather dashboard that feels like a pilot's glass cockpit, create a 3D real Earth rendered via WebGL. Each country's major cities should have a glowing marker; clicking it zooms in on that region, shifting to a semi-transparent 2D overlay with detailed weather charts. Real-time data streamed via WebSockets with graceful fallback to cached snapshots.

The company’s new reinforcement learning framework, MIS-PO, addresses training-inference mismatches and off-policy drift, stabilizing long-horizon optimization. This approach, combined with truncation-aware value bootstrapping and routing confidence monitoring, provides robust training for advanced reasoning and agentic tasks.

💡 You can test Step 3.5 Flash on StepFun AI.

Early technical feedback highlights the model's rapid throughput and efficient local deployment, with experts noting its potential to shift LLM workloads away from cloud dependence for sensitive applications.
