I’ve been testing different models for agentic workflows lately, and I just came across a new release that tackles a huge bottleneck: the speed-vs-intelligence trade-off.
It’s called Step 3.5 Flash by StepFun.
Usually, if you want a "smart" model (for coding or complex reasoning), you have to put up with high latency. If you want speed, you lose intelligence.
This model uses a Sparse Mixture-of-Experts (MoE) architecture to fix that.
Here are the specs that matter for us builders:
- Huge Brain, Light Footprint: It has 196B total parameters but only activates 11B per token.
- Insane Speed: It hits 350 tokens per second for coding tasks.
- Agent-First: It scored 74.4% on SWE-bench Verified, a sign it’s optimized for tool use and executing code, not just chatting.
- Runs Locally: You can actually run the Int4 version on a Mac Studio or a solid local rig using llama.cpp.
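To see why the Int4 version is even feasible on a Mac Studio, here’s my own back-of-envelope arithmetic (the function and numbers are my assumptions, not official figures; it ignores KV cache, activations, and quantization overhead):

```python
def int4_weights_gb(total_params_b: float, bytes_per_param: float = 0.5) -> float:
    """Approximate weight storage in GB at Int4 (~0.5 bytes per parameter).

    Ignores KV cache, activation memory, and quantization metadata overhead.
    """
    return total_params_b * 1e9 * bytes_per_param / 1e9

total_gb = int4_weights_gb(196)   # the full MoE must still fit in memory
active_gb = int4_weights_gb(11)   # but only ~11B params are read per token

print(f"Int4 weights: ~{total_gb:.0f} GB total, ~{active_gb:.1f} GB touched per token")
```

So all ~98 GB of weights need to sit in (unified) memory, but each token only touches a few GB of experts, which is what makes the speed possible.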
If you are building agents that need to "think and act" in real-time without paying API costs or eating round-trip latency, this is definitely worth a look.
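For a sense of what that 350 tok/s claim means in practice for an agent loop, here’s a quick sketch (assumption on my part: sustained decode speed, ignoring prompt processing and network time):

```python
def seconds_for(tokens: int, tok_per_s: float = 350.0) -> float:
    """Time to decode `tokens` at a sustained `tok_per_s` generation rate."""
    return tokens / tok_per_s

# Typical agent-step outputs at the claimed 350 tok/s:
print(f"500-token tool call:  ~{seconds_for(500):.2f}s")
print(f"2000-token code patch: ~{seconds_for(2000):.2f}s")
```

Sub-2-second tool calls are roughly where a "think and act" loop starts to feel interactive.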
Has anyone else tried running this locally yet? I’d love to see what kind of throughput you're getting.