# Sonnet 4.6 Released: Benchmark Breakdown
Anthropic released Sonnet 4.6 today. Here's what changed and why it's worth paying attention to.

## The biggest jump: novel problem-solving

ARC-AGI-2 measures how well a model can reason through problems it hasn't seen before (generalization, not memorization).

- Sonnet 4.5: 13.6%
- Sonnet 4.6: 58.3%
- Increase: +44.7 percentage points

That's the largest single-generation improvement in the table by a wide margin.

## Agentic benchmarks

The benchmarks most relevant to tool use and automation all improved significantly:

- Agentic search (BrowseComp): 43.9% → 74.7% (+30.8pp)
- Scaled tool use (MCP-Atlas): 43.8% → 61.3% (+17.5pp)
- Agentic computer use: 61.4% → 72.5% (+11.1pp)
- Terminal coding: 51.0% → 59.1% (+8.1pp)

## Sonnet 4.6 vs. Opus 4.5

Worth noting: Sonnet 4.6 now outperforms Opus 4.5 on several benchmarks:

- Novel problem-solving: 58.3% vs. 37.6%
- Agentic search: 74.7% vs. 67.8%
- Agentic computer use: 72.5% vs. 66.3%

Sonnet is the smaller, cheaper model tier, so this shifts the cost/performance equation for anyone building agentic workflows.

## What this means practically

If you're building with tool use, MCP integrations, or multi-step AI workflows, the MCP-Atlas and BrowseComp improvements are the ones to watch. Models that reliably use tools and follow through on multi-step tasks open up a lot of what was previously too brittle to ship.
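The multi-step workflows these benchmarks measure all reduce to the same loop: the model emits a structured tool call, the harness executes it, and the result is fed back as the next message. Here's a minimal sketch of that dispatch step; the message shapes are loosely modeled on Anthropic's tool-use format, but the tool itself, its name, and the simulated model reply are all hypothetical, and no real API is called:

```python
# Minimal sketch of the tool-use dispatch loop that agentic benchmarks
# (MCP-Atlas, BrowseComp) exercise. Everything here is illustrative:
# `get_weather` is a hypothetical tool, and `reply` stands in for a
# model turn that requested a tool call.

def get_weather(city: str) -> str:
    """Hypothetical tool: a real agent would call an external service here."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def run_tool_calls(model_reply: list) -> list:
    """Execute each tool_use block and package results for the next turn."""
    results = []
    for block in model_reply:
        if block.get("type") != "tool_use":
            continue  # skip plain text blocks
        fn = TOOLS[block["name"]]
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": fn(**block["input"]),
        })
    return results

# Simulated model turn requesting one tool call:
reply = [{"type": "tool_use", "id": "t1", "name": "get_weather",
          "input": {"city": "Lisbon"}}]
print(run_tool_calls(reply))
# → [{'type': 'tool_result', 'tool_use_id': 't1', 'content': 'Sunny in Lisbon'}]
```

The reliability gains these benchmarks track show up precisely in this loop: whether the model consistently produces well-formed tool calls, and whether it uses the returned `tool_result` correctly on the next turn.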