Most agentic coding models worth running require hardware that costs more than a used car. Devstral Small 2 makes a different bet — 24 billion parameters, Apache 2.0, 68% on SWE-bench Verified, and it loads on a single consumer GPU. Mistral is arguing that local agentic coding doesn't need to be a four-GPU affair.

What This Model Actually Is

Dense transformer. Not MoE — what you download is what runs. 256K-token context window. Mistral trained it on roughly twelve terabytes of code-heavy data, then applied multi-stage RLHF targeting agentic workflows: codebase navigation, multi-file editing, dependency tracking.

Its bigger sibling, Devstral 2, weighs 123B and demands a four-GPU H100 cluster. The Small variant keeps most of the benchmarks at a fraction of the compute.

68% In Context

Here's where Devstral Small 2 lands among models you can actually self-host:

Model              Parameters             SWE-bench Verified   Hardware Floor        License
Devstral 2         123B dense             72.2%                4×H100                Modified MIT
Devstral Small 2   24B dense              68.0%                RTX 4090 / Mac 32GB   Apache 2.0
Codestral 2        22B dense              n/a                  RTX 3060 12GB         Apache 2.0
Qwen3-Coder-Next   3B active (62B total)  n/a                  8GB VRAM              Apache 2.0

Don't confuse it with Codestral 2, despite the similar parameter counts. Codestral is Mistral's IDE completion model — 95.3% FIM pass@1, best in class for tab-complete suggestions. Devstral is the agentic side: plan, navigate, edit across files. You'd run Codestral inside your editor and Devstral in your terminal.
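The split is easiest to see in the shape of the request each model expects. A minimal sketch — the FIM sentinel token names here are illustrative placeholders, not Codestral's actual vocabulary:

```python
# Illustrative only: FIM sentinel token names vary by model and are an
# assumption here. A completion model fills the hole between prefix and suffix:
fim_request = {
    "prompt": (
        "<fim_prefix>def mean(xs):\n    return <fim_suffix>\n\n"
        "print(mean([1, 2, 3]))<fim_middle>"
    )
}

# An agentic model instead gets a goal plus tools and decides what to touch:
agent_request = {
    "task": "Extract a mean() helper and update every call site",
    "tools": ["read_file", "write_file", "search", "shell"],
}
```

One fills a single hole at the cursor; the other plans its own reads and writes. That's the whole editor-vs-terminal divide in two dicts.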

The gap between Small (68.0%) and full Devstral 2 (72.2%) is 4.2 points. The gap in hardware cost is roughly 5×. For most teams, that math resolves itself.

Running It

With Ollama:

ollama pull devstral-small-2
ollama run devstral-small-2

The Q4_K_M quantization occupies around 14GB of memory — fits on a 32GB Mac with room to spare. Q8_0 pushes to ~24GB, so that's RTX 4090 territory. On a 16GB machine, the Q4 variant technically loads but your effective context window shrinks dramatically. Budget 32GB minimum for serious use.
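The back-of-envelope math behind those figures, assuming typical llama.cpp effective bits-per-weight for each quantization (the exact values vary slightly with the tensor mix, so treat these as estimates):

```python
PARAMS = 24e9  # Devstral Small 2, dense: every parameter loads

def weights_gb(params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB. KV cache and runtime overhead come on
    top, which is why a 16GB machine leaves so little room for context."""
    return params * bits_per_weight / 8 / 1e9

q4_k_m = weights_gb(PARAMS, 4.85)  # ~14.6 GB -- the ~14GB figure above
q8_0 = weights_gb(PARAMS, 8.5)     # ~25 GB -- 4090 / 32GB Mac territory
```

The same function explains why the 123B Devstral 2 needs a multi-GPU node: even at 4-bit it's north of 60GB of weights before any context.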

The more interesting deployment pairs it with Mistral Vibe, their open-source terminal agent:

pip install mistral-vibe
mistral-vibe --model devstral-small-2 --profile accept-edits

Vibe 2.0 ships with file manipulation, stateful shell execution, recursive code search, git integration, and MCP support. Four profiles out of the box: default (asks permission for every tool call), plan (read-only exploration), accept-edits (auto-approves file writes), auto-approve (full autonomy). Start with accept-edits until you've calibrated your trust.
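The escalation ladder can be sketched as an approval policy. This is a conceptual model, not Vibe's implementation — in particular, whether accept-edits still prompts on shell commands is an assumption here:

```python
# Conceptual approval policy implied by the four profiles (not Vibe source).
# "ask" = prompt the user, "auto" = proceed, "deny" = refuse the tool call.
PROFILES = {
    "default":      {"read": "ask",  "write": "ask",  "shell": "ask"},
    "plan":         {"read": "auto", "write": "deny", "shell": "deny"},
    "accept-edits": {"read": "auto", "write": "auto", "shell": "ask"},   # shell behavior assumed
    "auto-approve": {"read": "auto", "write": "auto", "shell": "auto"},
}

def gate(profile: str, tool: str) -> str:
    """Decide what happens to a tool call under a given profile."""
    return PROFILES[profile][tool]
```

Under this reading, accept-edits lets file writes through untouched while anything riskier still prompts — which is exactly why it's the sane starting point.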

Subagent delegation is the standout addition — Vibe can spawn child agents for focused subtasks, keeping the main context clean. Genuinely useful for larger repos where context pollution kills coherence fast.
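Why delegation helps is easy to sketch. The mechanism below is a conceptual illustration, not Vibe's actual API: the child works in a fresh context, and only its final summary flows back to the parent.

```python
# Conceptual sketch of subagent delegation (not Vibe's actual API).
def run_agent(task: str, context: list[str]) -> str:
    """Stand-in for a model loop: fills its context with exploration steps,
    then returns a compact summary."""
    context.append(f"...many tool calls and file reads for: {task}...")
    return f"summary of: {task}"

def delegate(parent_context: list[str], subtask: str) -> None:
    child_context: list[str] = []              # fresh context, no parent history
    summary = run_agent(subtask, child_context)
    parent_context.append(summary)             # only the summary comes back

parent = ["main task plan"]
delegate(parent, "map the import graph of pkg/")
# parent grew by one summary line, not by the child's full exploration trace
```

The parent's context grows by one line per subtask instead of absorbing every intermediate tool call — that's the context-pollution defense in miniature.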

Where the 32% Hurts

Single-file work is solid. Bug fixes, refactoring, test generation, code review — Devstral Small 2 handles these reliably. Ask it to add pytest coverage for a module and the generated tests run without hand-editing more often than not.

Multi-file orchestration in large codebases is where the 24B ceiling shows. Import chains spanning five-plus files, cross-module side effects, and architecture decisions that require holding the full dependency graph in working memory — these push past what the model tracks reliably.

Where GLM-5.1 was built for eight-hour autonomous sessions backed by its 744B MoE architecture, Devstral Small 2 loses the thread after extended multi-step chains. It's a focused sprinter, not a marathoner.

The License Angle

Devstral Small 2 ships Apache 2.0. No user caps, no commercial restrictions, no "non-production" caveats. Fine-tune on proprietary code, embed in a commercial product, deploy in an air-gapped environment — all permitted. The full Devstral 2 uses modified MIT, which adds restrictions. If you need unrestricted rights, the Small variant is the only Devstral option.

68% on SWE-bench from your own hardware, zero API calls, zero data exfiltration risk. The ceiling is real — complex multi-file orchestration still needs heavier models or human oversight. But for the focused, single-session coding tasks that fill most of a developer's day, this is the most capable model you can run without a cloud account.