Mistral just shipped a model that replaces your instruct endpoint, your reasoning pipeline, and your vision stack, and the whole thing runs on the same per-token inference budget as a dense 7B model.
One Checkpoint, Three Jobs
Mistral Small 4 fuses three previously separate model families into a single set of weights: instruct chat from Mistral's main line, reasoning from Magistral, and agentic coding from Devstral. The architecture underneath: 119B total parameters spread across 128 expert subnetworks, with 4 experts (~6B parameters) active per token. Apache 2.0. 256K context window. Text and image input.
The practical upshot: stop juggling model routing logic that decides "is this a reasoning task or a chat task?" and just send everything to one endpoint.
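The simplification is easy to see in code. A minimal before/after sketch, assuming a naive task-label router (the model names and task labels here are illustrative, not an official API):

```python
# Before: a task classifier picks one of three specialized checkpoints.
# After: every request goes to the same model, so routing disappears.
# (Model names are illustrative placeholders.)

ROUTES_BEFORE = {
    "chat": "mistral-small-3-instruct",
    "reasoning": "magistral",
    "coding": "devstral",
}

def pick_model_before(task: str) -> str:
    # Misclassify the task and you serve it with the wrong checkpoint.
    return ROUTES_BEFORE[task]

def pick_model_after(task: str) -> str:
    # One checkpoint handles all three workloads.
    return "mistral-small-4"
```

The failure mode this removes is silent misrouting: a coding question classified as chat quietly gets a weaker answer, and nothing in your logs flags it.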
Where It Wins (and Where It Doesn't)
Benchmarks are noisy, but a few of these results are hard to ignore.
Output efficiency is the headline story. On the AA LCR benchmark, Small 4 scores 0.72 with just 1.6K characters of output. Qwen models need 5.8–6.1K characters for comparable scores — roughly 3.5 to 4x more text to reach the same quality. If you're paying per output token on an API, that gap is real money at scale.
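A quick back-of-envelope check of that gap, using the character counts quoted above. The per-character price is a made-up placeholder to make the difference concrete, not a real rate:

```python
# Characters of output needed for a comparable AA LCR score (from the text).
small4_chars = 1_600
qwen_chars_low, qwen_chars_high = 5_800, 6_100

ratio_low = qwen_chars_low / small4_chars    # ~3.6x more output
ratio_high = qwen_chars_high / small4_chars  # ~3.8x more output

# If output cost scales with length, a hypothetical $10 per 1M output
# characters makes the per-answer savings concrete.
price_per_char = 10 / 1_000_000
saving_per_answer = (qwen_chars_low - small4_chars) * price_per_char  # ~$0.042
```

A few cents per answer sounds trivial until you multiply by millions of requests per month.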
LiveCodeBench: Small 4 beats GPT-OSS 120B while generating 20% fewer output tokens. Fewer tokens means lower latency and lower cost. This isn't a marginal improvement when you're processing thousands of requests per hour in production.
Coding and math benchmarks: Competitive with Claude 3.5 Haiku and Qwen 2.5 on standard evaluation suites. Not frontier-crushing numbers, but genuinely impressive when you consider this thing activates just 6B parameters per forward pass. It punches well above its active weight class.
Versus its predecessor: 40% lower end-to-end latency than Mistral Small 3 in latency-optimized configurations, and 3x higher throughput measured in requests per second. The generational jump is significant — if you're running the previous version today, the upgrade case practically makes itself.
Where it falls short: raw capability against the big guns. GLM-5, Qwen3.5-397B, and the larger Llama 4 variants all outperform on absolute benchmarks. They also all have considerably more active parameters per token. Small 4 deliberately trades peak scores for deployment efficiency — and that's the whole point.
Getting It Running
119B total parameters is a lot of weights to store on disk and in memory, even though inference only touches 6B of them per forward pass.
Unquantized: Plan for 4x H100 GPUs or 2x H200s. Enterprise territory — individuals and small startups need not apply at full precision.
Quantized via GGUF: Unsloth already has Q4_K_M quantized versions on HuggingFace. At that quantization level, the model fits in roughly 70–80GB of VRAM — manageable with a dual-GPU consumer setup or a single A100 80GB. Not laptop-friendly, but within reach for anyone with a dedicated inference rig.
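A rough sanity check on those memory figures, assuming 2 bytes per parameter for BF16 and roughly 4.5 effective bits per weight for Q4_K_M (a common ballpark for that quantization scheme, treated as an assumption here):

```python
total_params = 119e9  # all experts live in memory, even though only ~6B fire per token

# Full precision: 2 bytes/param in BF16, ignoring KV cache and activations.
bf16_gb = total_params * 2 / 1e9        # ~238 GB -> needs 4x H100 80GB

# Q4_K_M: ~4.5 bits per weight effective size (assumption).
q4_gb = total_params * 4.5 / 8 / 1e9    # ~67 GB, consistent with the 70-80 GB quoted
```

The gap between the ~67 GB weight footprint and the quoted 70-80 GB is KV cache and runtime overhead, which grows with context length.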
Ollama — the path of least resistance:
ollama pull mistral-small-4
This exposes an OpenAI-compatible local API endpoint. If your stack already uses the OpenAI SDK, switching to a local deployment literally means changing one base URL. Nothing else in your application code needs to move.
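Here is what that one-URL swap looks like with nothing but the standard library, hitting Ollama's OpenAI-compatible endpoint on its default port. The actual send is guarded because it needs a running Ollama server; the prompt and model tag are illustrative:

```python
import json
import urllib.request

# The only deployment-specific piece: point at the local Ollama endpoint
# instead of api.openai.com. Everything else is the standard chat payload.
BASE_URL = "http://localhost:11434/v1"

payload = {
    "model": "mistral-small-4",
    "messages": [{"role": "user", "content": "Summarize MoE routing in one line."}],
}

def chat(base_url: str = BASE_URL) -> dict:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires `ollama serve` running
        return json.load(resp)

if __name__ == "__main__":
    print(chat()["choices"][0]["message"]["content"])
```

With the official OpenAI SDK the change is the same shape: pass `base_url` when constructing the client and leave every other call site alone.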
vLLM — for production workloads: The officially recommended inference engine. The MoE architecture plays nicely with PagedAttention, and Mistral published a companion Eagle model (Mistral-Small-4-119B-2603-eagle on HuggingFace) specifically designed for speculative decoding, which pushes throughput even higher.
Why 128 Experts Matters
Many mixture-of-experts models route over a coarse pool of 8 to 16 expert subnetworks. Going to 128 is a fundamentally different architectural bet: that ultra-fine-grained routing allows each expert to hyper-specialize. A reasoning query activates a completely different slice of the network than a code generation request or an image understanding task. The benchmark results suggest the routing works well, but the team hasn't published expert utilization statistics. Would love to see those.
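What "top-4 of 128" means mechanically can be shown with a toy router. This is a pure-Python sketch: random gate scores stand in for a learned router network, and the experts themselves are omitted entirely:

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 4

def route(gate_logits):
    """Pick the top-k experts for one token and softmax-renormalize their weights."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:TOP_K]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
weights = route(logits)  # 4 experts carry this token; the other 124 are skipped
```

The token's output is the weighted sum of just those four experts' outputs, which is why active compute stays near 6B parameters while capacity sits at 119B.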
Should You Actually Switch?
Running separate models for chat, reasoning, and vision? Migrate. One model to monitor, one set of weights to version, one VRAM budget to plan around. The operational simplification pays for the migration effort on its own.
Already on GLM-5 or Qwen3.5 for maximum raw capability? Stay where you are. Those still lead on absolute performance metrics, and they have larger active parameter counts backing that lead up.
On Mistral Small 3? Upgrade today. The throughput and latency improvements are too large to leave on the table, and the Apache 2.0 license is unchanged.
The pitch here isn't "best model ever released." It's "the one model you stop thinking about." Good enough across every task category, efficient to serve, trivial to deploy. In a landscape where someone claims a new state-of-the-art every week, boring reliability across all workloads might be the harder engineering achievement — and honestly, the more useful one for anyone trying to ship product instead of chasing leaderboards.