Qwen's latest coding model has 80 billion parameters and uses 3 billion of them. The other 77 billion sit there during inference doing absolutely nothing for any given token. On paper, this sounds like a waste of silicon. In practice, it just matched Claude Opus 4.5 on SWE-bench Verified.

The Architecture Nobody Expected

Qwen3-Coder-Next doesn't look like anything else on HuggingFace right now. The MoE layer isn't the coarse-grained routing you know from Mixtral's pick-2-of-8 or Llama 4 Scout's 16-expert setup. It's 512 experts per layer, 10 selected per token, plus one shared expert that fires every time.

That alone would be notable. But the attention mechanism is where things get strange. The layers alternate between two types of attention in a fixed repeating pattern: three blocks of Gated DeltaNet (a linear attention variant) followed by one block of standard Gated Attention.

12 × [3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)]

Forty-eight layers total. Three-quarters use linear attention, scaling linearly with sequence length instead of quadratically. The remaining quarter uses full attention to maintain quality on tasks that need exact token-to-token recall. That ratio is what makes the 256K context window practical — quadratic attention across a quarter million tokens on consumer hardware would be brutal.
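The layer schedule above can be written out explicitly. A quick sketch (the type names are illustrative labels, not the actual module names):

```python
# Build the 48-layer attention schedule: 12 repetitions of
# [DeltaNet, DeltaNet, DeltaNet, GatedAttention], each layer followed by MoE.
BLOCK = ["gated_deltanet"] * 3 + ["gated_attention"]
layers = BLOCK * 12

assert len(layers) == 48
print(layers.count("gated_deltanet"), layers.count("gated_attention"))  # → 36 12

# Why the ratio matters at 256K context: attention-score work per layer
# scales quadratically with sequence length for full attention,
# linearly for the DeltaNet layers.
seq_len = 262_144
print(f"{(seq_len ** 2) / seq_len:,.0f}x")  # per-layer gap → 262,144x
```

Thirty-six of the forty-eight layers take the linear path, which is why the long-context bill stays payable.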

Benchmarks Worth Trusting

SWE-bench Verified, the version where humans confirmed each test actually validates the fix: over 70%, using the SWE-Agent scaffold. Claude Opus 4.5 lands in the same neighborhood. Three billion active parameters competing with the best proprietary coding models on real-world bug fixing.

SWE-bench Pro, the harder variant: 44.3%. Not frontier-leading, but again — 3B active.

The throughput angle matters more than raw accuracy for agent workflows. Because only 3B params activate per token, you're doing roughly the compute of a small dense model while pulling from a much larger knowledge pool. Qwen claims 10x throughput over comparable-quality dense architectures for repo-level tasks. Real-world numbers with vLLM batching land closer to 6–8x. Still enormous.
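A back-of-envelope check on that claim, using the standard rule of thumb that a transformer forward pass costs roughly 2 FLOPs per active parameter per token:

```python
# Rule of thumb: forward-pass FLOPs per token ≈ 2 * (active) parameters.
active_params = 3e9    # Qwen3-Coder-Next, per token
dense_params = 70e9    # a comparable dense model

flops_moe = 2 * active_params
flops_dense = 2 * dense_params
print(f"{flops_dense / flops_moe:.0f}x")  # theoretical per-token gap → 23x
```

The theoretical ~23x gap versus a dense 70B shrinks to the observed 6–8x once memory bandwidth for streaming expert weights and routing overhead enter the picture.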

Actually Running the Thing

Eighty billion parameters means eighty billion parameters of memory, even when most of them are idle during any given forward pass. Every expert has to live somewhere.

Quant     Size     RAM/VRAM   Notes
Q8_0      ~85 GB   ~90 GB     Servers, dual A100
Q4_K_M    ~46 GB   ~48 GB     64 GB Mac or dual consumer GPU
Q3_K_M    ~38 GB   ~40 GB     Tight fit on 48 GB unified
IQ2_XL    ~28 GB   ~30 GB     Minimum viable, noticeable quality loss
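The sizes in that table fall out of simple arithmetic: 80B weights times an average bits-per-weight for each quant. The bits-per-weight values below are rough averages chosen to match the table, not official figures:

```python
# Approximate on-disk size = parameter count * average bits-per-weight / 8.
# BPW values here are approximations for each llama.cpp quant family.
PARAMS = 80e9
BPW = {"Q8_0": 8.5, "Q4_K_M": 4.6, "Q3_K_M": 3.8, "IQ2_XL": 2.8}

for name, bits in BPW.items():
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB")  # Q8_0: ~85 GB, Q4_K_M: ~46 GB, ...
```

The extra few GB of RAM/VRAM in the table's third column covers the KV cache and runtime overhead on top of the weights.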

MoE models handle CPU/GPU splitting better than dense ones. Sparse experts can sit in system RAM and page in on demand while the shared expert and dense layers stay on the GPU. Both KTransformers and llama.cpp handle this gracefully.

# Ollama (v0.15.5+ required for hybrid attention)
ollama pull qwen3-coder-next
ollama run qwen3-coder-next

# vLLM production serving — two A100s at full precision
vllm serve Qwen/Qwen3-Coder-Next \
  --tensor-parallel-size 2 \
  --max-model-len 65536

On older Ollama versions, the model simply fails to load with a cryptic architecture error. Upgrade first, debug second.
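Once the vLLM server is up, it speaks the OpenAI-compatible API, so any standard client works. A minimal standard-library sketch — the port is vLLM's default, and the prompt is illustrative:

```python
import json
import urllib.request

# vLLM serves an OpenAI-compatible chat endpoint, by default on port 8000.
payload = {
    "model": "Qwen/Qwen3-Coder-Next",
    "messages": [
        {"role": "user", "content": "Explain what this stack trace means."}
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with a running server
```

Low temperature is the usual choice for bug fixing; crank it up only when you want diverse retries in an agent loop.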

What You're Actually Paying For

The 512-expert design makes the model card misleading to anyone who reads "80B" and thinks about compute. Per-token FLOPs are genuinely 3B-class. Your electricity bill, your latency, your GPU utilization — they all reflect the smaller number.

The tradeoff is memory footprint. A 64 GB M4 Max MacBook can run Q4_K_M with room for your IDE and a browser, and inference speed feels like running a small model. The bottleneck is loading, not computing.

Cloud economics get interesting. You need enough VRAM to host the full expert pool, but FLOPs per request stay low. High-concurrency serving is where this architecture pays off — experts are shared across requests, amortizing the memory cost over many users.

800,000 Bug Fixes

The training pipeline might matter more than the architecture. Qwen's technical report describes 800,000 verifiable coding tasks mined from real GitHub pull requests, each paired with an executable test environment. Not synthetic benchmarks. Not LeetCode problems. Actual patches with actual test suites that either pass or don't.

Reinforcement learning ran against execution results. The reward signal was binary: did the patch pass the tests? This is harder to scale than it sounds — maintaining 800K reproducible execution environments is an infrastructure problem as much as a research one. The result is a model that understands why code is broken, not just what valid code looks like.

Where It Breaks Down

Qwen3-Coder-Next won't replace Claude Opus 4.5 for complex multi-file refactoring sessions where you need the model to hold a coherent plan across dozens of steps. Three billion active parameters show up as brittleness on very long reasoning chains. You'll see it lose the thread around step 15 of a 20-step plan.

Where it dominates: single-file bug fixes, test generation, code review, and any agentic loop where retries are cheap. The throughput advantage means five attempts at a problem in the time a dense 70B model takes for one. Frame the value as "attempts per dollar" rather than "accuracy per attempt" and the math flips hard.
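The "attempts per dollar" framing is just best-of-k probability. Assuming independent attempts — a real simplification, since failures on the same problem correlate — k tries at per-attempt success rate p succeed with probability 1 − (1 − p)^k. The rates below are illustrative, not benchmark numbers:

```python
# Best-of-k success probability, assuming independent attempts
# (a simplification: real failures on the same problem are correlated).
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

# Five cheap shots can beat one expensive shot at the same total compute:
print(round(pass_at_k(0.45, 5), 3))  # → 0.95
print(round(pass_at_k(0.70, 1), 3))  # → 0.7
```

Add a verifier — run the test suite after each attempt, as in the RL pipeline above — and the retry loop gets the selection step it needs to actually cash in that probability.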

Apache 2.0. No asterisks, no "non-production" clauses, no user-count gates. Ship it wherever you want.