NVIDIA dropped Nemotron 3 Super a few weeks ago, and the discourse moved on within 48 hours. Understandable — March was a firehose of model releases. But this one deserved more attention than it got, because it's doing something genuinely weird with its architecture, and the throughput numbers should make anyone running agentic pipelines sit up.

Mamba Meets Transformer, For Real This Time

Remember when Mamba was going to kill the Transformer? That was fun. What actually happened is more interesting: NVIDIA took Mamba's linear-time sequence processing and bolted it onto Transformer attention layers in an alternating pattern. Mamba handles the long-range context cheaply — think of it as the model's "skim reading" mode. Transformer layers kick in for the parts that need precise cross-token reasoning.
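The cost asymmetry is the whole point. Here's a toy sketch of the two regimes, assuming nothing about NVIDIA's actual layer internals: a single recurrent channel stands in for the Mamba-style scan, and a raw pair count stands in for full attention.

```python
import numpy as np

def linear_scan(x, a=0.9, b=0.1):
    """One recurrent channel: h_t = a*h_{t-1} + b*x_t. Touches each token
    once, so cost is O(n) in sequence length."""
    h = 0.0
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = a * h + b * xt
        out[t] = h
    return out

def attention_pairs(n):
    """Token-pair comparisons full self-attention performs: O(n^2)."""
    return n * n

x = np.ones(8)
print(linear_scan(x))            # state accumulates: 0.1, 0.19, 0.271, ...
print(attention_pairs(8192))     # 67108864 pairs at an 8K context
```

The scan carries a compressed summary of everything it has seen; attention re-compares everything against everything. Alternating the two is a bet that most tokens only need the summary.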

The result: 120B total parameters, but only 12B active per token thanks to the MoE routing. That active parameter count puts it in roughly the same ballpark as Qwen3.5's 10B active, but the hybrid backbone fundamentally changes how tokens flow through the network. Linear attention for the boring stretches, quadratic attention when precision matters.
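To make the 12B-active-of-120B arithmetic concrete, here's a minimal sketch of generic top-k expert routing. This is the standard MoE pattern, not NVIDIA's actual router; the expert count and k here are invented for illustration.

```python
import numpy as np

def route_topk(logits, k=2):
    """Indices of the k highest-scoring experts for one token."""
    return np.argsort(logits)[-k:][::-1]

rng = np.random.default_rng(0)
n_experts, k = 64, 2                 # made-up numbers for the demo
logits = rng.normal(size=n_experts)  # router scores for one token
chosen = route_topk(logits, k)

# Only the chosen experts run a forward pass; the rest sit idle in VRAM.
print("experts chosen for this token:", chosen)
print(f"active share of weights per token: {12 / 120:.0%}")  # 12B of 120B -> 10%
```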

And it has a 1 million token context window. Natively. Not "we trained at 128K and hope it extrapolates."

The Numbers, Honestly

Nemotron 3 Super isn't going to top every leaderboard, and NVIDIA clearly knows that. On MMLU-Pro it scores 83.73% — respectable, but Qwen3.5-122B hits 86.70% and GPT-OSS-120B reaches 90.0%. On SWE-Bench Verified, Qwen also leads with 66.4% vs Nemotron's 60.47%.

But that's not the game this model is playing.

NVIDIA built a new eval called PinchBench — it tests models acting as agent brains across multi-step tool use, planning, and error recovery. Nemotron 3 Super scores 85.6% there, the highest for any open-weight model in its weight class. And on SWE-Bench Verified it still crushes GPT-OSS-120B (60.47% vs 41.90%), though it activates roughly 2.4x as many parameters per token (12B vs 5.1B) to do it.

Benchmark              Nemotron 3 Super   Qwen3.5-122B   GPT-OSS-120B
MMLU-Pro               83.73%             86.70%         90.00%
SWE-Bench Verified     60.47%             66.40%         41.90%
PinchBench (agentic)   85.6%              n/a            n/a
Active params          12B                10B            5.1B
Context window         1M                 128K           128K

The pattern: Nemotron trades some points on static knowledge benchmarks for dominance in agentic tasks. Building chat apps? Qwen or GPT-OSS might fit better. Building autonomous agents that hold a million tokens of context and use tools reliably? This is the one.

7.5x Throughput Is Not a Typo

Here's the stat that should stop the scroll. Nemotron 3 Super achieves 2.2x higher inference throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B, measured on 8K input / 16K output sequences.

That's not incremental. That's a different category. The Mamba layers aren't an architectural curiosity — they're pulling real weight. Linear-time sequence modeling means the model doesn't choke on long contexts the way pure Transformers do. When your agent accumulates tool call results, web scrapes, and intermediate reasoning across dozens of steps, that million-token window at 7.5x throughput changes what's actually feasible to build. An agentic pipeline that would cost $50 per run in API calls against a pure-Transformer backend becomes something you can serve yourself, at a fraction of the cost and latency.
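The scaling argument behind that gap fits in two functions. This ignores constants, batching, and kernel details entirely; it only shows why the quadratic term dominates as context grows.

```python
def quadratic_cost(n):
    """Token-pair comparisons in full self-attention."""
    return n * n

def linear_cost(n):
    """State updates in a recurrent (Mamba-style) scan."""
    return n

n_short, n_long = 8_192, 1_000_000
print(f"8K -> 1M tokens: attention work grows "
      f"x{quadratic_cost(n_long) / quadratic_cost(n_short):,.0f}, "
      f"scan work grows x{linear_cost(n_long) / linear_cost(n_short):,.0f}")
```

Going from 8K to 1M tokens, the quadratic term grows by roughly 14,900x while the linear term grows by about 122x. A hybrid only pays the quadratic price on the layers that need it.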

This is where the hybrid architecture stops being a research flex and starts being an engineering advantage.

Hardware Reality

Despite "only" 12B active parameters, every expert weight lives in memory whether it fires or not. The full model is 120B parameters. At 4-bit quantization, expect 64–72 GB of VRAM.

What that means in practice:

  • Single H100 (80GB): Works fine

  • Single A100 (80GB): Tight but doable

  • 2x RTX 4090 (48GB total): No — you need more

  • Mac Studio M4 Ultra (192GB unified): Comfortable, though Mamba layer support in llama.cpp is still maturing
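The 64–72 GB figure is easy to sanity-check. Here's a back-of-envelope estimate, where the 15% overhead for KV cache, activations, and runtime buffers is my assumption rather than a measured number.

```python
def vram_gb(params_billion, bits_per_weight, overhead=0.15):
    """Rough VRAM estimate: quantized weights plus a flat overhead fraction.
    The 15% default is an assumption for illustration, not a measured figure."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb * (1 + overhead)

print(f"{vram_gb(120, 4):.0f} GB")  # 60 GB of 4-bit weights -> ~69 GB with overhead
```

The same arithmetic puts a 4-bit Nemotron 3 Nano (30B total) near 17 GB, consistent with the fits-in-24GB claim below.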

Ollama has it in the library now:

ollama run nemotron-3-super

Be honest about your hardware before pulling 70+ GB of model weights onto your machine. If you want the same architectural philosophy in something you can actually run at home, NVIDIA also shipped Nemotron 3 Nano — 30B total, 3B active, fits in 24GB. It targets the same agentic workload profile. Not the same league as Super, but a legitimate option for local development and prototyping agent flows before scaling up.

For production, vLLM supports it natively, and NVIDIA's TensorRT-LLM runtime is — unsurprisingly — well-optimized for their own model.

Why This Matters Beyond Another Model Drop

The boring take: another big model with impressive benchmarks, moving on. The more interesting take: Nemotron 3 Super is the first production-quality model where Mamba layers demonstrably improve throughput at scale rather than just showing promise on toy benchmarks. And it's purpose-built for agent workloads, not retro-fitted for them after the fact.

If the agentic AI direction continues — and every signal says it will — the models that win won't be the ones with the highest MMLU scores. They'll be the ones that hold a million tokens of context without melting the GPU, call tools accurately, and recover from their own mistakes gracefully. Nemotron 3 Super is NVIDIA's bet on that future. The throughput numbers suggest it's not a bad one.