Google shipped Gemma 4 yesterday under Apache 2.0, and while the 31B dense model grabs the headlines, the real story is the 26B Mixture-of-Experts variant that only fires 4 billion parameters per token. I pulled the GGUFs, ran some comparisons, and here's what actually matters.
## The Lineup
Four sizes, all built on Gemini 3 technology:
| Model | Total Params | Active Params | Context | Target Hardware |
|---|---|---|---|---|
| E2B | 2B | 2B | 128K | Phones |
| E4B | 4B | 4B | 128K | Laptops, edge devices |
| 26B-A4B | 26B | 4B | 256K | Consumer GPUs |
| 31B | 30.7B | 30.7B | 256K | 16GB+ VRAM GPUs |
The E2B and E4B target on-device use — Android phones, Chromebooks, that sort of thing. The 31B is a full dense transformer. But the 26B-A4B is the interesting one.
Google borrowed from Mistral's playbook here. The MoE architecture routes each token through a 4B active parameter subset, so you get inference speed comparable to a 4B model with reasoning quality that approaches the 31B. Multimodal out of the box — text, images, audio. The vocabulary is massive at 262K tokens covering 140+ languages.
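Google hasn't published the router internals, but a standard top-k MoE layer captures the idea: a small gating network scores every expert, only the top k actually run, and their outputs are blended by the gate's softmax weights. A minimal sketch (expert count, k, and dimensions here are illustrative toys, not Gemma 4's actual configuration):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through a top-k mixture-of-experts layer.

    x       : (d,) token hidden state
    gate_w  : (d, n_experts) router weights
    experts : list of callables, each standing in for an expert FFN (d,) -> (d,)
    k       : number of experts activated per token
    """
    logits = x @ gate_w                       # one router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                              # softmax over the selected experts only
    # Only the chosen experts execute -- this is where the compute savings come from,
    # even though every expert's weights still sit in memory.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" is just a random linear map standing in for a small FFN.
mats = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda x, m=m: m @ x for m in mats]

y = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
print(y.shape)
```

With 2 of 8 experts active, the per-token FLOPs scale with the active subset, which is why a 26B-total model can generate at roughly 4B-model speed.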
## The Benchmarks Tell a Specific Story
Gemma 3 was decent but never felt like it could go toe-to-toe with Qwen or Llama at the same parameter count. That just changed.
The jump on AIME 2025 is absurd: Gemma 3 scored 20.8%; Gemma 4 hits 89.2%. Not incremental — generational. GPQA Diamond nearly doubled. BigBench Extra Hard went from 19% to 74%. Codeforces Elo leapt from 110 to 2150.
For context, Qwen 3.5-27B scored 48.7% on AIME 2025. The new model at roughly the same parameter count nearly doubles that. On math and hard reasoning, Google just vaulted to the front of the open-weight pack.
But here's the nuance that benchmark roundups tend to skip: Qwen 3.5 still leads on coding. LiveCodeBench and SWE-bench show clear margins for the Qwen family. If your primary use case is code generation or tool calling, Qwen remains the better foundation. And if you need absurd context lengths, Llama 4 Scout's 10-million-token window occupies a category by itself.
## Running It
Ollama had day-zero support. The commands are what you'd expect:
```shell
# The MoE — recommended for most people
ollama run gemma4:26b

# The full dense model
ollama run gemma4:31b

# Lightweight for laptops
ollama run gemma4:e4b
```
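If you'd rather script against the local server than use the CLI, Ollama exposes a REST API on port 11434. A minimal sketch using only the standard library (the `gemma4:26b` tag mirrors the commands above and assumes the model has already been pulled; `num_ctx` is the real Ollama option for the context window to allocate):

```python
import json
import urllib.request

def build_request(model, prompt, num_ctx=8192, host="http://localhost:11434"):
    """Build the URL and POST body for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON response instead of a token stream
        "options": {"num_ctx": num_ctx},  # context window to allocate for this call
    }
    return f"{host}/api/generate", json.dumps(payload).encode()

def generate(model, prompt, **kw):
    """Send the request to a running Ollama server and return the generated text."""
    url, body = build_request(model, prompt, **kw)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

url, body = build_request("gemma4:26b", "Why is the sky blue?")
print(url)
# generate() requires a live server, so it's left uncalled here:
# print(generate("gemma4:26b", "Why is the sky blue?"))
```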
VRAM reality check for each variant:
- **E4B:** Anything with 10+ GB handles it. Apple Silicon unified memory works fine.
- **26B MoE (Q4_K_M):** Budget around 20 GB of total memory. On an RTX 4090 with 24 GB it fits comfortably, and the 4B active-parameter footprint keeps generation snappy.
- **31B dense (Q4_K_M):** A tight squeeze in 24 GB of VRAM. A 4090 can do it, but spilling into system RAM tanks generation speed by 3-10x. You really want headroom here.
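The arithmetic behind those budgets is simple enough to sketch. One caveat that trips people up: an MoE saves compute, not weight memory — all 26B parameters must be resident even though only 4B fire per token. Q4_K_M averages roughly 4.85 bits per weight in llama.cpp's mixed-block scheme, and the flat overhead figure below is my own rough allowance for KV cache and runtime buffers, not a published spec:

```python
def gguf_vram_gb(total_params_b, bits_per_weight=4.85, overhead_gb=2.0):
    """Rough memory estimate for running a quantized GGUF.

    total_params_b : model size in billions of parameters (an MoE counts ALL
                     experts -- every weight must be loaded, even though only
                     the active subset runs per token)
    bits_per_weight: ~4.85 on average for Q4_K_M (mixed 4- and 6-bit blocks)
    overhead_gb    : flat allowance for KV cache, activations, and runtime
    """
    weights_gb = total_params_b * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

for name, params in [("26B-A4B", 26.0), ("31B", 30.7)]:
    print(f"{name}: ~{gguf_vram_gb(params):.1f} GB")
```

Both variants land in the high teens to low twenties of gigabytes, which is why the MoE sits comfortably on a 24 GB card while the dense model is a squeeze once you add a long context.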
Unsloth already published GGUFs on HuggingFace — unsloth/gemma-4-31B-it-GGUF and unsloth/gemma-4-26B-A4B-it-GGUF. NVIDIA has an NVFP4 quantization of the 31B optimized for RTX cards at nvidia/Gemma-4-31B-IT-NVFP4.
The MoE is the sweet spot for most setups. Near-31B reasoning, 4B inference cost, fits on consumer hardware without drama.
## Apache 2.0 Is the Quiet Bombshell
This matters more than any benchmark number. Gemma 3 came with Google's permissive-but-not-quite license that spooked legal teams. Gemma 4 ships under full Apache 2.0 — identical terms to Qwen 3.5 and most of the HuggingFace ecosystem.
No monthly active user caps (Llama 4's community license draws the line at 700M MAU). No clauses about competing products. Fine-tune it, quantize it, deploy it commercially, embed it in your product — no lawyer required.
For startups and enterprises evaluating open models, this removes the last friction point that kept the Gemma family off shortlists. The model is now as legally simple to adopt as Qwen.
## Pick Your Workload, Not Your Leaderboard
The open-weight landscape in the 25-35B range has real specialization now:
- **Math and reasoning:** Gemma 4 26B-A4B. The AIME results are genuine, and the MoE efficiency makes it practical to actually deploy.
- **Coding:** Qwen 3.5. The LiveCodeBench and SWE-bench margins are clear. If you're building copilots or code agents, start here.
- **Long context:** Llama 4 Scout. Ten million tokens; nothing else is close. Essential when you need to ingest entire codebases or massive document sets.
Chasing a single model that tops every benchmark average is a trap. Third place on a leaderboard composite might mean first place at the specific thing you need.
## What I'd Actually Deploy
The 26B MoE earns a spot in my daily rotation alongside Qwen 3.5 32B. The Google model handles reasoning-heavy and multimodal tasks; Qwen takes code generation and tool use. Both fit on a single 4090 — just not at the same time.
The E4B is going on my laptop. Ten gigs of memory buys you a genuinely capable multimodal model with a 128K context window. For something originally designed to run on a phone, that's a ridiculous amount of capability.