You spend hours picking between Q4_K_M and Q5_K_S, shaving a few hundred megabytes off your model file. Then you load a 32K-context conversation and watch your GPU memory climb right back up. The problem was never the weights.

At long context lengths, the KV (key-value) cache — the memory structure that stores attention state for every token in the sequence — dominates your VRAM budget. A 70B model at 128K context can burn through 40+ GB of KV cache alone, rivaling the quantized model weights themselves. And until last month, nearly all quantization research focused on making model files smaller while ignoring the elephant consuming your runtime memory.

Google's TurboQuant: 3 Bits, Zero Training

Google Research published TurboQuant on March 25 — a training-free vector quantization algorithm that compresses the KV cache down to 3 bits per value. The paper heads to ICLR 2026 in Rio de Janeiro in late April, but the community didn't wait. PyTorch and MLX implementations are already live on GitHub.

The pitch: 5x compression of the KV cache with 99.5% attention fidelity. No fine-tuning. No calibration data. You apply it to any transformer model post-hoc, like flipping a switch.

Here's the pipeline stripped to essentials:

Random rotation. TurboQuant randomly rotates the key and value vectors before quantizing them. This sounds bizarre, but the math is clean — rotation spreads information evenly across dimensions, turning messy real-world distributions into something a simple quantizer handles well.

Per-dimension scalar quantization. After rotation, each dimension gets independently quantized to 3 bits. Because the rotation already normalized things, this brute-force approach works far better than it has any right to.

QJL residual correction. The last 1 bit of the budget goes to a Quantized Johnson-Lindenstrauss sketch that mops up residual error. Think of it as a mathematical error-checker that eliminates the bias scalar quantization introduces on outlier values.

The entire thing is data-oblivious. No training examples, no calibration set, no per-model tuning. Rotate, quantize, correct. Done.
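To make the rotate-then-quantize idea concrete, here is a toy NumPy sketch. This is not the paper's quantizer — it uses a plain uniform scalar quantizer, omits the QJL residual step entirely, and every function name here is my own — but it shows why a random rotation helps: it spreads lopsided per-dimension scales evenly before a crude 3-bit quantizer is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR of a Gaussian matrix gives a random orthogonal matrix;
    # fixing signs via diag(R) makes the distribution uniform (Haar)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def quantize_dequantize(x, bits=3):
    # independent uniform scalar quantization per dimension
    levels = 2 ** bits
    scale = np.abs(x).max(axis=0, keepdims=True) + 1e-12   # per-dim range
    code = np.clip(np.round((x / scale + 1) * (levels - 1) / 2), 0, levels - 1)
    return (code * 2 / (levels - 1) - 1) * scale           # reconstruct

d, n = 128, 1024
keys = rng.normal(size=(n, d)) * np.linspace(0.1, 5.0, d)  # lopsided per-dim scales
R = random_rotation(d)
approx = quantize_dequantize(keys @ R, bits=3) @ R.T       # rotate, quantize, rotate back
rel_err = np.linalg.norm(approx - keys) / np.linalg.norm(keys)
print(f"relative reconstruction error at 3 bits: {rel_err:.3f}")
```

Because the rotation is orthogonal, it can be undone exactly at dequantization time; only the quantization step loses information.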

Real Numbers

Google benchmarked TurboQuant on Gemma and Mistral families. The headline: 4-bit mode on H100s delivers 8x the throughput of unquantized 32-bit keys. At 3 bits, you get roughly 5x memory reduction with quality that's virtually indistinguishable from full precision on downstream tasks.

In concrete terms — a 70B model with a 32K context window whose KV cache was eating 12 GB of VRAM? TurboQuant at 3 bits drops that to about 2.4 GB. That's the difference between needing a second GPU and not needing one.
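The back-of-envelope math is easy to check yourself. The sketch below assumes a Llama-70B-style shape (80 layers, 8 grouped-query KV heads, head dimension 128 — my assumptions, not figures from the paper); exact sizes shift with architecture details, which is why it lands near, not exactly on, the article's 12 GB figure.

```python
def kv_cache_gib(context_len, bits_per_value, layers=80, kv_heads=8, head_dim=128):
    # 2 = one key plus one value per head per token per layer
    values = 2 * layers * kv_heads * head_dim * context_len
    return values * bits_per_value / 8 / 2**30  # bits -> bytes -> GiB

fp16 = kv_cache_gib(32_768, 16)    # unquantized baseline
tq   = kv_cache_gib(32_768, 3.2)   # ~5x compression (3 bits plus residual overhead)
print(f"fp16 KV cache: {fp16:.1f} GiB, 3-bit TurboQuant: {tq:.1f} GiB")
```

Either way you slice the assumptions, the cache shrinks by the same factor the weights did going from FP16 to Q4.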

What I find more compelling than the compression ratio is the latency story. Smaller cache means less memory bandwidth consumed during attention computation. Google reports measurable speedups on both prefill and generation, not just memory savings. You're not trading speed for space — you're getting both.

This Isn't Weight Quantization

People keep conflating these, so let me be blunt.

Weight quantization (GGUF, AWQ, GPTQ) shrinks the model file. Your Q4_K_M of Llama 70B turns 140 GB of FP16 weights into ~40 GB on disk and in memory. That's a solved problem with mature tooling — Unsloth's Dynamic 2.0, llama.cpp quants, the whole ecosystem.

KV cache quantization shrinks the runtime memory that grows with every token of context and every request in a batch. This is the part that's been quietly destroying your inference budget, especially if you're doing anything beyond single-turn chat.

They stack. Run a Q4 GGUF model AND apply KV cache compression. Unsloth handles the weights; TurboQuant handles the runtime overhead. Two orthogonal optimizations attacking different halves of the memory equation.
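The stacking arithmetic, using the article's own round numbers (illustrative estimates, not measurements):

```python
weights_gb = 40.0          # Q4 GGUF weights for a 70B model
kv_fp16_gb = 12.0          # 32K-context KV cache, unquantized
kv_tq_gb = kv_fp16_gb / 5  # ~5x TurboQuant compression

print(f"weights quantized only: {weights_gb + kv_fp16_gb:.1f} GB")
print(f"both quantized:         {weights_gb + kv_tq_gb:.1f} GB")
```

Nearly 10 GB back, from the half of the memory equation weight quantization never touched.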

For anyone running RAG pipelines, agentic loops, or multi-turn conversations with long system prompts, the KV cache has been the silent budget killer. Weight quantization was the loudest optimization. KV cache compression might end up being the most consequential.

Can You Use This Today?

Partially. The community moved before Google shipped official code:

tonbistudio/turboquant-pytorch is a clean from-scratch PyTorch implementation. Inference-only, no serving framework integration yet, but it works and includes benchmarks reproducing the paper's claims.

OnlyTerp/turboquant offers another solid implementation with MLX support for Apple Silicon users who want to test on M-series Macs.

A C/CUDA port targeting llama.cpp exists but hasn't been merged upstream. vLLM has a feature request filed with preliminary integration work underway. Ollama — nothing yet.

If you want to experiment, install the PyTorch implementation from source:

git clone https://github.com/tonbistudio/turboquant-pytorch
cd turboquant-pytorch
pip install -e .

Then wrap an already-loaded model:

from turboquant import TurboQuantConfig, apply_turboquant

config = TurboQuantConfig(bits=3, residual_bits=1)  # 3-bit quantization + 1-bit QJL residual
model = apply_turboquant(model, config)  # model: any loaded transformer
# KV cache now compressed during inference

Realistic timeline for production tooling: Q2 2026, maybe Q3 if Google's reference implementation takes its time. The ICLR presentation in late April should accelerate things — there's usually a code drop around conference time.

Where This Is Heading

KV cache compression isn't conceptually new. Multi-Query Attention and Grouped-Query Attention already reduce cache size at the architecture level. But those require design decisions at training time. TurboQuant's edge is that it works on anything already deployed, retroactively.

The compounding story is what gets me. Weight quantization took us from "you need 8 A100s" to "runs on a 3090." KV cache quantization is about to take us from "8K context on consumer hardware" to "32K+ without flinching." Stack both and suddenly that 70B model serving multi-turn agent conversations on a single 24 GB card stops being a fantasy.

Google solving this with a zero-training method means the improvement is essentially free. No compute to apply it, no quality tradeoff to agonize over. The only bottleneck is waiting for llama.cpp and vLLM to ship the plumbing.