Standard speculative decoding has been around for a while. A small draft model guesses the next tokens, the big model verifies them in one pass, and if the guesses land, you skip a bunch of sequential forward passes. Neat trick. Except on Apple Silicon, the implementations have been janky — fragile wrappers, CUDA-first code with Metal as an afterthought, performance that barely justified the complexity.

DFlash changed the math. And now someone ported it natively to MLX.

Block diffusion isn't your usual speculative decoding

The difference is architectural and it matters.

Traditional speculative decoding (EAGLE, Medusa, and friends) uses an autoregressive drafter — a tiny model that generates candidate tokens one at a time, just faster than the target. The cost scales linearly with how many tokens you draft. Want to draft 16 tokens? That's 16 sequential forward passes through the drafter.

DFlash uses a block diffusion model instead. The drafter generates all 16 candidate tokens in a single denoising step. One forward pass, 16 tokens. The cost is essentially flat regardless of draft length.
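A toy cost model makes the economics concrete. This is illustrative only — it counts forward passes and ignores that a drafter pass is much cheaper than a target pass — but it shows why draft length is nearly free for block diffusion:

```python
# Toy cost model: forward passes spent to propose and verify a K-token draft.
# Both schemes pay one target-model pass to verify the block; they differ
# only in how many drafter passes the block costs.
def drafter_passes(K: int, scheme: str) -> int:
    if scheme == "autoregressive":    # EAGLE/Medusa-style: one pass per drafted token
        return K
    if scheme == "block_diffusion":   # DFlash-style: one denoising pass for all K
        return 1
    raise ValueError(scheme)

def total_passes(K: int, scheme: str) -> int:
    return drafter_passes(K, scheme) + 1   # + one target verification pass

for K in (4, 8, 16):
    print(K, total_passes(K, "autoregressive"), total_passes(K, "block_diffusion"))
# autoregressive grows with K (5, 9, 17); block diffusion stays at 2
```

Doubling the draft length doubles the autoregressive drafter's work but leaves the block-diffusion drafter untouched.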

This is why the speedup numbers aren't incremental. Z Lab's paper (arXiv 2602.06036) reports over 6x lossless acceleration over standard autoregressive decoding and a 2.5x advantage over EAGLE-3 across math, code, and chat benchmarks. "Lossless" means the output is identical — speculative decoding is an exact inference technique, not an approximation. Same tokens, fewer forward passes.
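For intuition, here's a minimal sketch of the greedy-decoding case with a toy stand-in for the target model (real implementations also handle sampling via rejection sampling). Drafted tokens are accepted only while they match what the target would have produced anyway, so the emitted stream is identical to plain greedy decoding:

```python
from typing import Callable, List

def verify_greedy(draft: List[int],
                  target_next: Callable[[List[int]], int],
                  prefix: List[int]) -> List[int]:
    """Verify a drafted block against the target under greedy decoding.
    Accept drafted tokens while they equal the target's own greedy pick;
    at the first mismatch, emit the target's token instead. Either way,
    the output is exactly what plain greedy decoding would have produced."""
    out: List[int] = []
    ctx = list(prefix)
    for t in draft:
        want = target_next(ctx)      # target's greedy choice in this context
        if t != want:
            out.append(want)         # correction token from the target
            return out
        out.append(t)
        ctx.append(t)
    out.append(target_next(ctx))     # all accepted: the verify pass yields a bonus token
    return out

# Toy "target model": next token is always (last token + 1) mod 10.
target = lambda ctx: (ctx[-1] + 1) % 10
print(verify_greedy([2, 3, 9], target, [1]))  # → [2, 3, 4]: 9 rejected, corrected to 4
```

Note that rejection never costs correctness, only speed: the worst case degrades to one target pass per token, the same as not speculating at all.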

The MLX port

dflash-mlx is a ground-up implementation for Apple Silicon by Aryagm. Not a wrapper around CUDA code. The speculative decoding primitives are built directly on Metal, and the drafter conditions on intermediate layer activations from the target model — not just the final logits. That conditioning path is what keeps acceptance rates high enough to make the whole thing worthwhile.
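As a purely illustrative sketch — toy numpy stand-ins, not the repo's actual architecture or API — the conditioning interface looks roughly like this: the drafter reads an intermediate-layer activation from the target and emits logits for the whole draft block in a single pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, vocab = 64, 8, 100            # hidden size, draft block length, toy vocab

# Stand-ins for learned weights (the real drafter is a small network).
W_cond = rng.normal(size=(d, d)) * 0.1      # projects the target's activation
W_out  = rng.normal(size=(d, vocab)) * 0.1  # maps drafter states to logits

def draft_block(target_hidden: np.ndarray) -> np.ndarray:
    """One 'denoising' pass: condition K masked draft positions on the
    target's intermediate activation, and emit logits for all K candidate
    tokens at once (shape K x vocab) — no sequential loop over positions."""
    cond = target_hidden @ W_cond            # (d,)
    mask_states = np.tile(cond, (K, 1))      # K positions share the conditioning
    return mask_states @ W_out               # (K, vocab) in a single pass

hidden = rng.normal(size=(d,))               # pretend activation from the target
logits = draft_block(hidden)
tokens = logits.argmax(axis=-1)              # K candidate tokens for verification
print(tokens.shape)                          # (8,)
```

The key property the sketch preserves: richer conditioning than final logits alone, and a draft cost that doesn't depend on K.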

The headline number: Qwen3.5-9B hitting 85 tokens per second on an M5 Max. That's a 3.3x speedup over standard autoregressive generation of the same model on the same hardware. For context, that's faster than most people get running a 4-bit quantized model through Ollama on the same chip.
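A quick sanity check on the arithmetic — the claimed multiplier implies what the unassisted baseline must be:

```python
reported = 85.0   # tok/s with DFlash on the M5 Max (from the post)
speedup = 3.3     # claimed multiplier over plain autoregressive decoding
baseline = reported / speedup
print(round(baseline, 1))  # → 25.8 tok/s implied for the same model without DFlash
```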

Getting started takes about 30 seconds:

git clone https://github.com/Aryagm/dflash-mlx.git
cd dflash-mlx
uv sync
uv run dflash-mlx --max-new-tokens 256

First run downloads the default model pair (Qwen3-4B BF16, ~12 GB total for target + draft). You can swap in the 8B:

uv run dflash-mlx \
  --target-model z-lab/Qwen3-8B-DFlash-b16 \
  --draft-model z-lab/Qwen3-8B-DFlash-b16-drafter \
  --max-new-tokens 512

There's dflash-mlx-chat for interactive sessions and a Python API for integration:

from dflash_mlx import DFlashGenerator

runner = DFlashGenerator()  # loads the default target + drafter pair
result = runner.generate(
    "Explain quicksort in three sentences.",
    max_new_tokens=128,
)
print(result)

Why this hits differently on Apple Silicon

Memory bandwidth is the bottleneck for LLM inference, not compute. This has been true for years, but it's especially visible on unified-memory architectures where the GPU and CPU share a single memory pool.
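A rough roofline estimate shows why. Each autoregressive decode step has to stream every weight from memory once, so bandwidth alone caps steps per second — the numbers below are illustrative, not measurements:

```python
def max_steps_per_s(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    """Roofline-style upper bound for small-batch decoding: one decode step
    must stream the whole model from memory, so steps/s <= bandwidth / model
    size. Compute is treated as free, which is roughly true at batch size 1."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# Illustrative: a 4B-parameter model in BF16 (2 bytes/param) on a 153 GB/s bus.
print(round(max_steps_per_s(153, 4, 2), 1))  # → 19.1 sequential steps per second
```

Speculative decoding's whole trick is to emit several tokens per streamed pass, lifting tok/s above this per-step ceiling without touching the hardware.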

The M5 Max pushes 153 GB/s — 28% more than M4 Max. Apple also baked Neural Accelerators inside each GPU core for matrix multiplication, yielding up to 4x speedup on time-to-first-token. MLX was built to exploit all of this, and benchmarks consistently put it 20–87% ahead of llama.cpp for generation on models under 14B parameters.

DFlash compounds the advantage. Because the drafter's cost is flat (one forward pass regardless of draft length), the technique benefits disproportionately from Apple's wide memory bus. Traditional autoregressive drafters still need sequential memory reads proportional to draft length. Block diffusion doesn't. On hardware where memory bandwidth is the ceiling, cutting sequential drafter traffic by 8–16x is the difference between "marginally faster" and "genuinely usable."

The gap narrows above 27B, where you saturate bandwidth regardless of technique. But for the 4B–14B sweet spot — which is where most local inference actually happens — DFlash on MLX is the fastest way to run a model on a Mac right now.

The rough edges

Qwen3.5 support exists but isn't fully optimized. The hybrid attention stack (mixing recurrent linear attention with standard attention) complicates the drafter's conditioning path. It runs. You won't see the same 3x multipliers.

Model coverage is narrow. Z Lab has published DFlash-compatible weights for Qwen3-8B and Qwen3-Coder-30B-A3B on Hugging Face. If your preferred model isn't on the list, you'll need to train your own drafter — there's a guide in the repo's ADDING_MODELS.md, but calling it trivial would be generous.

One more thing: the benchmarks separate warmup from measurement. MLX has kernel compilation, lazy evaluation, and graph caching that inflate the first few runs. The reported tok/s is steady-state performance, which is fair for sustained generation but means your first prompt will feel slower than the numbers suggest.
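If you benchmark it yourself, separate warmup the same way. A minimal harness (the workload below is a stand-in — swap in your own generate call):

```python
import time

def bench(fn, warmup: int = 3, iters: int = 10) -> float:
    """Return steady-state seconds per call, discarding warmup runs.
    Kernel compilation, lazy evaluation, and graph caching all land in the
    first few calls; timing them would understate sustained throughput."""
    for _ in range(warmup):
        fn()                          # compile/caching runs, not timed
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

# Stand-in workload; replace with e.g. a runner.generate(...) call.
secs = bench(lambda: sum(range(100_000)), warmup=2, iters=5)
print(f"{secs * 1e3:.3f} ms/call (steady state)")
```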

Where this is going

A year ago, running an LLM on a MacBook was a novelty. Slow, but it worked. Now MLX is a serious inference runtime, Ollama ships with an MLX backend, and techniques like DFlash are delivering throughput that would've required a dedicated GPU server eighteen months ago.

85 tokens per second from a 9B model isn't just a nice benchmark. It's the point where local inference stops feeling like a compromise — code completion, chat, document summarization all become responsive enough that you stop reaching for the API. If you've got an M-series Mac gathering dust as an inference platform, this is the week to dust it off.