A HuggingFace user named Jackrong quietly uploaded a set of models last week that deserve far more attention than they're getting. The pitch: take Claude 4.6 Opus's chain-of-thought reasoning traces, use them as supervised fine-tuning data, and train Qwen3.5 models to think the same way. The result is a family of distilled reasoners — from a 2B dense model up to a 35B mixture-of-experts — that produce structured <think> blocks before answering and run entirely on your hardware. No API key. No cloud. Just reasoning in a GGUF file.

The Distillation Recipe

The concept isn't new. DeepSeek did it with R1 back in early 2025, training smaller models on reasoning traces from larger ones. What makes Jackrong's work notable is the source material and the training discipline.

The dataset consists of Claude 4.6 Opus responses to complex problems, capturing the structured reasoning patterns — step-by-step decomposition, hypothesis testing, self-correction — that characterize Opus-level thinking. The fine-tuning uses a train_on_responses_only strategy: prompt and instruction tokens are masked out of the loss, so gradient updates come exclusively from the model's ability to reproduce the <think> sequences and their corresponding final answers.
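Jackrong's training code isn't published, but the core of response-only masking is a few lines: set the label of every prompt token to the ignore index (-100, PyTorch's cross-entropy default) so loss flows only through response tokens. A minimal sketch with illustrative names, not the repo's actual code:

```python
def mask_prompt_labels(input_ids, prompt_len, ignore_index=-100):
    """Copy input_ids into labels, masking the first prompt_len tokens.

    Tokens labeled ignore_index contribute nothing to the cross-entropy
    loss, so gradients come only from the response tokens -- here, the
    <think> trace and the final answer.
    """
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = ignore_index
    return labels
```

Trainer wrappers like Unsloth expose this as a one-flag option, but under the hood it is exactly this label surgery.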

Training loss dropped from 0.73 to 0.19, a clean convergence curve. But loss alone can't distinguish a model that internalized reasoning structure from one that memorized its training outputs; whether the structure generalizes beyond the training distribution is the harder question. More on that below.

The Model Lineup

The full lineup covers nearly every hardware tier:

Model                       Params   GGUF Q4_K_M  Min VRAM  Sweet spot
Qwen3.5-2B-Distilled        2B       ~1.5 GB      4 GB      Phones, Raspberry Pi
Qwen3.5-4B-Distilled        4B       ~2.8 GB      6 GB      Older laptops
Qwen3.5-9B-Distilled        9B       ~5.5 GB      8 GB      RTX 3060/4060
Qwen3.5-27B-Distilled       27B      ~16 GB       24 GB     RTX 3090/4090
Qwen3.5-35B-A3B-Distilled   35B MoE  ~21 GB       16 GB*    RTX 4080+

*The 35B-A3B is a mixture-of-experts model with only 3B parameters active per forward pass, so effective VRAM usage is lower than the total parameter count implies.
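Those file sizes follow from a simple back-of-envelope rule: Q4_K_M averages roughly 4.5 bits per weight across its mixed tensor types, plus metadata and a few tensors kept at higher precision. A rough estimator (the 8% overhead factor is my own assumption, not a llama.cpp constant):

```python
def approx_gguf_gb(params_billions, bits_per_weight=4.5, overhead=1.08):
    """Estimate a Q4_K_M GGUF file size in GB.

    params_billions * 1e9 weights * (bits / 8) bytes, simplified to GB,
    scaled by an assumed overhead for metadata and higher-precision
    embedding/output tensors.
    """
    return params_billions * bits_per_weight / 8 * overhead
```

For the 9B this lands near 5.5 GB and for the 27B near 16 GB, in line with the table; the 2B comes out a bit under its listed size, which tracks with small models carrying proportionally more embedding weight at higher precision.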

What Happens When You Prompt It

If you've used Claude or DeepSeek R1, the output pattern is familiar. Before answering, the model emits a reasoning trace inside <think>...</think> tags. One important implementation detail: thinking mode stays enabled by default (thinking=1), not silently disabled the way some community finetunes ship it.

The 9B variant handles multi-step math and planning tasks noticeably better than base Qwen3.5-9B. Where base Qwen might jump to an answer and get the intermediate steps wrong, the distilled version decomposes the problem first and catches its own errors mid-stream. Not always — it's still a 9B model — but often enough to matter.

Where It Falls Apart

Domain transfer. Math, code, and logic puzzles are well-represented in the training data, so strong performance there is expected. Ask the 9B distilled model to reason through a complex ethical scenario or do nuanced literary analysis, and the gap between a 9B student and an Opus-class teacher opens into a canyon. The reasoning traces turn superficial: the model writes the <think> block and hits the right structural beats, but the conclusion still misses because the intermediate steps lack the world knowledge to support the scaffolding. Distillation teaches form. It doesn't backfill the facts the smaller model never learned during pretraining.

On a creative writing task I threw at it — "write a persuasive argument against your own position on X" — the 9B produced a <think> trace that correctly identified three counterarguments, then fumbled two of them in the actual response because it couldn't hold the nuance through generation. The 27B fared better, which tracks: more parameters, more room for the reasoning structure to actually do work.

Getting It Running

The MarkTechPost guide from March 26 demonstrates a clean dual-backend setup: one path for GGUF via llama-cpp-python, another for 4-bit HuggingFace via bitsandbytes, unified behind a shared inference interface.

For the GGUF route — probably what most people want:

# Download the 9B quantized model (~5.5 GB)
huggingface-cli download \
  Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF \
  --local-dir ./models

# Serve it with llama-server
llama-server \
  -m ./models/qwen3.5-9b-distilled-q4_k_m.gguf \
  -c 8192 -ngl 99 --port 8080

That gives you an OpenAI-compatible API on localhost. Point any client at http://localhost:8080/v1/chat/completions and you're set.
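For scripting against it, the request is plain JSON over HTTP, no SDK required. A stdlib-only sketch (the model field is arbitrary, since llama-server hosts a single model):

```python
import json
import urllib.request

payload = {
    "model": "local",  # llama-server ignores this; it serves one model
    "messages": [{"role": "user", "content": "Is 91 prime? Think it through."}],
    "temperature": 0.6,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With llama-server running, uncomment to send the request:
# body = json.load(urllib.request.urlopen(req))
# print(body["choices"][0]["message"]["content"])
```

The response body follows the OpenAI chat-completions schema, so any existing client code should drop in unchanged.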

If you'd rather skip GGUF entirely and load straight from HuggingFace with 4-bit quantization:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Passing load_in_4bit as a bare kwarg is deprecated in recent
# transformers releases; wrap it in a BitsAndBytesConfig instead.
model = AutoModelForCausalLM.from_pretrained(
    "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",  # spread layers across available GPUs/CPU
)
tokenizer = AutoTokenizer.from_pretrained(
    "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2"
)

Apple Silicon users have an MLX option too — mlx-community/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit runs the full 27B at roughly 20 tokens/sec on an M4 Max. Not blazing, but usable for interactive work.

The implementation ships with a ChatSession class for multi-turn conversations and parsing utilities that cleanly separate <think> traces from final outputs, which is handy if you're building a UI that wants to show reasoning on demand.
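The repo's parsing helpers aren't documented in detail, but separating trace from answer is a one-regex job. A minimal version (split_reasoning is an illustrative name, not the shipped API):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def split_reasoning(text):
    """Split model output into (reasoning_trace, final_answer).

    Returns (None, text) when no <think> block is present, which also
    covers modes where the model skips the trace entirely.
    """
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    return match.group(1).strip(), text[match.end():].strip()
```

A UI can then render the answer immediately and tuck the trace behind a "show reasoning" toggle.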

So Is This Legit?

Depends on what you mean by legit.

Is it an official release? No. It's a community finetune by one person. The training data composition isn't fully documented, and nobody has put the distilled models through a standardized eval suite like MMLU or SWE-bench yet. Self-reported results only.

Does it produce meaningfully different output from base Qwen3.5? Yes. The reasoning traces are coherent, rarely hallucinate within the <think> blocks, and the final answers improve on problems that benefit from planning. I tested the 9B on a handful of LeetCode mediums and it solved three that base Qwen3.5-9B couldn't — not by being smarter, but by catching an off-by-one error in its own reasoning and correcting course before committing to the answer.

Not Claude-Quality — But That's Not the Point

Anyone claiming these models replicate Opus-class reasoning is overselling it. Distillation compresses; it doesn't replicate. You get the scaffolding of structured thinking at a fraction of the depth. For many practical tasks — code generation, math tutoring, step-by-step explanations — that fraction is enough.

The broader trend here matters more than any single model. We're three months into 2026 and reasoning distillation has gone from a research technique to something a solo developer can do with Unsloth on a rented A100 over a weekend. The Qwen3.5 base models, with their 262K context windows and hybrid Gated Delta Network architecture, make particularly good distillation targets because they already handle long reasoning chains natively.

If you've got 8 GB of VRAM and want a local model that shows its work, grab the 9B GGUF tonight — or the 4B if you're on 6 GB. It won't replace your API subscription, but it might replace your "quick local sanity check" workflow.