Everyone who's fine-tuned a model for a specific task has hit the same wall: the model gets great at the new thing, and quietly terrible at everything else. You fine-tune Llama for medical Q&A, and suddenly it can't write Python anymore. You teach it to extract invoices, and it forgets how percentages work. This is catastrophic forgetting, and until recently, your options were "accept it" or "spend 10x the compute on full retraining."
OSFT — Orthogonal Subspace Fine-Tuning — just shipped as part of Hugging Face PEFT, backed by Red Hat's AI Innovation team. It might be the cleanest solution to this problem I've seen.
What catastrophic forgetting actually looks like
You take a 7B model that scores 65% on GSM8K math. You fine-tune it on 50k customer support conversations. It now handles support tickets beautifully. But run GSM8K again: 41%. Your model didn't just learn customer support — it unlearned math.
This isn't a toy problem. A January 2026 study from Tsinghua dug into the mechanics: 15–23% of attention heads undergo severe disruption during standard fine-tuning, with lower layers taking the worst hit. And here's the counter-intuitive part — in the 1B to 7B parameter range, bigger models forget more, not less.
LoRA helps, but only partially. Because adapters sit alongside frozen weights, they don't directly overwrite existing parameters. But the interaction between adapter outputs and the frozen base still shifts internal representations enough to cause drift. Most practitioners report 5–15% degradation on unrelated benchmarks after LoRA fine-tuning, depending on dataset size and learning rate.
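The drift mechanism is easy to see in miniature. Here's a toy sketch with random matrices (not a real LoRA setup, and the dimensions and scaling are made up for illustration): the base weight W never changes, but everything downstream sees the adapted output, not the frozen one.

```python
import torch

torch.manual_seed(0)
d, r = 16, 4
W = torch.randn(d, d)              # frozen base weight: never updated
A = 0.1 * torch.randn(r, d)        # LoRA down-projection (trainable)
B = 0.1 * torch.randn(d, r)        # LoRA up-projection (trainable)
x = torch.randn(d)

base_out = W @ x                   # what the base model alone would produce
lora_out = base_out + B @ (A @ x)  # adapter adds a low-rank delta on top

# W is untouched, yet every layer after this one now receives a shifted
# representation. That shift compounds through depth, which is why frozen
# base weights alone don't prevent drift on unrelated tasks.
drift = ((lora_out - base_out).norm() / base_out.norm()).item()
```

The delta is small per layer, but it changes the inputs every subsequent layer was calibrated for.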
The orthogonal trick
OSFT doesn't add adapters. It identifies which directions in weight space the model is already using for existing capabilities, then constrains new learning to the orthogonal complement of those directions.
The intuition: imagine the model's knowledge as vectors in a high-dimensional space. Math ability occupies certain directions. Language understanding occupies others. OSFT decomposes each weight matrix using SVD (singular value decomposition), identifies the high-singular-value directions — the ones carrying critical existing knowledge — freezes those, and only allows gradient updates in the remaining low-singular-value subspace.
The key parameter is unfreeze_rank_ratio. Set it to 0.25, and the bottom 25% of singular-value directions get unfrozen for new learning. The top 75%, the directions carrying the model's existing intelligence, stay locked. Gradients literally cannot flow into the preserved subspace because the projection is orthogonal by construction — not approximately orthogonal, not "mostly" orthogonal, mathematically orthogonal.
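As a rough illustration, the freeze can be expressed as projecting each gradient onto the complement of the top singular directions. This is a simplified single-matrix sketch under my own assumptions, not PEFT's actual implementation (which also has to handle the right singular vectors and per-layer bookkeeping):

```python
import torch

def constrain_gradient(weight, grad, unfreeze_rank_ratio=0.25):
    """Sketch: zero out the gradient component that would move the weight
    along its high-singular-value (preserved) directions."""
    # Decompose the weight: W = U diag(S) V^T.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    # Freeze the top (1 - ratio) fraction of singular directions.
    k = int(len(S) * (1 - unfreeze_rank_ratio))
    U_top = U[:, :k]
    # Project out the frozen left subspace: grad <- (I - U_top U_top^T) grad.
    # Updates can no longer move the weight along those directions.
    return grad - U_top @ (U_top.T @ grad)
```

After this projection, U_top.T @ grad is numerically zero, which is the "mathematically orthogonal" guarantee in miniature: no learning signal reaches the preserved directions.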
This is structurally similar to MiLoRA (which also targets low-singular-value components), but differs in two important ways. First, OSFT uses orthogonal projection to guarantee gradients can't leak into the preserved subspace. Second, rank selection is adaptive per-layer using an input-output cosine similarity importance score, rather than fixing a global rank for all layers. Some layers need more room to learn; others are already well-utilized. OSFT figures this out automatically.
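The per-layer adaptivity can be pictured with a toy scoring function. Everything here is my own hypothetical sketch, not the library's API: the function name, the mapping (high input-output similarity read as an under-utilized layer that gets a larger unfrozen ratio), and the constants are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def adaptive_ratio(x_in, x_out, base_ratio=0.25, spread=0.15):
    """Hypothetical per-layer rank-ratio heuristic (not OSFT's real scoring).
    A layer whose output stays directionally close to its input is treated
    as having spare capacity and given a larger unfrozen subspace."""
    cos = F.cosine_similarity(x_in.flatten(), x_out.flatten(), dim=0).abs().item()
    ratio = base_ratio + spread * (2 * cos - 1)  # map cos in [0,1] to +/- spread
    return min(max(ratio, 0.05), 0.8)            # clamp to a sane range
```

The real scoring and scaling are internal to the implementation; the point is only that the unfrozen fraction is decided per layer rather than fixed globally.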
The result: you can stack new capabilities with near-zero degradation on existing benchmarks.
Five lines to try it
Red Hat's Training Hub wraps OSFT into a single function call:
```
pip install training-hub[cuda]
```

```python
from training_hub import osft

result = osft(
    model_path="meta-llama/Llama-3.3-70B-Instruct",
    data_path="my_support_data.jsonl",
    ckpt_output_dir="./osft-output",
    unfreeze_rank_ratio=0.25,
    effective_batch_size=16,
    max_seq_len=2048,
    learning_rate=5e-6,
)
```
That's the whole training loop. Training Hub handles the SVD decomposition, gradient projection, and checkpointing internally. The unfreeze_rank_ratio of 0.25 is a sane default — lower preserves more existing capability but learns slower; higher learns faster but risks more forgetting.
VRAM-wise, OSFT sits closer to full fine-tuning than LoRA. You're training full weight matrices, just with constrained gradients. For a 7B model, budget around 40GB. For 70B, you're looking at multi-GPU territory or the QLoRA training path through Training Hub's Unsloth backend, which cuts VRAM by roughly 70%.
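One way the ~40GB figure pencils out (a back-of-envelope assumption on my part, not a published breakdown): bf16 weights and gradients at 2 bytes per parameter each, plus an 8-bit Adam-style optimizer at roughly 2 bytes per parameter, before activations and overhead.

```python
def train_vram_gb(params_billion, bytes_weights=2, bytes_grads=2, bytes_optim=2):
    """Rough full-weight training footprint in GB. Defaults assume bf16
    weights/grads and an 8-bit optimizer (two 1-byte moments per param).
    Activations and framework overhead are extra, so real usage runs higher.
    Illustrative assumptions, not numbers from the OSFT docs."""
    return params_billion * (bytes_weights + bytes_grads + bytes_optim)

train_vram_gb(7)    # ~42 GB: close to the ~40GB budget above
train_vram_gb(70)   # ~420 GB: firmly multi-GPU territory
```

Swap in 8 bytes for a full-precision Adam and the 7B estimate jumps to ~84GB, which is why the optimizer choice matters as much as the model size.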
When to use what
Quick heuristic:
LoRA/QLoRA — One task, limited VRAM, existing capabilities are expendable collateral.
Full SFT — Massive compute, maximum single-task performance, you're fine rebuilding from the base checkpoint each time.
OSFT — You're adding capabilities sequentially. Support tickets this month, legal docs next month, code review the month after. Each skill should stack without destroying what came before.
That sequential use case is where OSFT really separates itself. It carries forward only a fixed-size buffer from the previous task, so the memory footprint doesn't grow with the number of tasks you've trained on. Most continual learning methods accumulate replay buffers proportional to task count; OSFT doesn't.
The trade-offs nobody mentions
OSFT is slower. The SVD decomposition at initialization adds overhead, and the orthogonal projection on every gradient step isn't free. On a single A100, expect roughly 1.5x the wall-clock time of standard LoRA for the same dataset.
And unfreeze_rank_ratio is load-bearing in a way that isn't obvious from the docs. Set it too low — say 0.05 — and the model barely learns anything new because there's almost no unfrozen subspace to work with. Set it too high — 0.8 — and you're essentially doing full SFT with extra computational overhead. The sweet spot varies by model and task. Red Hat's recommendation of 0.25 is decent, but if you have the compute budget, sweep 0.1 to 0.4 in increments of 0.05. The difference between 0.15 and 0.30 can be dramatic on retention benchmarks.
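A sweep over that range could look like the following. This is a usage sketch built on the osft() call shown earlier, not a tested recipe; evaluate_checkpoint is a placeholder you'd supply, and it should score both the new task and a retention benchmark like GSM8K, since the sweet spot trades learning speed against forgetting.

```python
from training_hub import osft

results = {}
for ratio in (0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40):
    out_dir = f"./osft-sweep/ratio-{ratio:.2f}"
    osft(
        model_path="meta-llama/Llama-3.3-70B-Instruct",
        data_path="my_support_data.jsonl",
        ckpt_output_dir=out_dir,
        unfreeze_rank_ratio=ratio,
        effective_batch_size=16,
        max_seq_len=2048,
        learning_rate=5e-6,
    )
    # evaluate_checkpoint is hypothetical: report new-task accuracy AND
    # retention, then pick the ratio that balances both.
    results[ratio] = evaluate_checkpoint(out_dir)
```

Each run is a full training pass, so this is only worth it when the model will be deployed long enough to amortize the search cost.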
There's also the question of what "important directions" means for your specific deployment. The SVD-based importance scoring works well for general-purpose models, but if you've already fine-tuned the model once and the singular value landscape has shifted, the second round of OSFT is working with a different importance map. Red Hat hasn't published ablations on deep stacking — three, four, five sequential OSFT passes — so treat anything beyond two rounds as experimental.
Where it fits right now
OSFT lives in Hugging Face PEFT under the osf module. Training Hub bundles it with Unsloth's CUDA kernels for roughly 2x speed over a vanilla implementation. If you're on OpenShift, the Kubeflow Trainer integration in Red Hat AI 3.3 distributes OSFT jobs across a cluster — but most people reading this probably don't need that.
The real audience for OSFT isn't the one-off fine-tuner. It's anyone maintaining a model that needs to keep learning without periodically regressing. If your workflow involves monthly fine-tuning on new data while praying the last round's improvements survive, OSFT just became the default answer to a question you've been asking for two years.