Ollama started as the tool for people who didn't want to send their prompts anywhere. Pull a model, run it on your own hardware, keep everything local. That was the deal.

Then they added a cloud button.

What Actually Happened

Starting with v0.12, Ollama introduced cloud models — large models hosted on Ollama's infrastructure that you access through the same CLI and API you've always used. Append :cloud to a model name, and instead of grinding through tokens on your RTX 4060, the request gets proxied to datacenter GPUs running Blackwell hardware.

The integration is seamless in a way that's almost suspicious. Same ollama run, same API endpoints, same streaming behavior. Your local tools, MCP integrations, and scripts don't know the difference. Under the hood, when the server detects a :cloud model tag, it intercepts the request and routes it through an authenticated proxy to their remote inference infrastructure instead of hitting the local scheduler. No separate SDK, no new authentication flow, no config files. Just a suffix.
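From the client's side, that "just a suffix" claim is easy to see in the request body itself. A minimal sketch in Python, using the request shape of Ollama's documented /api/generate endpoint — this only builds the two payloads for comparison rather than sending them, and the model tags are the ones used as examples in this article:

```python
import json

def generate_payload(model: str, prompt: str) -> dict:
    """Build the body for POST /api/generate -- identical shape for local and cloud."""
    return {"model": model, "prompt": prompt, "stream": True}

local = generate_payload("qwen3.5:14b", "Summarize this PR diff")
cloud = generate_payload("glm-5.1:cloud", "Summarize this PR diff")

# The only field the client changes is the model tag; the server decides
# whether to schedule locally or proxy the request upstream.
diff = {k for k in local if local[k] != cloud[k]}
print(diff)  # {'model'}
```

Everything downstream of the payload — streaming, tool calls, your MCP plumbing — stays byte-compatible, which is why existing scripts don't need to know the difference.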

By v0.17, the story got bigger. A new inference engine replaced the old llama.cpp server mode with tighter library integration. Streaming tool calls landed — previously, the system had to wait for the entire output before deciding whether it was a tool call or regular text. Thinking model support arrived for DeepSeek R1 and Qwen 3. But the cloud feature is the one that rewrites what Ollama fundamentally is.

The Models You Can't Run Locally (But Now Can)

The cloud library includes the heavy hitters that need 80GB+ VRAM to run at any reasonable speed:

  • Kimi K2.5 — sitting at #1 on LMArena's leaderboard

  • GLM-5.1 — 744B MoE with 40B active parameters per token

  • MiniMax M2.7 — multimodal heavyweight

  • DeepSeek-V3 — 671B MoE

  • Qwen3-Coder — 480B parameters

These aren't models you're running on consumer hardware. GLM-5.1's 744 billion parameters alone would need roughly 1.5TB of memory at FP16, and even aggressive 4-bit quantization still demands hundreds of gigabytes. The :cloud tag turns them into something you can test from a MacBook Air.

Unlike local models, cloud models don't require ollama pull. The system validates availability via a remote status endpoint and streams directly. First token latency is higher than local inference (network round trip), but sustained throughput on a 671B model blows away anything a single consumer GPU can manage.

What It Costs

Plan   Monthly   What You Get
Free   $0        $10 shared credits/month (30K Llama 4 Maverick requests)
Pro    $20       Higher session and weekly limits
Max    $100      Heavy daily use capacity

Session limits reset every 5 hours, weekly limits every 7 days. Dedicated endpoints are also available if you want guaranteed capacity: $0.80/hour for an A10G, $2.40/hour for an A100 40GB, $4.80/hour for an A100 80GB. Usage-based metered pricing is coming but not live yet.
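To put the dedicated-endpoint rates in context, a quick back-of-the-envelope — the usage pattern here (a full working month of business hours) is an invented assumption, not a quoted figure:

```python
A100_80GB_RATE = 4.80  # dollars per hour, dedicated endpoint

# Hypothetical usage: 8 hours/day, 22 working days/month
hours_per_month = 8 * 22
monthly_cost = hours_per_month * A100_80GB_RATE
print(f"${monthly_cost:.2f}/month")  # $844.80/month
```

At that utilization, a single dedicated A100 80GB costs more than eight Max subscriptions, so the flat tiers win unless you genuinely need guaranteed capacity.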

The free tier is genuinely useful for evaluation. Enough credits to test whether GLM-5.1 actually handles your agent workflow better than your local Qwen3.5 14B before you commit money. The Pro tier at $20/month lands in the same range as ChatGPT Plus, except you're hitting open-weight models through your own toolchain.

The Privacy Question That Got Quieter

Here's the thing that should bother more people than it does: Ollama was the tool for local inference. That was its identity. Adding cloud models doesn't remove local support, but it changes the pitch from "never send your data anywhere" to "send your data to us, we're cool."

Some people chose Ollama specifically because they work with sensitive data — medical records, proprietary code, legal documents. The :cloud suffix routes through Ollama's servers. That's a fundamentally different trust model than running Qwen3 14B on your own metal.

But the real story is economic, not ideological. Most developers aren't privacy absolutists. They picked local inference because API pricing for frontier models gets expensive at scale, and the open-weight models that fit on a 24GB card handle most tasks fine. Cloud changes the calculus: prototype locally with a small model, validate against a frontier-scale open model via cloud, then decide whether to invest in the hardware to self-host it. A workflow that required different tools, APIs, and code paths before now requires one character.

When to Stay Local, When to Phone Home

# Same CLI, same everything. One suffix.
ollama run qwen3.5:14b "Summarize this PR diff"
ollama run glm-5.1:cloud "Review this architecture for race conditions"

Phone home when you're evaluating whether a 400B+ model justifies the hardware investment, running occasional complex reasoning that your local 7B fumbles, prototyping agent workflows that need strong tool calling, or benchmarking your fine-tuned model against a frontier baseline.

Stay local when you're handling regulated or sensitive data, need high-throughput inference (vLLM on your own hardware still crushes cloud latency under concurrent load), your 14B–32B model already covers the use case, or you're making thousands of requests daily and watching costs.

The 80/20 split works well here: local model handles the bulk, cloud handles the hard cases. Ollama doesn't have automatic routing for this yet — you choose per-request — but the plumbing is there for someone to build it.
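Nothing stops you from building that router yourself today. A hypothetical sketch — the heuristic, thresholds, and keyword list are all invented for illustration, and Ollama has no such feature built in:

```python
def pick_model(prompt: str,
               local: str = "qwen3.5:14b",
               cloud: str = "glm-5.1:cloud") -> str:
    """Route the easy 80% locally, escalate the hard 20% to cloud.

    Crude heuristic: very long prompts or reasoning-heavy keywords go
    remote. A real router might use a classifier or a confidence score
    from the local model instead.
    """
    hard_markers = ("prove", "architecture", "race condition", "refactor")
    if len(prompt) > 4000 or any(m in prompt.lower() for m in hard_markers):
        return cloud
    return local

print(pick_model("Summarize this PR diff"))                       # qwen3.5:14b
print(pick_model("Review this architecture for race conditions")) # glm-5.1:cloud
```

Because local and cloud requests share one API, the router's output is just a model string — the rest of the request code doesn't change.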

Where This Goes

Ollama wants to be the universal inference layer, GPU location be damned. Between the new engine, streaming tool calls for MCP-based agents, thinking model support, and now cloud offloading, the trajectory is clear. The "local" in "local LLM tool" is becoming optional.

Whether that's a betrayal or an evolution depends entirely on why you showed up in the first place. If you came for the privacy, the local option hasn't gone anywhere. If you came for the simplicity of ollama run compared to wrestling with API keys and SDKs, the cloud feature just made your life better.

The most interesting question isn't whether Ollama should have done this. It's why it took until 2026 for someone to make the boundary between local and cloud inference this invisible.