Five multi-model patterns that cut token costs — and keep your data where you want it

Frontier API pricing has climbed sharply over the past year, and usage has climbed faster. If you've watched a single agentic session burn through a day's token budget, you're not alone. Multi-step sessions, repo-wide iteration, and long context windows consume tokens at rates flat pricing never anticipated. A couple of commits on a complex codebase can eat half a month's budget.

The obvious response is to switch to a cheaper model. But a single cheaper model recreates the same problem one tier down: you're still paying one rate for everything, hard reasoning and autocomplete alike. Some tasks genuinely need the strongest frontier model. A sub-agent exploring your codebase doesn't.

In this post, you'll learn why you should stop using one model for every task, and why every routing decision is also a data-boundary decision. The same architecture that cuts your token bill determines whether your data ever leaves the device, and whose jurisdiction it lands in when it does.

We'll work through five architectural patterns, distinguished by what triggers the escalation to a stronger model:

Feature routing — assign model tiers by feature, upfront
Cascade — try the cheap model first, escalate on failure
Advisor — a cheap executor consults a stronger planner
Specialist — a generalist hands off to a model with a capability it lacks
Draft-and-verify — generate on-device, verify remotely

Three dimensions of every routing decision

While cost is on the minds of many right now, it's not the only aspect to consider when moving to a multi-model stack. Cost, capability, and data sovereignty all decide where a task should run, and each may pull you in a different direction.

Cost. Calling a hosted Gemma 4 costs less than €0.50 per million tokens, while a strong reasoner from one of the big AI labs can run €15–50 per million output tokens. Even within a single provider, the largest model can cost 10 times as much as the smallest. For high-volume agent work, routing each task to the right model can drastically reduce your token costs.

Capability. Some tasks still require the strongest available reasoning. But as smaller models become increasingly capable, the line is no longer clear-cut. Default to cheap models and escalate only when the evidence says you must.

Data sovereignty. Where a task runs determines which rules apply. GDPR restricts transfers of personal data outside the EU/EEA, and the EU AI Act adds obligations on top. On-device inference and EU-hosted sovereign inference both keep the data inside a jurisdiction you choose, by design, while US-hosted APIs require added legal attention. This is the second thread we'll track through every pattern, alongside cost.

Note that running models on-prem isn't necessarily cheaper. A local 70B model on a rented GPU can cost more than API calls at low volume; the savings appear once volume is high enough that fixed hardware cost beats marginal token cost. And none of the patterns below requires local inference — routing between a cheap and a strong hosted model is still a multi-model stack, and at low volume it's often the right one.
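To make the arithmetic concrete, here is a rough sketch of both calculations: the blended price of a routed workload, and the monthly volume at which fixed hardware starts to beat per-token pricing. The €0.50 and €15 figures come from the ranges quoted above; the 80% share and the €1,000 monthly hardware cost are assumptions for illustration only. The sketch ignores the difference between input and output pricing, and it assumes the hardware can actually serve the break-even volume.

```python
def blended_price(cheap_price: float, strong_price: float, cheap_share: float) -> float:
    """Average price per million tokens when `cheap_share` of the token
    volume goes to the cheap model and the rest to the strong one."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * strong_price


def break_even_tokens(fixed_monthly_cost: float, api_price_per_million: float) -> float:
    """Monthly token volume above which a fixed hardware cost is lower
    than paying the API's marginal price per token."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return fixed_monthly_cost / api_price_per_million * 1_000_000


# Illustrative numbers only (EUR per million tokens, EUR per month).
all_strong = blended_price(0.50, 15.0, cheap_share=0.0)
routed = blended_price(0.50, 15.0, cheap_share=0.8)
threshold = break_even_tokens(fixed_monthly_cost=1000.0, api_price_per_million=0.50)

print(f"all strong: {all_strong:.2f}, routed: {routed:.2f} per million tokens")
print(f"break-even volume: {threshold:,.0f} tokens per month")
```

With these placeholder numbers, the hardware only pays for itself at billions of tokens per month, which is why low-volume workloads often stay on hosted models.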

Where to run your models

The question is no longer "which model is best?" It's "what does the task need?" — and raw capability is only one of those needs. A tight latency budget, or a requirement that data never leaves the device, narrows the hardware choice just as firmly as reasoning difficulty does. Think of it as a resource spectrum rather than a hierarchy of developer machines.

Start with the constraints of where the model runs:

Edge and embedded devices — Tiny models (≤4B parameters, quantised) on phones, gateways, single-board computers, and industrial controllers. Well-defined tasks: classification, intent detection, summarisation, and autocompletion. No network round-trip, and no data leaving the device — often the whole point in IoT deployments, where connectivity is intermittent or the data is sensitive by default.
Consumer hardware with a dedicated GPU — Small models (4–40B, quantised) for conversations and straightforward coding tasks.
Workstation or on-prem server — Medium models (40–150B). Large models only for tasks that genuinely need them.

Large models (>150B) put you in multi-GPU territory: expensive, supply-constrained, and rarely worth owning. That's where hosted inference wins on economics, not just convenience.
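The size bands above can be written down as a small lookup. This is only a sketch: the tier names are made up for illustration, the thresholds are the approximate parameter counts from the list, and real placement also depends on quantisation, memory, and latency budget.

```python
def hardware_tier(params_billions: float) -> str:
    """Map a model's parameter count (in billions) to the rough hardware
    band described above. Tier names are hypothetical labels."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    if params_billions <= 4:
        return "edge"            # phones, gateways, single-board computers
    if params_billions <= 40:
        return "consumer-gpu"    # consumer hardware with a dedicated GPU
    if params_billions <= 150:
        return "workstation"     # workstation or on-prem server
    return "hosted"              # multi-GPU territory, usually hosted


for size in (3, 7, 70, 400):
    print(size, hardware_tier(size))
```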

Five patterns for multi-model architectures

So how do you manage and orchestrate workflows across multiple models? As the introduction previewed, what separates the five patterns is where the decision to escalate lives. That's worth dwelling on, because it's the hardest problem in any multi-model stack: there's no robust way for a small model to know it's out of its depth. Rather than relying on each model to admit where it falls short, each pattern externalises that judgement: into static configuration (Pattern 1), a deterministic check (Pattern 2), or a stronger model's review (Patterns 3–5).

Pattern 1: Feature-deterministic routing

The simplest and most common pattern: assign tiers permanently by feature or role. Autocomplete runs on-device, chat runs on hosted open weights, and the hardest reasoning tasks are routed to the strongest available model. The autocomplete assignment isn't only about cost: a 7B parameter model on Apple Silicon returns completions in under 100 ms, where a round-trip to a remote API takes 300–2000 ms. For keystroke-level features — or control loops on a device — on-device isn't optional.

Continue lets you configure separate models for autocomplete, chat, and edit. Zed splits the same way, pairing local edit prediction with a hosted agent panel.
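In code, feature-deterministic routing is little more than a lookup table. The sketch below is not the configuration format of Continue or Zed; the feature names and model identifiers are placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str      # hypothetical model identifier
    location: str   # "on-device" or "hosted"


# Static tier assignment: decided once, at build time.
ROUTES = {
    "autocomplete": Route("local-small", "on-device"),
    "chat": Route("hosted-open-weights", "hosted"),
    "reasoning": Route("hosted-frontier", "hosted"),
}


def route_for(feature: str) -> Route:
    """Return the configured route, failing loudly for unknown features
    so that nothing is silently sent to a default provider."""
    try:
        return ROUTES[feature]
    except KeyError:
        raise ValueError(f"no route configured for feature {feature!r}") from None


print(route_for("autocomplete"))
```

Failing on unknown features rather than falling back to a default keeps the data boundary explicit: a new feature cannot reach a hosted model until someone has decided that it should.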

The advantage is debuggability: there are no surprise escalations, and the routing logic stays auditable as static configuration. The disadvantage is inflexibility: tier assignments never adapt to the individual query. If that rigidity starts to hurt, classifier-based pre-routing adds a small classifier that predicts whether the task needs the stronger model. RouteLLM reports ~95% of the strong model's quality while escalating only a minority of queries, though most of the gain comes from routing at all, not from the specific algorithm.
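Classifier-based pre-routing can be sketched as a threshold on a predicted difficulty score. This is not RouteLLM's API; the `difficulty` callable, the 0.7 threshold, and the length-based stand-in scorer are all assumptions for illustration.

```python
from typing import Callable


def pre_route(prompt: str,
              difficulty: Callable[[str], float],
              threshold: float = 0.7) -> str:
    """Pick a tier before any model has answered.

    `difficulty` is a hypothetical scorer returning a value in [0, 1];
    in practice it would be a small trained classifier.
    """
    score = difficulty(prompt)
    return "strong" if score >= threshold else "cheap"


# Stand-in scorer for demonstration only: longer prompts count as harder.
def toy_difficulty(prompt: str) -> float:
    return min(len(prompt) / 1000, 1.0)


print(pre_route("rename this variable", toy_difficulty))
```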

The savings depend on your workload mix: if 80% of your token volume is autocomplete and chat, moving those features off the frontier model can drastically reduce your spending.

If you're in a regulated industry, this isn't just cheaper — it's the difference between "we keep everything on-device" and "we need a DPA with our API provider."

What leaves the device: entirely within your control. Choose the model and provider for each feature based on how sensitive its data is.

Feature routing is simple, but the decision is locked in at build time, and even a classifier only predicts difficulty before seeing an answer. Pattern 2 moves the decision to after the attempt: try cheap, check the result, escalate on failure.

Pattern 2: Cascade

Try the on-device model first. Escalate only when a check fails.

FrugalGPT demonstrated the ceiling: up to 98% cost reduction while matching the strong model's quality. (AutoMix layers a learned router on the same idea, if you want to go deeper.) In production, the simplest form is retry-on-failure: the on-device model writes code; if tests fail, re-send to a stronger model. This is also the canonical edge architecture: the on-device model handles what it can, and only the ambiguous cases escalate to a gateway or hosted model — which is why the data-boundary column matters double when the device sits in a factory or a home.

The verifier matters: free, deterministic checks make the best gates. Does the code compile? Does the test suite pass? Does the output parse? What doesn't work is asking the model how confident it is. Verbalised confidence is systematically inflated (instruction tuning optimises models to sound sure, which degrades calibration), and smaller models are the worst calibrated of all. Gate on checks, not self-reports.
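A minimal sketch of retry-on-failure with a deterministic gate might look like this. The `cheap` and `strong` callables stand in for whatever model clients you use; they, the escalation prompt wording, and the JSON check are assumptions for illustration.

```python
import json
from typing import Callable


def parses_as_json(text: str) -> bool:
    """A free, deterministic gate: does the output parse?"""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def cascade(prompt: str,
            cheap: Callable[[str], str],
            strong: Callable[[str], str],
            check: Callable[[str], bool]) -> tuple[str, bool]:
    """Try the cheap model first; escalate only if the check fails.

    Returns the final output and whether an escalation happened.
    """
    draft = cheap(prompt)
    if check(draft):
        return draft, False
    # Only failed queries reach this line and cross the network.
    escalation = f"{prompt}\n\nA previous attempt failed validation:\n{draft}"
    return strong(escalation), True


# Stub models for demonstration.
result, escalated = cascade(
    "Return the config as JSON.",
    cheap=lambda p: "here is your config",
    strong=lambda p: '{"ok": true}',
    check=parses_as_json,
)
print(result, escalated)
```

Note that the gate never asks either model how confident it is; the decision rests entirely on the check.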

What leaves the device: only the failed queries — the prompt, the failed draft, and the test output. Successful queries, the majority, never cross the network.

Cascades save money, but they add latency on the escalated path and they need a reliable verifier. Pattern 3 inverts the relationship: instead of the cheap model failing up, it consults a stronger model as a tool.

Pattern 3: Advisor

A cheaper executor model drives the agent loop and consults a stronger model as a tool for hard sub-problems. The advisor returns a plan or correction; the executor continues generating output.

Aider's Architect/Editor mode was the open-source predecessor: a reasoning model plans; a fast, cheap editor emits diffs. Anthropic has since formalised the pattern as a hosted advisor tool (server-side sub-inference within a single API call). According to Anthropic's own numbers, a small model with a frontier advisor more than doubled its score on a hard agentic benchmark, at a fraction of the cost of running a mid-tier model alone.

One difference matters if data minimisation is why you're here: Anthropic's server-side advisor sees the full transcript. The hand-rolled variant below sends the advisor only the task and the draft — you control exactly what crosses the boundary.

One failure mode to design against: without an explicit "return unchanged if accepted" instruction, executors may invent edits even when the advisor approved the draft.
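A hand-rolled advisor loop could be sketched as follows. The `executor` and `advisor` callables, the prompt wording, and the `APPROVED` sentinel are assumptions, not Anthropic's or Aider's interface. This version sidesteps the invented-edits failure mode in code: when the advisor approves, the executor is not called again at all.

```python
from typing import Callable

APPROVED = "APPROVED"  # hypothetical sentinel the advisor is asked to return


def run_with_advisor(task: str,
                     context: str,
                     executor: Callable[[str], str],
                     advisor: Callable[[str], str]) -> str:
    # The on-device executor sees the full working context.
    draft = executor(f"Context:\n{context}\n\nTask:\n{task}")

    # Only the task and the draft cross the boundary, never the context.
    advice = advisor(
        f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
        f"Reply with exactly {APPROVED} if the draft is acceptable; "
        "otherwise reply with a short correction plan."
    )

    if advice.strip() == APPROVED:
        return draft  # return unchanged: no second executor call

    return executor(
        f"Context:\n{context}\n\nTask:\n{task}\n\nDraft:\n{draft}\n\n"
        f"Advisor feedback:\n{advice}\n\nRevise the draft accordingly."
    )


# Stub models for demonstration.
print(run_with_advisor("fix the bug", "file contents", lambda p: "draft-1", lambda p: APPROVED))
```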

What leaves the device: the task description and the draft, once per consultation. The full working context — files read, earlier turns, tool outputs — stays with the on-device executor.

The advisor pattern works well when only a fraction of turns need deep reasoning. But what if the problem isn't reasoning depth at all? What if it's a capability the on-device model simply lacks?

Pattern 4: Specialist

A generalist model consults a specialist for capabilities it lacks. Not just a stronger version of itself, but a different kind of model entirely: vision, mathematical reasoning, PII redaction.

The challenge is what context to forward to the specialist. Forward the full transcript? Summarise and carry forward? The summary handoff below is the most common production choice: lossy but bounded in cost, and most of the conversation stays on-device.
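A summary handoff could be sketched like this. The `summarise` callable stands in for the on-device generalist and `specialist` for a hosted vision model; both signatures and the prompt wording are assumptions for illustration.

```python
from typing import Callable, Sequence


def consult_specialist(transcript: Sequence[str],
                       trigger: str,
                       attachment: bytes,
                       summarise: Callable[[str], str],
                       specialist: Callable[[str, bytes], str]) -> str:
    """Hand off to a specialist with a bounded summary instead of the
    full transcript."""
    # Summarisation runs locally, so the transcript stays on-device.
    summary = summarise(
        "Summarise this conversation in 3-4 sentences for a colleague:\n"
        + "\n".join(transcript)
    )
    # Only the summary, the triggering message, and the attachment leave.
    request = f"Background:\n{summary}\n\nRequest:\n{trigger}"
    return specialist(request, attachment)


# Stub models for demonstration.
answer = consult_specialist(
    transcript=["user: my build fails", "assistant: which step?"],
    trigger="What does this screenshot show?",
    attachment=b"\x89PNG...",
    summarise=lambda p: "The user is debugging a failing build.",
    specialist=lambda req, img: "The screenshot shows a linker error.",
)
print(answer)
```

The summary is lossy, so the specialist may miss details buried in earlier turns; the upside is that the cost and the exposed data are bounded regardless of how long the conversation gets.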

What leaves the device: a 3–4 sentence summary, the triggering message, and the image. The transcript itself never crosses.

The specialist pattern becomes practical when your provider exposes multiple models behind one endpoint. One API key, one base URL, and the specialist is just another model name in the request — the marginal integration cost rounds to zero.

Specialist consultation is powerful, but passing context back and forth can be complex. Pattern 5 shrinks the remote role to a single verification pass.

Pattern 5: Draft-and-verify

The on-device model proposes a complete output — a tool plan, a code block, a draft response. A stronger model verifies it in a single forward pass rather than regenerating from scratch.

The verification task needs to do work that a deterministic check can't. "Is this valid JSON?" is the wrong job for a model — json.loads is free, which is exactly why it makes a good cascade gate in Pattern 2 but a wasteful verifier here. The pattern pays off when verification is genuinely hard. Think: does this tool plan accomplish the goal? Does the code change preserve invariants? The verifier returns a short structured judgement; only on rejection does the stronger model generate a correction.

Note how the data flow differs from the cascade. There, the stronger model is consulted only when a check fails; here, every draft crosses the network for verification. The saving is that the stronger model mostly judges rather than generates, and writes only on rejection. And because the verifier sees only the draft and the original request, never the full generation context, you can limit the data sent to the provider.
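As a sketch, the flow might look like this. The three callables, the verdict format, and the prompt wording are assumptions for illustration; in practice the verifier and the corrector would typically be the same stronger model. An unparseable verdict is treated as a rejection here, which is a design choice rather than a requirement of the pattern.

```python
import json
from typing import Callable

VERDICT_FORMAT = 'Reply with JSON only: {"accept": true or false, "reason": "<one sentence>"}'


def draft_and_verify(request: str,
                     drafter: Callable[[str], str],
                     verifier: Callable[[str], str],
                     corrector: Callable[[str], str]) -> str:
    draft = drafter(request)  # generated on-device

    # Every draft crosses the network, but only with the original request.
    raw = verifier(f"Request:\n{request}\n\nDraft:\n{draft}\n\n{VERDICT_FORMAT}")
    try:
        verdict = json.loads(raw)
        accepted = verdict.get("accept") is True
        reason = str(verdict.get("reason", ""))
    except (json.JSONDecodeError, AttributeError):
        accepted, reason = False, "verdict could not be parsed"

    if accepted:
        return draft
    # The stronger model writes only on rejection.
    return corrector(
        f"Request:\n{request}\n\nRejected draft:\n{draft}\n\nReason:\n{reason}"
    )


# Stub models for demonstration.
print(draft_and_verify(
    "Plan the migration.",
    drafter=lambda r: "1. back up 2. migrate",
    verifier=lambda p: '{"accept": true, "reason": "covers the goal"}',
    corrector=lambda p: "unused",
))
```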

What leaves the device: every draft and its original request — unlike the cascade, verification always crosses the network. On rejection, the corrected plan is generated remotely too.

The complexity is in choosing the verification task: if the verification is more costly than the generation, the savings disappear.

Draft-and-verify earns its keep in a narrow window: when verification is too hard for a deterministic check, but still cheaper than generating from scratch.

Summary of the five patterns

Pattern	When to reach for it	Main trade-off	What leaves the device
Feature-deterministic routing	You know the workload mix upfront	Static; overpays for straightforward queries	Everything sent to hosted features; nothing from on-device ones
Cascade	Cost-sensitive; verifiable tasks	Latency on escalated path	Only failed queries (the minority)
Advisor	Long-horizon agentic tasks	Streaming pauses during consultation	Draft + task description per consultation
Specialist	Generalist lacks a specific capability	Context-passing complexity	A context summary + the specialist's input
Draft-and-verify	Verifiable outputs; experimental	Verification task must be non-trivial	Every draft + the original request

Your first routing decision

You don't need five patterns on day one. Start with feature-deterministic routing. The most common starting point is one on-device model for what the hardware can handle and one hosted API for what it can't. Instrument per-model latency, cost, and success rate. If the on-device model does well, try an even smaller one until it starts struggling. A coding harness that lets you define agents with custom models, such as OpenCode, makes it easier to experiment.
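Instrumentation doesn't need a framework to get started. The sketch below keeps running totals per model in memory; the field names, the `record` helper, and the way cost is derived from a token count are assumptions, and you would swap in whatever token counts and prices your provider reports.

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModelStats:
    calls: int = 0
    successes: int = 0
    total_seconds: float = 0.0
    total_cost: float = 0.0

    @property
    def success_rate(self) -> float:
        return self.successes / self.calls if self.calls else 0.0

    @property
    def mean_latency(self) -> float:
        return self.total_seconds / self.calls if self.calls else 0.0


def record(stats: dict, model: str, seconds: float, tokens: int,
           price_per_million: float, success: bool) -> None:
    """Add one call's latency, cost, and outcome to the model's totals."""
    entry = stats[model]
    entry.calls += 1
    entry.successes += int(success)
    entry.total_seconds += seconds
    entry.total_cost += tokens / 1_000_000 * price_per_million


stats = defaultdict(ModelStats)

start = time.perf_counter()
# ... call the model here and run your success check ...
record(stats, "local-small", time.perf_counter() - start,
       tokens=1200, price_per_million=0.0, success=True)

for name, entry in stats.items():
    print(name, entry.calls, f"{entry.success_rate:.0%}", f"{entry.mean_latency:.3f}s")
```

Once the success rate for a model stays high, that is the signal to try the next smaller one.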

Some workloads genuinely require the largest available model. And with the continuous improvements in large language models, the boundaries will shift — what needs to be hosted today may run on-device tomorrow. The goal is intentional architecture, not purity.

If you want to read more about combining models, see Model chains in our developer docs.