Beyond local vs. cloud: five patterns for combining local and hosted open-model inference

Marcus Olsson

Developer Relations

Server rack with AMD MI300x accelerators in a modern data center

There's a debate that has calcified into two camps. On one side, the self-hosting purists: local inference is the only way to guarantee privacy, control costs, and avoid vendor lock-in. On the other, the API maximalists: hosted models are so far ahead of anything you can run locally that running smaller models yourself is a waste of engineering time. Both positions have a point, but neither is the architecture we'd recommend.

The case we want to make is simple. Combine them. Run small, fast open models locally for the work that benefits from being on-device. Route to larger open models — running on serverless infrastructure like Berget — for the work that genuinely needs them. The interesting architectural question is not whether to combine the two but how to wire the routing.

To make the recommendation concrete, it helps to think of model capability less as a set of tiers and more as a continuum tied to the hardware that runs it. At one end, small models — 4B to 14B parameters — run comfortably on a recent laptop. Larger ones, in the 30B–70B range, run on a workstation with a consumer-grade GPU and enough VRAM. The strongest open-weight models available today — Kimi K2.6, Qwen 3.7 Max, GLM 5.1, DeepSeek V4 Pro — need specialised infrastructure: multi-GPU rigs with substantial VRAM, the kind of hardware most teams don't want to own and operate. That's what we host for you on Berget.

The rest of this post is about the patterns that connect the two ends of that continuum.

Why combine at all

Four reasons, in roughly the order they tend to matter.

Compliance. GDPR Article 25 mandates data protection by design and by default. For marketing, HR, or clinical data, keeping personal information on localhost (or, when off-device, inside EU jurisdiction) is often the most direct way to meet that obligation, not just a preference. The EU AI Act adds further constraints on high-risk applications. A local model can process the sensitive context, with only anonymised or non-personal sub-tasks routed to a hosted endpoint.

Latency. A 7B-parameter model on recent Apple Silicon returns autocomplete suggestions in under 100 ms. A round-trip to a hosted endpoint is typically 300–2000 ms depending on geography and load. For keystroke-level features, local is not optional.

Cost. Local inference is amortised hardware. Hosted inference is marginal spend. But this isn't a universal win — a large model on rented dedicated GPU can cost more than serverless API calls at modest volume, and a small local model running constantly can cost more in electricity than the equivalent hosted calls would have. Cost is a real factor, not the headline argument.

Capability. The gap between what fits on local hardware and what runs on data-centre infrastructure is narrowing but real. Some reasoning tasks, long-context work, and complex tool use genuinely benefit from the larger open models. The architectural question is how to route the small fraction of work that needs them, without paying that cost for everything.
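
The cost point above can be made concrete with a back-of-the-envelope break-even calculation. The sketch below is illustrative only: every number in it (hardware price, power draw, electricity price, hosted price per million tokens) is a made-up placeholder, not a Berget price or a measured figure, and the `LocalSetup` and `break_even_tokens` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LocalSetup:
    hardware_cost: float        # purchase price, any currency (placeholder)
    lifetime_months: int        # amortisation period
    power_watts: float          # average draw while serving
    hours_per_month: float      # hours the machine spends serving
    electricity_per_kwh: float  # same currency as hardware_cost

    def monthly_cost(self) -> float:
        amortised = self.hardware_cost / self.lifetime_months
        energy_kwh = self.power_watts / 1000 * self.hours_per_month
        return amortised + energy_kwh * self.electricity_per_kwh


def break_even_tokens(local: LocalSetup, hosted_price_per_million: float) -> float:
    """Monthly token volume above which the local setup is cheaper than hosted calls."""
    return local.monthly_cost() / hosted_price_per_million * 1_000_000


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
workstation = LocalSetup(
    hardware_cost=3600.0,
    lifetime_months=36,
    power_watts=400.0,
    hours_per_month=200.0,
    electricity_per_kwh=0.25,
)
monthly = workstation.monthly_cost()           # 100 amortised + 20 energy
tokens = break_even_tokens(workstation, 0.60)  # hosted price is also a placeholder
print(f"local: {monthly:.2f}/month, break-even at {tokens / 1e6:.0f}M tokens/month")
```

Below the break-even volume the hosted calls are cheaper; above it the local hardware wins. The calculation ignores engineering time, which is often the larger cost in practice.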

Five patterns

We'll walk through five patterns for connecting local inference to hosted models. Feature-deterministic routing assigns tiers permanently by feature — autocomplete local, chat hosted. Cascade tries the cheap tier first and escalates only if a verifier fails. Advisor lets a cheaper executor consult a stronger model for hard sub-problems mid-task. Specialist consultation hands off to a model with a capability the generalist lacks — vision, long context, mathematics. Draft-and-verify has the cheap model propose a complete output and the stronger model verify it in a single pass. Each has a place; the question is which one fits your workload.

1. Feature-deterministic routing

The simplest and most common pattern: assign the local or hosted side permanently by feature or role. Autocomplete runs locally. Chat runs on Berget. The hardest reasoning tasks — architecture design, complex debugging — escalate to whichever larger model you've chosen for that role.

This is what Continue.dev does in its config.yaml: the autocomplete role points to a small local model running through Ollama; the chat role points to whatever provider you've configured; the edit role can point to a third. Zed splits similarly: Edit Prediction runs against a local model; the Agent Panel calls a hosted endpoint. Nextcloud Assistant wires each task type — summarise, translate, chat — to a separate provider app, with local llama.cpp as one option and an OpenAI-compatible hosted endpoint as another.

The advantage is that this is trivial to debug. There are no learned components, no calibration problems, no surprise escalations. The disadvantage is that it is static — a simple chat query still pays hosted-tier rates because the routing decision was made at feature-design time, not query time.

Figure: Feature-deterministic routing. The router assigns each feature to a fixed tier at configuration time.

from openai import OpenAI
import os

# Ollama exposes an OpenAI-compatible API; it ignores the key, but the client requires one.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
berget = OpenAI(base_url="https://api.berget.ai/v1", api_key=os.environ["BERGET_API_KEY"])

def route_by_feature(feature: str, prompt: str) -> str:
    # The tier is decided by the feature alone, never by the content of the prompt.
    if feature == "autocomplete":
        client, model = local, "qwen3-coder"
    else:
        client, model = berget, "qwen-3.7-max"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

2. Cascade / escalate-on-failure

Try the cheapest tier first. Escalate only if a verifier signals low confidence or an objective test fails.

FrugalGPT demonstrated this with a cascade that queries cheaper models first and falls back to GPT-4 only when needed, reporting up to 98% cost reduction while matching GPT-4's performance. AutoMix adds a learned router on top of self-verification: instead of hand-coding the escalation logic, it trains a classifier that predicts when the cheap model is likely to fail, and it reports over 50% computational cost reduction at similar performance. In production coding tools, the simplest version is retry-on-failure: a local model writes the code; if compilation or unit tests fail, the same task is re-sent to a stronger model.

The verifier matters more than the choice of cheap and expensive model. Options include self-consistency (generating multiple samples from the cheap model and measuring how much they agree, with high agreement serving as a proxy for "the model knows what it's doing"), few-shot self-evaluation by the same cheap model, or external validation like compilation or JSON-schema checking. What does not work is asking the model how confident it is. Multiple studies have shown that language models systematically overstate their confidence, and smaller models, the ones you would most want to self-diagnose, have the worst calibration. Build escalation logic on deterministic tests, trained classifiers, or consistency checks, not on self-reported confidence.

Figure: Cascade / escalate-on-failure. The cheap tier runs first; escalation is triggered only by a deterministic verifier.

import json

def generate_with_fallback(prompt: str) -> str:
    # First attempt: the cheap local model.
    response = local.chat.completions.create(
        model="qwen3-coder",
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content

    try:
        json.loads(result)  # deterministic verifier: the output must parse as JSON
    except (json.JSONDecodeError, TypeError):  # TypeError covers an empty (None) reply
        # Escalate the same prompt to the stronger hosted model.
        response = berget.chat.completions.create(
            model="qwen-3.7-max",
            messages=[{"role": "user", "content": prompt}],
        )
        result = response.choices[0].message.content

    return result
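
Self-consistency, mentioned above as an alternative verifier, can be sketched without any model-specific machinery. The example below assumes a hypothetical `sample` callable that returns one answer per call (in practice, a call to the local model with a temperature above zero) and uses exact match after normalisation as a crude stand-in for semantic agreement. The 0.6 threshold is an arbitrary placeholder to tune on your own data.

```python
from collections import Counter
from typing import Callable


def normalise(answer: str) -> str:
    # Crude normalisation; real systems often compare embeddings or parsed values.
    return " ".join(answer.lower().split())


def agreement(samples: list[str]) -> tuple[str, float]:
    """Return the most common normalised answer and the fraction of samples matching it."""
    counts = Counter(normalise(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def answer_or_escalate(
    prompt: str,
    sample: Callable[[str], str],    # hypothetical: one cheap-model sample per call
    escalate: Callable[[str], str],  # hypothetical: one call to the stronger model
    n: int = 5,
    threshold: float = 0.6,          # placeholder value, not a recommendation
) -> str:
    samples = [sample(prompt) for _ in range(n)]
    answer, score = agreement(samples)
    return answer if score >= threshold else escalate(prompt)


# Usage with stand-in functions instead of real model calls.
canned = iter(["42", " 42", "41", "42 ", "42"])
result = answer_or_escalate("6 * 7?", lambda p: next(canned), lambda p: "escalated")
print(result)
```

The price of this verifier is `n` local generations per request, which is why it tends to suit short answers better than long ones.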

3. Advisor (stronger-model assists)

A cheaper executor model drives the agent loop and consults a stronger model as a tool for hard sub-problems. The advisor sees the full context, returns a short plan or correction, and the executor continues generating the user-facing output.

Anthropic formalised this as a first-class API primitive in March 2026. Their Advisor Tool runs the consultation as a server-side sub-inference inside a single API call — no client-side orchestration required. Published benchmarks, which are first-party and worth validating on your own workload, show meaningful quality improvements at lower cost than running the strongest model end-to-end. The pattern itself is provider-agnostic — you can implement it with two Berget models (a smaller executor consulting Qwen 3.7 Max for hard sub-problems), or with a local executor consulting a hosted advisor.

The open-source predecessor is Aider's Architect/Editor mode, launched in September 2024. A reasoning model plans the change; a fast editor model emits the structured diffs. Pairing o1-preview as architect with DeepSeek as editor hit 85% on Aider's code-editing benchmark, then state of the art, at 30–50% lower cost than running the architect model end-to-end. Cline offers a similar Plan/Act split with optional different models per mode.

The trade-off is latency. The executor stream pauses while the advisor runs. This pattern pays off for long-horizon agentic tasks where only a small fraction of turns need the stronger model. It is a poor fit for single-turn Q&A or workloads where every turn needs the stronger model anyway.

Figure: Advisor (stronger-model assists). The executor drives the loop and consults the advisor only for hard sub-problems.

def advisor_pattern(prompt: str) -> str:
    # Simplified: this sketch consults the advisor on every call. In an agent loop,
    # the executor would ask for advice only on the sub-problems it finds hard.
    draft = local.chat.completions.create(
        model="qwen3-coder",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # The advisor sees the original task as well as the draft.
    advice = berget.chat.completions.create(
        model="qwen-3.7-max",
        messages=[{
            "role": "user",
            "content": f"Task:\n{prompt}\n\nReview this code and suggest fixes:\n\n{draft}",
        }],
    ).choices[0].message.content

    # The executor, not the advisor, produces the user-facing output.
    final = local.chat.completions.create(
        model="qwen3-coder",
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": f"Apply these corrections: {advice}"},
        ],
    ).choices[0].message.content

    return final

4. Specialist consultation

A generalist model consults a specialist for capabilities it lacks: long context, vision, mathematical reasoning, domain expertise.

The difference from the advisor pattern is the type of help, not just the strength. A small local generalist could hand off to a much larger Berget-hosted model when the request needs a 200k-token context window the local model can't physically hold, or to a vision-capable model when an image is involved, or to a math-tool-equipped model for symbolic work. The pattern is also recursive — a generalist coding agent might use a documentation subagent that searches docs in its own context window and returns a summary, rather than dumping the documentation into the main agent's already-full context.

The friction is context passing. Do you forward the full transcript? Summarise and carry forward? Use a shared scratchpad? Each option has quality and cost implications, and there is no universal right answer.

Figure: Specialist consultation. The generalist hands off to a specialist model after compressing the relevant context.

def specialist_consultation(prompt: str, image_url: str | None = None) -> str:
    if image_url is None:
        return local.chat.completions.create(
            model="qwen3-coder",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

    return berget.chat.completions.create(
        model="qwen-3.7-max-vision",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    ).choices[0].message.content
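
One of the context-passing options above, summarise and carry forward, might look like the following sketch. It keeps the most recent messages verbatim and replaces the older ones with a single summary message. The `summarise` callable is hypothetical (in practice it could be a call to the local model), and keeping four recent messages is an arbitrary placeholder.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}, as in the chat API


def compress_transcript(
    messages: list[Message],
    summarise: Callable[[str], str],  # hypothetical summariser, e.g. the local model
    keep_recent: int = 4,             # placeholder; tune per workload
) -> list[Message]:
    """Summarise all but the last `keep_recent` messages into one system message."""
    split = max(len(messages) - keep_recent, 0)
    if split == 0:
        return list(messages)  # short enough to forward as-is

    older, recent = messages[:split], messages[split:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = {
        "role": "system",
        "content": f"Summary of the earlier conversation:\n{summarise(transcript)}",
    }
    return [summary, *recent]


# Usage with a stand-in summariser instead of a real model call.
history = [{"role": "user", "content": f"message {i}"} for i in range(6)]
compact = compress_transcript(history, lambda text: "two earlier messages omitted")
print(len(history), "->", len(compact))
```

The specialist then receives `compact` instead of the full transcript. Whether the summary loses something the specialist needed is exactly the quality risk described above, so it is worth logging both versions while tuning.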

5. Draft-and-verify

The local model proposes a complete output — a tool plan, a code block, a draft response. A hosted model verifies it in a single forward pass rather than regenerating from scratch. This is the agent-level analogue of speculative decoding — the technique where a small model proposes several tokens at a time and a larger model accepts or rejects them in one pass, reducing the number of expensive forward passes.

Several research implementations exist — AutoMix and EcoAssistant both explore variants — but the pattern is not yet available as an off-the-shelf framework. It works best for verifiable outputs where a cheap yes-no check is sufficient: code that must compile, JSON that must validate, math that can be checked against a calculator. The complexity is in building the verifier. If verification requires nearly as much computation as generation, the savings disappear.

Figure: Draft-and-verify. The local model drafts a complete output; a stronger model verifies it in one forward pass.

def draft_and_verify(prompt: str) -> str:
    draft = local.chat.completions.create(
        model="qwen3-coder",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    verdict = berget.chat.completions.create(
        model="qwen-3.7-max",
        messages=[{
            "role": "user",
            "content": f"Is this valid JSON? Answer only YES or NO.\n\n{draft}",
        }],
    ).choices[0].message.content or ""

    # Accept the draft only on an explicit YES, whatever the casing or whitespace.
    if verdict.strip().upper().startswith("YES"):
        return draft
    # Otherwise fall back to the cascade pattern (Pattern 2), which starts again
    # from the local model and escalates if its output still fails to parse.
    return generate_with_fallback(prompt)

The five patterns at a glance

| Pattern | When to reach for it | Main trade-off | Concrete example |
| --- | --- | --- | --- |
| Feature-deterministic routing | You know the workload mix upfront | Static; overpays for easy queries | Continue.dev per-role config |
| Cascade / escalate-on-failure | Cost-sensitive; verifiable tasks | Latency on escalated path | FrugalGPT, retry-on-compile-failure |
| Advisor (stronger-model assists) | Long-horizon agentic tasks | Streaming pauses during consultation | Anthropic Advisor Tool, Aider Architect/Editor |
| Specialist consultation | Generalist lacks a specific capability | Context-passing complexity | Generalist → vision or long-context model |
| Draft-and-verify | Verifiable outputs; experimental | Complex to implement; not yet productised | Research implementations (AutoMix) |

Plumbing

Those five patterns need infrastructure to route between local and hosted inference. Here's what that looks like in practice with Berget.

For simple feature-routing or cascading, a gateway like LiteLLM is enough. It handles fallbacks, retries, cost tracking, and semantic caching. A two-tier fallback chain — local primary, Berget secondary — looks like this:

model_list:
  - model_name: local-coder   # primary: small model served by Ollama
    litellm_params:
      model: ollama/qwen3-coder
      api_base: http://localhost:11434
  - model_name: berget-max    # secondary: larger hosted model
    litellm_params:
      model: openai/qwen-3.7-max
      api_base: https://api.berget.ai/v1
      api_key: os.environ/BERGET_API_KEY

router_settings:
  fallbacks:
    - local-coder: [berget-max]   # requests for local-coder fall back to berget-max
  timeout: 30
  num_retries: 2

The exact syntax varies between LiteLLM versions; the shape is what matters. The same setup works with OpenRouter or Portkey if you prefer their tooling.

For the advisor pattern, you have two implementation options. The first is to do the orchestration in your own application code, making separate calls for the executor and the advisor on each turn and stitching the responses together. The example in Pattern 3 does this with a local executor and a Berget-hosted advisor; both roles can equally be Berget models. The second is to use a server-side primitive like Anthropic's Advisor Tool, which handles the consultation inside a single API call. We're investigating whether a similar primitive makes sense for Berget; for now the application-level pattern works fine.

There's also a vendor-neutral draft spec worth knowing about: MCP's sampling/createMessage lets an MCP server tool ask the client's LLM to run inference on its behalf, so tool servers can remain intelligent without their own API keys. It's still a draft, and Palo Alto Unit 42 has documented prompt-injection vectors through this path — sandbox it and require user approval per request if you adopt it.

Knowing when to escalate

The hardest part of a hybrid stack isn't writing the routing code. It's deciding when the cheap tier isn't good enough. There is no single robust way for a small model to know it is beyond its capabilities, and the published research is sobering.

Before walking through the methods, here's a concrete scenario to keep in mind. A coding agent receives the request: write a Python function that fetches data from an API endpoint with retry logic and exponential backoff. The local model has produced a draft. We need to decide whether to escalate that draft to Berget for a stronger model to redo or refine.

Three ways that decision could go:

  • Deterministic routing by task type. The router was configured ahead of time to send "generate a function with non-trivial control flow" tasks straight to Berget. The local model was never asked. Trivial to debug, no calibration problems — but easy tasks pay the same hosted cost as hard ones.
  • Self-verification and cascade. The local model writes the function. A verifier runs the code, checks that it imports cleanly, looks for a try/except block, and inspects whether time.sleep is called with an increasing argument. If any check fails, the same request goes to Berget. The verifier is the load-bearing piece — if it's lax, bad code slips through; if it's strict, you escalate too often.
  • Verbalised confidence. The local model is asked, after writing the function, "how confident are you in this code, on a scale of 1 to 10?" If the answer is below 8, escalate. This sounds reasonable and is the option teams reach for first. It also doesn't work. The model will answer 9 to nearly everything, including code that fails to parse, because language models are trained to sound confident regardless of whether they should be.
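
The verifier in the second option can be approximated with Python's standard `ast` module. The sketch below checks only three things: that the draft parses, that it contains a `try` block, and that it calls `sleep` somewhere inside a loop. It does not prove that the delay actually increases, and it never executes the draft. The function names and the choice of checks are illustrative assumptions, not a complete test of retry logic.

```python
import ast


def _is_sleep_call(node: ast.AST) -> bool:
    """True for an attribute call ending in .sleep (e.g. time.sleep) or a bare sleep(...)."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr == "sleep"
    return isinstance(func, ast.Name) and func.id == "sleep"


def verify_retry_draft(source: str) -> list[str]:
    """Return a list of failed checks; an empty list means the draft passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"does not parse: {exc.msg}"]

    failures = []
    if not any(isinstance(n, ast.Try) for n in ast.walk(tree)):
        failures.append("no try/except block")

    loops = [n for n in ast.walk(tree) if isinstance(n, (ast.For, ast.While))]
    if not any(_is_sleep_call(n) for loop in loops for n in ast.walk(loop)):
        failures.append("no sleep call inside a loop")
    return failures


# A hand-written draft standing in for the local model's output.
good_draft = '''
import time
import urllib.request

def fetch(url, retries=3):
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url).read()
        except OSError:
            time.sleep(2 ** attempt)
    raise RuntimeError("all retries failed")
'''
print(verify_retry_draft(good_draft))     # no failed checks
print(verify_retry_draft("def fetch(:"))  # reports a parse failure
```

Any non-empty result would trigger escalation. Tightening the checks (for example, running the draft against a stub server in a sandbox) reduces false passes at the price of more escalations, which is the trade-off the scenario describes.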

With the scenario in mind, here are the methods ranked by practical reliability:

  1. Deterministic routing by task type. The first option in the scenario above and the dominant pattern in shipped products. Trivial, debuggable, no calibration problems.
  2. Classifier-based pre-routing. A learned router, i.e. a small classifier trained on pairs of (query, which-model-answered-it-best). Projects like RouteLLM, Not Diamond, and Martian ship these. RouteLLM hits 95% of GPT-4 quality on MT-Bench with only 14–26% of queries escalated. Contemporary routers cluster within a narrow band of each other, though; most of the gain is from routing at all, not from the specific algorithm.
  3. Self-verification and cascade. The second option in the scenario above. The cheap model answers, an external check (compilation, schema validation, retrieval recall) evaluates the answer, and the request escalates on failure.
  4. Consistency-based uncertainty. Generate multiple samples from the cheap model and measure semantic agreement. Disagreement is a signal that the model doesn't have a confident answer. Expensive but model-agnostic.
  5. Logprob and entropy thresholds. The model returns the probability it assigned to each token it generated; high uncertainty at decision points is a signal to escalate. Works only when the local serving stack exposes logprobs, and post-training (RLHF and similar) tends to degrade the calibration of those probabilities.
  6. Self-reported confidence. The third option in the scenario above (verbalised confidence). The most attractive to reach for and the least reliable. Overconfidence is systematic and is worst on the smaller models you'd most want to self-diagnose.
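
The logprob-based method (item 5) reduces to a threshold on per-token probabilities. The sketch below assumes you already have the log-probability of each generated token as a list of floats; how you obtain them depends on your serving stack and is not shown. Both thresholds are arbitrary placeholders, and as noted above the underlying probabilities may be poorly calibrated.

```python
import math


def should_escalate(
    token_logprobs: list[float],
    min_token_prob: float = 0.2,     # placeholder threshold
    min_mean_logprob: float = -1.0,  # placeholder threshold
) -> bool:
    """Escalate if any single token was very unlikely or the average logprob is low."""
    if not token_logprobs:
        return True  # nothing was generated: treat as a failure
    lowest_prob = math.exp(min(token_logprobs))
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return lowest_prob < min_token_prob or mean_logprob < min_mean_logprob


confident = [-0.05, -0.10, -0.02]  # every token had probability above 0.9
hesitant = [-0.05, -2.30, -0.02]   # one token had probability of about 0.1
print(should_escalate(confident), should_escalate(hesitant))
```

Checking the single least likely token catches a confident-looking answer with one shaky decision point, which an average alone would hide.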

The boundary between what runs locally and what needs Berget will shift as open-weight models continue to improve. What requires the largest hosted models today may be handled by something that fits on a workstation next year. That isn't a failure of the local end of the continuum — it's just where the capability frontier currently sits, and the frontier moves.

Start simple

You do not need five patterns on day one. Start with feature-deterministic routing. Pick one local model and one Berget model. For most teams, that pairing is the entire architecture. Instrument everything — per-model latency, cost, and a lightweight quality signal like compilation success or human thumbs-up rate.
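
Instrumentation does not need a framework to begin with. The sketch below wraps any function that calls a model and records the tier, the latency, and whether a caller-supplied quality check passed. The `CallRecord` structure, the `instrumented` helper, and the in-memory `records` list are all hypothetical names for illustration; a real deployment would more likely send these values to whatever metrics system is already in place.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CallRecord:
    tier: str         # e.g. "local" or "hosted"
    latency_s: float  # wall-clock seconds
    passed: bool      # result of the lightweight quality check


records: list[CallRecord] = []


def instrumented(
    tier: str,
    call: Callable[[str], str],
    check: Callable[[str], bool],
) -> Callable[[str], str]:
    """Wrap `call` so that each invocation appends a CallRecord."""
    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        output = call(prompt)
        latency = time.perf_counter() - start
        records.append(CallRecord(tier, latency, check(output)))
        return output
    return wrapper


def pass_rate(tier: str) -> float:
    """Fraction of recorded calls for `tier` that passed the quality check."""
    relevant = [r for r in records if r.tier == tier]
    return sum(r.passed for r in relevant) / len(relevant) if relevant else 0.0


# Usage with a stand-in model function instead of a real client.
fake_local = instrumented("local", lambda p: p.upper(), check=lambda out: out.isupper())
fake_local("hello")
fake_local("123")  # "123" has no cased characters, so the check records a failure
print(pass_rate("local"))
```

Cost is left out of this sketch; with an OpenAI-compatible client, the token counts needed for it are typically available in the `usage` field of each response.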

Graduate to cascade or advisor when the data says the static split is missing opportunities. Add a learned router only if your own A/B testing shows it improves cost or quality by more than 10%. The architecture isn't fixed. The boundaries move.

Start with one pattern. Instrument it. Graduate when the data says so.