TL;DR
- Not every workload needs a frontier model. For most enterprise work, middle-tier models at 5-15% of the frontier cost deliver comparable quality.
- The decision runs on five axes: workload complexity, data-residency, latency, cost-at-volume, and governance posture.
- Open-weight models (Llama, Qwen, Gemma, Mistral) now rival paid options on specific categories and are the right default for regulated workloads.
- Paid frontier models (Claude Opus, GPT-5/4.x family, Gemini Ultra/Pro) earn their cost on genuinely hard reasoning, long-context analysis, and novel multimodal tasks.
- Most enterprise AI programs should run a model-router layer that picks per workload rather than standardizing on a single vendor.
The problem: one-model enterprise AI is expensive and brittle
Enterprise AI procurement has a default failure mode: pick one vendor, standardize on their frontier model, and run every workload through it. The result is predictable. Token costs balloon because classification tasks run on a reasoning model designed for complex analysis. Latency is inconsistent because the routing layer that could batch simple requests doesn't exist. Data-residency problems accumulate because regulated workloads end up on the same cloud API as general productivity workloads. When the chosen vendor raises prices, deprecates a model, or has a capacity outage, the entire AI program is exposed.
The enterprises getting AI economics right in 2026 run a different pattern: a model-router layer that classifies each request by workload type, data sensitivity, and required quality tier, then dispatches to the appropriate model. The model might be Claude for a complex analytical task, GPT for a general drafting task, a self-hosted Llama 3.3 70B for a regulated workload, or a smaller specialist model (Qwen 2.5 Coder 32B for code generation) for a narrowly-scoped task. The router is the enterprise AI architecture that actually scales.
This paper walks through the decision framework for that router.
Clearing up terminology first
Before the matrix, a few clarifications that catch many buyers off-guard.
Ollama is a serving platform, not a model. Ollama (along with vLLM, TensorRT-LLM, and TGI) is the operational layer that hosts and serves open-weight models on client infrastructure. When you read "we're running Ollama for AI," what the team actually means is "we're serving Llama 3.3 70B (or Qwen 2.5, or Gemma 2, or Mistral) via Ollama." The model is where the reasoning happens; Ollama is the runtime. Model selection and serving-platform selection are separate decisions.
Microsoft Copilot is a product, not a model. Microsoft Copilot (including Microsoft 365 Copilot, Copilot in Windows, Copilot in Dynamics, etc.) is built on the GPT family under the Azure OpenAI Service. When you pay for Copilot, you are paying for a Microsoft-branded experience layered over OpenAI models. The model underneath is a version of the GPT-4 or GPT-5 family. The Copilot-vs-ChatGPT decision is really about the surface and the integration, not the underlying model.
Meta AI is a product, the models are Llama. Meta's consumer AI product is named "Meta AI" and runs on Llama models. For enterprise deployments, the relevant integration is Llama (the open-weight model) via a serving platform (Ollama, vLLM) or via a hosted API (Meta's partner access, Azure AI Foundry, AWS Bedrock).
The five axes of model selection
1. Workload complexity
How hard is the reasoning task? Classification of a support ticket into one of eight categories is trivial. Drafting a boilerplate email is easy. Multi-step analytical reasoning across a 100-page document is hard. Novel problem-solving with tool use and planning is frontier territory.
Matching workload complexity to model capability is the single biggest lever. A frontier model running a classification task pays ~10× what a middle-tier model would cost for the same quality outcome.
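To make that lever concrete, a back-of-envelope sketch. The prices and token counts here are illustrative placeholders chosen from within the tier ranges later in this paper, not vendor list prices:

```python
# Illustrative arithmetic only: placeholder prices within the tier ranges
# discussed in this paper, not current vendor list prices.
def monthly_cost(requests, in_tokens, out_tokens, in_price, out_price):
    """Dollar cost for a month of traffic; prices are per million tokens."""
    return requests * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# A classification workload: 500K requests/month, short prompts, tiny outputs.
frontier = monthly_cost(500_000, 1_200, 20, in_price=3.00, out_price=15.00)
middle   = monthly_cost(500_000, 1_200, 20, in_price=0.30, out_price=1.50)

print(f"frontier: ${frontier:,.0f}/mo  middle: ${middle:,.0f}/mo  "
      f"ratio: {frontier / middle:.0f}x")
```

Under these assumed prices the same classification traffic costs ten times more on the frontier tier with no quality difference the workload can use.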
2. Data residency and regulation
Can the data legally and contractually flow to a cloud LLM vendor? For HIPAA-regulated health data, GLBA financial data, FedRAMP-scoped government workloads, or data under specific customer-contract restrictions, the answer is often no unless the vendor provides a BAA or equivalent compliance coverage. For those workloads, self-hosted open-weight models are not a preference — they're a precondition.
3. Latency
What's the response-time SLA? Interactive user-facing workloads need sub-second first-token. Batch workloads have more slack. Long-context deep-analysis tasks measure in tens of seconds. Self-hosted models can win on latency floor (no network round-trip) but often lose on throughput elasticity.
4. Cost at projected volume
What's the monthly token volume, and how does the cost scale? At low volume (under a few million tokens/month), cloud API pricing is unbeatable. At high volume (tens of millions of tokens/month and up), self-hosted economics start winning. The crossover depends on the specific model and GPU hardware.
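A rough break-even sketch, with placeholder figures; substitute your own vendor quotes and amortized GPU costs:

```python
# Placeholder figures only: substitute your own vendor quotes and GPU costs.
GPU_MONTHLY = 4_000.0   # assumed amortized self-hosted serving cost, $/month
CLOUD_PRICE = 6.50      # assumed blended cloud price, $/million tokens

def cloud_cost(tokens_per_month):
    """Cloud API bill for a month at the blended per-token price."""
    return tokens_per_month / 1e6 * CLOUD_PRICE

# Volume at which a flat GPU reservation matches the cloud bill.
break_even = GPU_MONTHLY / CLOUD_PRICE * 1e6   # tokens/month

print(f"break-even: {break_even / 1e6:,.0f}M tokens/month "
      f"(~{break_even / 30 / 1e6:.0f}M tokens/day)")
```

Under these assumed figures the crossover lands in the low tens of millions of tokens per day; the real number moves with the specific model, quantization, and GPU pricing.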
5. Governance posture
What audit, evaluation, and content-safety controls does the deployment need? Cloud vendors provide substantial baseline; enterprise-grade governance still requires client-side controls on top. Self-hosted deployments require the client to build the governance layer themselves (which many regulated clients prefer, because they own the posture end-to-end).
The matrix: workload × recommended model tier
The recommendations below assume mid-2026 model capabilities. Specific model versions evolve fast; the tier structure is the durable framework.
Tier A: Frontier (Claude Opus class, GPT-5 class, Gemini Ultra class)
Use when: the reasoning task genuinely exceeds middle-tier capability. Complex multi-step analysis. Novel problem-solving with tool use across many steps. Long-context deep analysis across 200K+ tokens. Agentic workflows with high-stakes action chains.
Typical cost: $3-15 per million input tokens, $15-75 per million output tokens.
Workloads that earn it: Legal-contract complex analysis. Strategic-research deep synthesis. Difficult code-generation requiring strong reasoning. Agentic workflows where tool-call reliability on novel situations is the bottleneck.
Tier B: Middle-tier cloud (Claude Sonnet, current GPT-4 family, Gemini Pro, Gemini Flash)
Use when: standard enterprise quality needed. Most drafting, summarization, RAG answering, moderately complex reasoning, classification with edge cases, straightforward agentic workflows.
Typical cost: $0.15-3 per million input tokens, $0.60-15 per million output tokens — 5-15× cheaper than frontier.
Workloads that earn it: The 60-70% of enterprise AI workloads. Customer service copilots. Knowledge-base RAG answering. Drafting assistants. Document extraction with validation. Most structured-output workloads.
Tier C: Self-hosted large open-weight (Llama 3.3 70B, Qwen 2.5 72B)
Use when: data residency, vendor independence, or cost-at-volume matters. Quality is comparable to Tier B cloud on most enterprise workloads.
Typical cost: amortized GPU cost. Break-even with cloud APIs at high volume (typically 20M+ tokens/day).
Workloads that earn it: Regulated workloads (HIPAA, GLBA, FedRAMP). High-volume production workloads where token economics dominate. Strategic data workloads where vendor independence is a design principle. Our TWSS Commercial Credit AI runs a 3-model ensemble in this tier.
Tier D: Self-hosted medium (Llama 3.1 8B, Qwen 2.5 32B, Gemma 2 27B, Mistral Small)
Use when: a narrower workload fits a smaller model's capability envelope and the cost-per-inference matters.
Typical cost: materially cheaper than Tier C to serve. Same infrastructure, larger batch sizes, lower memory footprint.
Workloads that earn it: Narrow classification. Domain-tuned extraction. Internal tool-use where the workload is scoped. Part of an ensemble where specific sub-tasks route here. Our Commercial Credit AI uses Gemma 27B specifically for narrative-analysis sub-tasks.
Tier E: Small specialized (Llama 3.2 3B, Qwen 2.5 Coder 7B, Gemma 2 9B, Phi 4)
Use when: the task is narrow, high-volume, or needs to run on edge devices.
Typical cost: runs on modest hardware. Often deployable on CPU for truly narrow tasks.
Workloads that earn it: Real-time classification. On-device inference. High-volume ETL-adjacent tasks. Code completion inside an IDE. Structured extraction from well-understood document types.
The model-router pattern in practice
A typical enterprise AI program in 2026 looks like this:
┌──────────────────────────────────────────────────────┐
│ Model Router (classification layer, ~5ms overhead)   │
├──────────────────────────────────────────────────────┤
│ Classify by:                                         │
│ - Workload complexity                                │
│ - Data sensitivity                                   │
│ - Latency requirement                                │
│ - Volume tier                                        │
│                                                      │
│ Route to:                                            │
│ - Frontier tier for hard reasoning (2-5% of traffic) │
│ - Middle-tier cloud for general work (60-70%)        │
│ - Self-hosted large for regulated (15-25%)           │
│ - Self-hosted small for narrow/high-volume (5-15%)   │
└──────────────────────────────────────────────────────┘
The router is a thin layer, often just a classification prompt plus routing logic. Its cost is negligible relative to the savings it delivers.
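A minimal sketch of that routing logic, with a rule-based classifier standing in for the classification prompt. Tier assignments, model identifiers, and field names are illustrative, not a reference implementation:

```python
from dataclasses import dataclass

# Minimal router sketch. The rule-based classification below is a stand-in
# for the cheap LLM classification call a production router would make.
# Model identifiers and tier assignments are illustrative.

@dataclass
class Request:
    text: str
    regulated: bool     # data-residency constraint on this request's data?
    complexity: str     # "trivial" | "standard" | "hard"

ROUTES = {
    "frontier":   "claude-opus",     # Tier A
    "middle":     "claude-sonnet",   # Tier B
    "self_large": "llama-3.3-70b",   # Tier C
    "self_small": "qwen-2.5-32b",    # Tier D
}

def route(req: Request) -> str:
    # Regulated data never leaves client infrastructure.
    if req.regulated:
        return ROUTES["self_large" if req.complexity == "hard" else "self_small"]
    if req.complexity == "hard":
        return ROUTES["frontier"]
    if req.complexity == "trivial":
        return ROUTES["self_small"]
    return ROUTES["middle"]
```

The key design property: every model identifier lives in one table, so a model deprecation or a pricing change is a one-line routing update rather than a workflow rewrite.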
Specific scenario guidance
Customer service case resolution on standard tickets. Tier B (Claude Sonnet or GPT-4 family) via cloud API, with PII redaction pre-send. Escalate to Tier A only for rare edge cases requiring deep analysis. Self-host (Tier C) if the client industry is regulated.
Email triage for shared inboxes. Tier D or E for initial classification, Tier B for draft generation. Batch classification workloads benefit massively from smaller models.
Commercial credit underwriting. Self-hosted ensemble — Tier C for primary reasoning, Tier D for specialized sub-tasks. Our TWSS Commercial Credit AI runs this pattern.
Legal contract deep analysis. Tier A for the hard reasoning, Tier B for extraction and summarization. The value per contract typically justifies the frontier-model cost for the analysis step.
RAG-grounded Q&A over internal docs. Tier B is the default. Escalate to Tier A only for complex multi-hop questions. Self-host (Tier C) for regulated content.
Code completion inside IDE. Tier D or E specialized code model (Qwen 2.5 Coder). Tier A is almost never right for this workload.
Agentic workflow with tool calls. Tier A for hard planning, Tier B for routine tool-use patterns. Claude's tool-calling reliability typically wins agentic evaluations today.
High-volume structured extraction. Tier D self-hosted. Cost-per-extraction dominates; middle-tier cloud often works but self-hosted wins economics at volume.
Multilingual enterprise content. Tier B (Gemini Pro or Claude Sonnet for breadth) or Tier C self-hosted (Qwen for strong non-English performance).
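The scenario guidance above can be captured as a routing table a model router consumes. The scenario keys and step names below are hypothetical labels that simply restate the guidance; the structure is the point:

```python
# Scenario-to-tier routing table (hypothetical keys restating the guidance
# above; tiers A-E as defined in the matrix section).
SCENARIO_ROUTES = {
    "support_ticket":      {"default": "B", "escalate": "A", "regulated": "C"},
    "email_triage":        {"classify": "D", "draft": "B"},
    "credit_underwriting": {"primary": "C", "subtasks": "D"},
    "contract_analysis":   {"reasoning": "A", "extraction": "B"},
    "rag_qa":              {"default": "B", "multi_hop": "A", "regulated": "C"},
    "ide_completion":      {"default": "E"},
    "agentic_workflow":    {"planning": "A", "routine": "B"},
    "bulk_extraction":     {"default": "D"},
    "multilingual":        {"default": "B", "self_hosted": "C"},
}
```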
Thoughtwave's default recommendation
For most enterprises starting an AI program, we recommend the following sequence:
- Audit workloads first. Before choosing any model, classify the target workloads by complexity, sensitivity, volume, and latency requirements.
- Run a scoped Tier B pilot. Use middle-tier cloud as the default first-deployment tier. Prove value before investing in frontier or self-hosted infrastructure.
- Add self-hosted (Tier C) for regulated workloads. Our TWSS Commercial Credit AI architecture — Ollama + 3-model ensemble, zero external API calls, MBE/GSA-ready — is the reference.
- Add frontier (Tier A) surgically. Reserve for workloads where middle-tier quality has been empirically evaluated and found insufficient. This is typically 2-5% of total AI traffic.
- Build the router. Once multiple models are in production, the router becomes the architectural backbone. Without it, you're paying frontier rates for classification tasks.
The durable principle
The AI-model landscape will continue evolving. Specific model versions will deprecate. New entrants will emerge. Pricing tiers will shift. What remains durable is the framework: match workload to model tier, pay frontier rates only where reasoning difficulty genuinely demands it, and build a router architecture that can absorb model change without workflow rewrite.
Our engagements build this from day one. The cost discipline compounds over years of AI program maturation.
For deeper context, see our cloud vs self-hosted LLM comparison, the AWS vs Azure vs GCP for AI comparison, and our enterprise AI cost structure insight. For reference implementations running the multi-tier pattern in production, see our Commercial Credit AI case study and the broader accelerators portfolio.