LLM deployment, done properly
Running an LLM in production is not calling an API in a loop. The decisions about infrastructure, observability, cost management, and operational posture that separate a working demo from a durable production system do not show up in the vendor quick-starts. Our LLM deployment services cover the full lifecycle — model selection, deployment architecture, observability, and ongoing operation.
What we cover
Model selection
A 1-2 week evaluation on the client's actual workload across 2-3 candidate models. The axes that drive the choice:
- Quality on the task, measured against a representative test set.
- Latency at target concurrency.
- Cost at projected volume.
- Data residency and vendor-risk posture.
- Capability fit (tool use, long context, multimodal, structured output).
The output is a specific recommendation with the evidence attached, not a leaderboard citation.
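To make the cost axis concrete, here is a back-of-envelope monthly estimate at projected volume. All prices and volumes below are illustrative placeholders, not vendor quotes — in a real evaluation they come from the candidate models' published pricing and the client's measured traffic.

```python
# Back-of-envelope monthly cost at projected volume.
# All figures below are illustrative assumptions, not real vendor pricing.

def monthly_cost(requests_per_day, in_tokens, out_tokens,
                 price_in_per_m, price_out_per_m, days=30):
    """Estimated monthly spend in dollars for one candidate model.

    price_in_per_m / price_out_per_m are dollars per million tokens.
    """
    daily = (in_tokens * price_in_per_m + out_tokens * price_out_per_m) \
            * requests_per_day / 1_000_000
    return daily * days

# Hypothetical workload: 50k requests/day, 1,500 input and 400 output tokens each.
candidate_a = monthly_cost(50_000, 1_500, 400, price_in_per_m=3.00, price_out_per_m=15.00)
candidate_b = monthly_cost(50_000, 1_500, 400, price_in_per_m=0.25, price_out_per_m=1.25)
print(f"Candidate A: ${candidate_a:,.0f}/month")
print(f"Candidate B: ${candidate_b:,.0f}/month")
```

An order-of-magnitude spread like this is common between frontier and mid-tier models, which is why the cost axis is measured rather than assumed.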
Deployment architecture
Cloud deployments. Direct API integration for low-volume or early-stage workloads. API-gateway pattern with per-client routing, caching, and audit for higher volume. BAA (Business Associate Agreement) and DPA (Data Processing Agreement) terms reviewed against the applicable compliance regime.
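The gateway pattern above reduces to a thin layer that resolves per-client routing and records an audit entry before forwarding. The client table and `audit_log` list here are illustrative stand-ins for a real config store and a durable audit sink:

```python
import time

# Illustrative per-client config; a real gateway would load this from a store.
CLIENTS = {
    "acme":   {"model": "frontier-model", "daily_token_budget": 2_000_000},
    "globex": {"model": "mid-tier-model", "daily_token_budget": 500_000},
}

audit_log = []  # stand-in for a durable audit sink

def handle(client_id: str, prompt: str) -> str:
    """Resolve the client's routing config and record an audit entry."""
    cfg = CLIENTS[client_id]
    audit_log.append({
        "ts": time.time(),
        "client": client_id,
        "model": cfg["model"],
        "prompt_chars": len(prompt),  # log size, not content, by default
    })
    return cfg["model"]  # a real gateway would forward the request here

model = handle("acme", "Summarize the quarterly filing.")
```

Logging request size rather than content by default keeps the audit trail useful without making the gateway itself a store of sensitive text.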
Self-hosted deployments. Ollama, vLLM, or TGI on client-owned GPUs (H100, H200, MI300 as appropriate). Multi-GPU for larger models. Orchestration behind FastAPI or equivalent. Model-weights management with versioned deployments.
Hybrid deployments. Router in front that classifies each request and sends it to the appropriate model based on content sensitivity. Captures the ease of cloud for non-sensitive work and the control of self-hosted for regulated work.
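The routing decision can be sketched as follows. The backend names and the regex-based sensitivity check are illustrative only — production classifiers are usually a tuned model or a rules-plus-model pipeline, not a couple of patterns:

```python
import re

# Illustrative backend identifiers; real deployments hold client/endpoint config.
CLOUD = "cloud-frontier-model"
SELF_HOSTED = "local-ollama-model"

# Toy sensitivity check, for illustration only.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # SSN-shaped numbers
    re.compile(r"\baccount\s+number\b", re.I),   # account references
]

def route(prompt: str) -> str:
    """Send sensitive content to self-hosted capacity, everything else to cloud."""
    if any(p.search(prompt) for p in SENSITIVE_PATTERNS):
        return SELF_HOSTED
    return CLOUD

print(route("Summarize this press release."))
print(route("Flag risks on account number 8841-22."))
```

The key design point is that the router fails closed: anything the classifier flags stays on client infrastructure, and only clearly non-sensitive traffic earns the cloud path.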
Observability and operations
Every production deployment includes:
- Token and cost metrics. Per-application, per-tenant, per-model.
- Latency distribution. Time-to-first-token and full-response percentiles.
- Confidence and grounding metrics for RAG workloads.
- Evaluation pipeline. Regression suites run on every prompt or model change.
- Drift monitoring. Alerts when production performance degrades versus baseline.
- Incident response. Runbooks, on-call, and rollback capability.
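As a sketch of the latency instrumentation, time-to-first-token and full-response time can be captured by wrapping any streaming generator — the `fake_stream` below stands in for whatever model client the deployment actually uses:

```python
import time

def timed_stream(stream):
    """Wrap a token stream, recording time-to-first-token and total latency."""
    start = time.monotonic()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token observed
        tokens.append(token)
    total = time.monotonic() - start
    # In production these would feed a histogram with per-model, per-tenant
    # labels; here we just return them alongside the assembled response.
    return "".join(tokens), {"ttft_s": ttft, "total_s": total}

# Simulated stream standing in for a real model client.
def fake_stream():
    for t in ["Hello", ", ", "world"]:
        time.sleep(0.01)
        yield t

text, metrics = timed_stream(fake_stream())
```

Tracking time-to-first-token separately matters because it drives perceived responsiveness in interactive workloads, while full-response time drives throughput planning.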
Cost optimization
The levers we tune for clients:
- Prompt engineering to reduce input tokens.
- Response constraints to reduce output tokens.
- Caching on repeated prompt prefixes.
- Model tiering — smaller models for simpler tasks, frontier models only where quality demands.
- For self-hosted, GPU utilization and batching optimization.
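The tiering and caching levers can be sketched together as a small dispatcher. The tier names, the length-based heuristic, and the placeholder inference call are all illustrative — real tiering classifiers are tuned on the client's actual traffic:

```python
from functools import lru_cache

SMALL = "small-8b-model"      # cheap tier (illustrative name)
FRONTIER = "frontier-model"   # expensive tier (illustrative name)

SIMPLE_TASKS = {"classify", "extract", "summarize-short"}

def pick_model(task: str, prompt: str) -> str:
    """Route short, simple tasks to the cheap tier; everything else up-tier."""
    if task in SIMPLE_TASKS and len(prompt) < 2_000:
        return SMALL
    return FRONTIER

calls = {"n": 0}  # visible call counter so the cache effect is observable

@lru_cache(maxsize=4096)
def cached_completion(model: str, prompt: str) -> str:
    """Exact-match response cache; repeated (model, prompt) pairs skip inference."""
    calls["n"] += 1
    return f"<{model} response>"  # stands in for the real inference call

m = pick_model("classify", "Is this invoice overdue?")
cached_completion(m, "Is this invoice overdue?")
cached_completion(m, "Is this invoice overdue?")  # second call served from cache
```

An exact-match cache like this only pays off on genuinely repeated requests; the prefix caching mentioned above operates at the serving layer (reusing KV-cache state for shared prompt prefixes) and is configured in the inference server rather than application code.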
For a full breakdown of where enterprise AI budgets actually go, see the real cost structure of enterprise AI.
Self-hosted reference deployment
TWSS Commercial Credit AI is the reference architecture we transfer into regulated client deployments:
- 3-model ensemble: Qwen 2.5, Gemma 27B, Llama 3.3 70B.
- Served via Ollama on client-owned GPU infrastructure.
- 14-service Docker Compose stack: Postgres, Redis, MinIO/S3, orchestration services, UI.
- Zero external API calls — 100% of inference happens on client infrastructure.
- Full audit per transaction; MBE and GSA procurement-ready.
See the case study for the detail.
When to self-host vs cloud
The decision framework lives in the case for self-hosted AI and the cloud vs self-hosted LLMs comparison. Short version: a regulatory, contractual, competitive, availability, or cost-at-scale constraint — any one of these can flip the answer toward self-hosted.
What deployment engagements typically look like
- 2-week model evaluation and architecture. Output: recommendation with evidence.
- 4-6 week initial deployment. First production workload live on the chosen architecture.
- Ongoing operation. Monthly review, cost and performance optimization, model-swap support when vendor landscape shifts.
Why Thoughtwave
- Vendor-neutral. Cloud, self-hosted, or hybrid — driven by the workload, not our partnerships.
- Production track record across a range of deployment shapes, documented in the accelerators portfolio.
- Full lifecycle coverage — not just the initial deploy but the ongoing operation that keeps the system working.
For broader context, see our AI & Generative AI service and the accelerators portfolio. To start a conversation, book a consultation.