Case study · retail

Internal agent platform for an enterprise operations team

How Thoughtwave stood up a governed multi-LLM agent platform for an enterprise ops team — sandboxed runtime, approval gates, full audit, zero vendor lock-in.

  • Days, not months: time to build a governed agent, per workflow
  • Zero vendor lock-in: multi-LLM by design
  • 100% audit coverage: every run logged
  • Supported LLMs: OpenAI, Claude, Gemini, local Llama (runtime-switchable)

Context

A mid-market enterprise operations team had a shopping-list problem: a dozen candidate AI workflows across customer operations, vendor management, internal reporting, and engineering ops. Building each one as a standalone project would have consumed 18 months of engineering time and produced a dozen fragmented stacks, each with its own security posture, its own observability, and its own single-vendor LLM dependency.

The client's engineering leadership wanted the opposite: a single platform every team could build agents on, with governance and audit baked in, and the flexibility to swap the underlying LLM as the vendor landscape evolved.

Challenge

The specific requirements that drove architecture:

  • Sandboxed runtime. Every agent had to run in an isolated environment with constrained tool access. No agent could touch anything not explicitly granted in its definition.
  • Approval gates for destructive actions. Sending emails, moving money, or writing to production systems required documented approval from a designated reviewer.
  • Multi-LLM with no application rewrite. Swapping OpenAI for Claude, or either for a local Llama, should not require touching agent code.
  • Full audit log per run. Every agent invocation — the input, the plan, each tool call and result, the final output, who approved what — captured and retrievable.
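The per-run audit requirement above can be pictured as a structured trace record. A minimal sketch, assuming nothing about the platform's real schema (the `RunTrace` and `ToolCall` names and fields here are illustrative):

```python
# Minimal sketch of a per-run audit trace. Class and field names are
# illustrative assumptions, not the platform's actual schema.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    tool: str
    args: dict
    result: str
    approved_by: Optional[str] = None  # set when an approval gate fired


@dataclass
class RunTrace:
    agent: str
    input: str
    plan: str
    calls: list = field(default_factory=list)
    output: str = ""


trace = RunTrace(agent="ops-triage",
                 input="Refund request #4411",
                 plan="classify -> route -> draft reply")
trace.calls.append(ToolCall("ticket_db.read", {"id": 4411}, "ok"))
trace.calls.append(ToolCall("email.send_external", {"to": "x@example.com"},
                            "sent", approved_by="ops-lead"))
trace.output = "Routed to billing; reply drafted"

# Everything the requirement names -- input, plan, each tool call and result,
# the final output, who approved what -- is retrievable from one record.
print([c.approved_by for c in trace.calls])  # [None, 'ops-lead']
```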

Approach

Thoughtwave deployed TWSS AI Custom Agents — our production agent platform — as the foundation. The architecture maps cleanly onto the four requirements:

  1. Agent SDK with sandboxed runtime. Agents are defined as code (Python) with a declared goal, tool list, and approval-gate configuration. The runtime executes each agent in a Docker sandbox with the declared tools and nothing else.
  2. Tool and data connector library. Pre-built connectors for Slack, web APIs, internal databases, file shares, and major LLMs. Client-specific tools are added via the SDK.
  3. Approval workflows. Destructive actions pause execution and route an approval prompt to Slack or the platform web UI. The agent resumes only after explicit approval, with the approver's identity and reason captured in the trace.
  4. Multi-LLM router. Each agent specifies its preferred model plus a fallback list. The router selects per-call based on availability, cost policy, or task-specific quality criteria.
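To make the shape of an agent definition concrete, here is a sketch along the lines the four points describe. The actual TWSS SDK API is not shown in this case study, so `AgentDefinition`, `ToolGrant`, and the field names are hypothetical; the point is the declarative goal / tool list / approval-gate / model-preference structure:

```python
# Illustrative sketch only: AgentDefinition and ToolGrant are hypothetical
# names standing in for the (unpublished) TWSS SDK types.

from dataclasses import dataclass


@dataclass(frozen=True)
class ToolGrant:
    name: str                        # tool the sandbox will expose
    requires_approval: bool = False  # pause and route to a reviewer first?


@dataclass(frozen=True)
class AgentDefinition:
    goal: str
    model: str                        # preferred model
    fallbacks: tuple = ()             # tried in order if preferred unavailable
    tools: tuple = ()                 # the ONLY tools the sandbox exposes

    def gated_tools(self):
        """Tools whose use pauses the run for explicit approval."""
        return [t.name for t in self.tools if t.requires_approval]


triage = AgentDefinition(
    goal="Classify incoming ops requests and draft a routed response",
    model="gpt-4o",
    fallbacks=("claude-sonnet", "llama3:8b"),
    tools=(
        ToolGrant("ticket_db.read"),
        ToolGrant("slack.post"),
        ToolGrant("email.send_external", requires_approval=True),
    ),
)

print(triage.gated_tools())  # ['email.send_external']
```

Anything not in `tools` simply does not exist inside the sandbox, which is how "no agent could touch anything not explicitly granted" falls out of the definition rather than a separate policy layer.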

The engagement arc:

  • Platform setup (3 weeks). Stood up the platform on the client's infrastructure, integrated Slack for approvals, connected the initial tool catalog.
  • First agent (2 weeks). Built an operations triage agent that classifies incoming ops requests, routes to the right team, and drafts a response — as the proof that the platform works end-to-end.
  • Agent factory (ongoing). Trained the client's engineering teams on the SDK; the platform now ships an average of one new agent every 1-2 weeks, each passing the same governance bar as the first.

What we built

The production platform has five components:

  1. Python agent SDK. Declarative agent definitions with typed tool interfaces and approval-gate configuration.
  2. Sandboxed Docker runtime. Per-run isolation; agents cannot escape declared tools or reach resources not in the sandbox.
  3. MCP tool protocol layer. Standard protocol for tool definitions, making tools reusable across agents and portable between this platform and other MCP-compatible frameworks.
  4. Multi-LLM router. Cloud (OpenAI, Claude, Gemini) + local (Ollama) with per-agent model preference.
  5. Slack and web entry points. Agents can be triggered from Slack commands, from the web UI, from scheduled jobs, or from webhook events.
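The per-call selection logic of component 4 can be sketched as a simple fallback scan. The provider names and the availability check below are illustrative assumptions, not the router's real interface:

```python
# Sketch of per-call model selection with fallback, as described above.
# Model names and the availability predicate are illustrative assumptions.

from typing import Callable, List


def pick_model(preferred: str,
               fallbacks: List[str],
               is_available: Callable[[str], bool]) -> str:
    """Return the first available model, preferred first; raise if none."""
    for model in [preferred, *fallbacks]:
        if is_available(model):
            return model
    raise RuntimeError("no configured model is available")


# Example: both cloud providers are down, so the router falls back to a
# local Llama served via Ollama.
up = {"gpt-4o": False, "claude-sonnet": False, "llama3:8b": True}
chosen = pick_model("gpt-4o", ["claude-sonnet", "llama3:8b"], lambda m: up[m])
print(chosen)  # llama3:8b
```

A production router would extend the predicate to cost policy and task-specific quality criteria, but the key property is visible here: the agent's code never names a provider API, so swapping models is data, not a rewrite.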

Outcomes

  • Build agents in days, not months. The first agent took two weeks; agents 2 through 10 have shipped in an average of 3-5 days each because the platform components carry over.
  • Governed, auditable enterprise deployment. Every run captured. Every destructive action gated. Every approval logged.
  • Multi-LLM portability. When the client's AI policy shifted mid-deployment, a model swap was a configuration change, not an engineering rewrite.
  • Zero vendor lock-in. The platform runs on client infrastructure; the agent definitions are client-owned code; the model layer is switchable.

What's next

The next phase extends the platform with automated evaluation: every production agent run feeds a regression suite, and model or prompt changes are tested against the full history before deployment. The client is also standing up a cross-team agent registry so other business units can adopt proven agent patterns without rebuilding.

For the broader portfolio of Thoughtwave production AI solutions that run on this and related platforms, see our accelerators portfolio.

Why a platform beats per-workflow tooling at scale

Most enterprises we engage with can name five candidate AI workflows. Some can name fifteen. Very few can name just one. The question is not whether to adopt AI agents — it is whether to adopt them one at a time (each as its own project, with its own security review, its own observability setup, its own vendor relationship) or on a platform that ships the first agent in weeks and every subsequent agent in days. The math favors the platform once the agent count reaches three or four, and it dominates once the agent count reaches ten.

The harder truth is that governance cannot be bolted on after the fact. Agents that read internal data and call external systems have an audit and security posture that has to exist from the first deployment. Retrofitting a sandbox, an approval gate, and a trace log onto ten in-production agents is an order of magnitude more work than building them into the platform once. The clients that wait on governance often have to halt their agent programs after the first security incident — not because the incident was catastrophic, but because there was no framework in place to decide how serious it was.

Frequently asked questions

What makes this different from buying a point-solution agent?
Point-solution agents solve one workflow; they do not compound. A platform approach lets the client ship agent #2, #3, and #20 against the same governance, observability, and approval controls. The first agent takes longer; every subsequent agent ships in days, not weeks. For an enterprise that expects to run many agents, this is the structurally better bet.
Why multi-LLM at the platform level?
Model quality, cost, and policy fit change quarter by quarter. Hard-wiring a single LLM at the platform layer means every agent has to be ported when the model changes. TWSS AI Custom Agents treats the model as a runtime-switchable component: OpenAI, Claude, Gemini, or a local Llama via Ollama — picked per agent, per task, or per policy.
How are destructive actions handled?
Every agent definition declares which actions require approval before execution. For consequential actions (send external email, money movement, irreversible data change), the platform pauses, surfaces the proposed action to a designated approver in Slack or the web UI, and only proceeds on explicit sign-off.
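The pause-and-resume flow that answer describes can be sketched as follows; the `Approval` record and the reviewer callback are illustrative stand-ins for the platform's Slack / web UI prompt, not its real API:

```python
# Sketch of the pause-for-approval flow described above. The Approval record
# and reviewer callback are illustrative, not the platform's real interface.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Approval:
    approver: str
    approved: bool
    reason: str


def run_gated_action(action: str,
                     payload: dict,
                     ask_reviewer: Callable[[str, dict], Approval],
                     execute: Callable[[dict], str]) -> str:
    """Pause, surface the proposed action, proceed only on explicit sign-off."""
    approval = ask_reviewer(action, payload)  # e.g. a Slack approval prompt
    if not approval.approved:
        return f"blocked by {approval.approver}: {approval.reason}"
    # In the platform, approver identity and reason land in the run's trace.
    return execute(payload)


result = run_gated_action(
    "email.send_external",
    {"to": "vendor@example.com", "body": "PO attached"},
    ask_reviewer=lambda a, p: Approval("ops-lead", True, "routine vendor mail"),
    execute=lambda p: f"sent to {p['to']}",
)
print(result)  # sent to vendor@example.com
```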
Does this compete with LangGraph or AutoGen?
It complements them. You can run a LangGraph or AutoGen workflow inside the TWSS platform when that is the right tool for the agent logic. What the platform adds is the enterprise-grade layer those frameworks do not: scoped sandboxing, approval gates, multi-LLM routing, and full audit across every run.

Related resources

Ramesh Thumu

Founder & President, Thoughtwave Software

Reviewed by Thoughtwave Editorial

Last updated April 22, 2026