
What is a large language model (LLM)?

TL;DR

A large language model is a neural network trained on massive text corpora to predict the next token given the preceding context. That simple objective — predict the next token — produces models that can write, summarize, reason, answer questions, call tools, and generate code. Modern LLMs have billions to hundreds of billions of parameters and are accessed via API (OpenAI, Anthropic, Google) or self-hosted (Llama, Mistral, Qwen). The enterprise question is not which LLM is best in a leaderboard; it is which LLM's cost, quality, residency, and governance posture fits the specific workflow.

The short version

  • An LLM is a neural network trained to predict the next token given prior context.
  • Modern LLMs have billions to hundreds of billions of parameters and emerged from scaling the transformer architecture.
  • Enterprises consume LLMs via API or self-host; the choice is driven by data residency, cost, and governance.

The longer explanation

The training objective

LLMs train on the simplest imaginable objective: given a sequence of tokens, predict the next one. Scale that objective across trillions of tokens of training data and millions of GPU-hours, and the resulting model has internalized grammar, facts, reasoning patterns, and stylistic conventions well enough to respond coherently to prompts it has never seen.
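A toy illustration of the objective, using nothing beyond the standard library: a bigram counter that "predicts" the next token from observed frequencies. Real LLMs replace the count table with a transformer and billions of parameters, but the interface is the same — context in, distribution over next tokens out.

```python
from collections import Counter, defaultdict

# Toy corpus; real models train on trillions of tokens.
corpus = "the model predicts the next token and the next token after that".split()

# Count how often each token follows each context token (bigram statistics).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_token_distribution(context_token):
    """Return P(next | context) as a dict — the same interface an LLM exposes."""
    counts = following[context_token]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(next_token_distribution("the"))  # → {'model': 0.333..., 'next': 0.666...}
```

An LLM's advantage over this toy is generalization: it produces a sensible distribution even for contexts it never saw verbatim during training.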

Most enterprise LLMs descend from one architectural family: the decoder-only transformer. The architectural details have evolved (sparse attention, mixture-of-experts, long-context extensions), but the core objective has not.
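What makes a transformer "decoder-only" is the causal mask: each position may attend only to itself and earlier positions, which is exactly what next-token prediction requires. A minimal sketch of that masking step (just the softmax-with-mask, not a full attention layer):

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over raw attention scores with a causal mask:
    position i may attend only to positions j <= i."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)           # future positions: -inf
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

weights = causal_attention_weights(np.zeros((4, 4)))
# Row i is uniform over positions 0..i; future positions get exactly zero weight.
print(weights.round(2))
```

The mask is why the same model can be trained in parallel over a whole sequence yet generate strictly left to right at inference time.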

How enterprises consume LLMs

Two paths dominate:

API-hosted: OpenAI, Anthropic, Google, and others expose LLMs as cloud APIs. The upside is zero infrastructure work and instant access to the latest models. The downside is that data leaves the client environment, and the vendor's pricing, rate limits, and roadmap are outside the client's control. For most enterprise generative AI today, this is the starting point.

Self-hosted: Open-weight models (Llama from Meta, Mistral, Qwen, Gemma from Google DeepMind) run on client-owned infrastructure, typically via Ollama, vLLM, or TGI. The trade-off is more operational work in exchange for full data control. Regulated industries — banking, healthcare, government — often require this path for production workloads.
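As a sketch of the self-hosted path: Ollama exposes a local HTTP API (by default on `localhost:11434`), and a generation request is a small JSON payload. The model tag below is an assumption — substitute whatever open weights you have pulled.

```python
import json

# Hypothetical request against a local Ollama server; the endpoint and field
# names follow Ollama's /api/generate API, the model tag is an assumption.
payload = {
    "model": "llama3.3:70b",  # whichever open-weight model you pulled
    "prompt": "Summarize this credit memo in three bullet points.",
    "stream": False,          # return one JSON object instead of a token stream
}

# The actual call would be:
#   POST http://localhost:11434/api/generate
# e.g. via urllib.request or the `requests` library; the response JSON carries
# the generated text under the "response" key. No data leaves the environment.
print(json.dumps(payload, indent=2))
```

The same payload shape works whether the GPUs are on-premises or in a client-controlled VPC, which is what makes this the default pattern for regulated workloads.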

The capabilities that matter in 2026

  • Long context. Top models handle 200K-2M token contexts, enabling document-in, analysis-out workflows without chunking.
  • Tool use and function calling. LLMs can call external APIs as part of their response, which is the foundation for agentic workflows.
  • Structured output. Models can return JSON or other structured formats reliably, enabling integration with downstream systems.
  • Multimodal. Top models handle text plus images, and increasingly audio and video.
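Tool use reduces to a simple contract: the model emits a structured call (a function name plus JSON arguments) and the application dispatches it, feeding the result back to the model. A minimal sketch, with a hypothetical `get_fx_rate` tool and a hand-written model response standing in for real API output (the exact response shape varies by vendor):

```python
import json

# Hypothetical tool the application exposes to the model.
def get_fx_rate(base: str, quote: str) -> float:
    rates = {("USD", "EUR"): 0.92}  # stub data for illustration
    return rates[(base, quote)]

TOOLS = {"get_fx_rate": get_fx_rate}

# Stand-in for a model response: vendors return tool calls in roughly this
# shape — a function name plus JSON-encoded arguments.
model_output = {
    "name": "get_fx_rate",
    "arguments": json.dumps({"base": "USD", "quote": "EUR"}),
}

def dispatch(call):
    """Route a structured tool call from the model to application code."""
    fn = TOOLS[call["name"]]
    return fn(**json.loads(call["arguments"]))

print(dispatch(model_output))  # → 0.92, which would be returned to the model
```

Agentic workflows are this loop repeated: model proposes a call, application executes it, result goes back into the context, model decides the next step.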

Choosing a model

The engineering decision is not "which is best overall" — it is "which fits this workload". The evaluation axes we use in client engagements:

  • Quality on the actual task. Run a scoped eval on the client's real data.
  • Latency. Interactive workflows need sub-second response; batch workflows have more slack.
  • Cost at projected volume. Frontier model pricing can swing total cost 10x; a right-sized smaller model often wins.
  • Data residency and governance. If data cannot leave the environment, the answer is self-hosted.
  • Vendor stability. Roadmap, pricing history, deprecation posture.
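The "quality on the actual task" axis usually reduces to a small harness: run each candidate model over a labeled sample of real data and score the outputs. A minimal sketch with stubbed model functions standing in for API or local-model calls (the credit-decision task and both stubs are invented for illustration):

```python
# Stubbed candidates; in practice each would wrap an API or local model call.
def model_a(doc):
    return "approve" if "strong cash flow" in doc else "decline"

def model_b(doc):
    return "approve"  # a degenerate baseline that approves everything

# A scoped eval set: real client examples paired with expected answers.
eval_set = [
    ("Applicant shows strong cash flow and low leverage.", "approve"),
    ("Thin file, irregular income, prior defaults.", "decline"),
]

def accuracy(model):
    hits = sum(model(doc) == expected for doc, expected in eval_set)
    return hits / len(eval_set)

scores = {name: accuracy(fn) for name, fn in [("A", model_a), ("B", model_b)]}
print(scores)  # → {'A': 1.0, 'B': 0.5}
```

Even a few dozen labeled examples scored this way is far more informative than a public leaderboard, because it measures the candidates on the workload that will actually run.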

How Thoughtwave approaches this

We are model-neutral across OpenAI, Anthropic, Google, Meta, Mistral, and Qwen. Our engagements typically run a 2-3 model evaluation on the client's workload in the first two weeks and make the selection with the client based on the axes above. For production workloads with data-residency constraints, we deploy on client infrastructure via Ollama or vLLM — the pattern behind our self-hosted TWSS Commercial Credit AI platform.

For the broader context, see our AI & Generative AI service and the accelerators portfolio.

Frequently asked questions

How does an LLM actually work?
An LLM is trained to predict the probability distribution over the next token given the previous tokens. At inference time, the model is given a prompt and generates output one token at a time by sampling from that distribution. The rich structure it learns during training — grammar, facts, reasoning patterns, stylistic tendencies — emerges from the single objective of next-token prediction at massive scale.
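The "sampling from that distribution" step can be made concrete: a temperature parameter rescales the model's next-token probabilities before drawing. Low temperature sharpens the distribution toward the most likely token; high temperature flattens it. A sketch with a hand-written distribution standing in for real model output:

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a next-token distribution as p_i^(1/T), renormalized."""
    weights = {tok: math.pow(p, 1.0 / temperature) for tok, p in probs.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.2}  # pretend model output

sharp = apply_temperature(probs, 0.2)  # near-greedy: mass piles onto "the"
flat = apply_temperature(probs, 5.0)   # near-uniform: more diverse output

token = random.choices(list(sharp), weights=list(sharp.values()))[0]
print(sharp["the"], flat["the"], token)
```

Generation is this draw repeated: the chosen token is appended to the context and the model produces a fresh distribution for the next position.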
Which LLMs should an enterprise evaluate?
The answer depends on workload. For breadth and quality, current Claude, GPT, and Gemini releases are the standard cloud options. For self-hosted deployments, Llama (Meta), Mistral, and Qwen are the leading open-weight families. For specialized domains (code, math, structured output), domain-tuned variants matter. We recommend running a scoped eval against 2-3 candidates on the client's actual workload before committing.
What are the cost levers?
Input token count (prompt length and retrieved context), output token count (response length), model tier (frontier models cost 10-30x more than smaller variants), prompt caching (repeated prompt prefixes can be billed at a discount), and batching. For high-volume workflows, a right-sized smaller model with targeted prompt engineering often beats the frontier model on total cost.
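These levers combine multiplicatively, so a back-of-envelope cost model is worth writing down. A sketch with illustrative numbers — the per-million-token prices, cache discount, and volumes below are assumptions, not any vendor's actual pricing:

```python
def monthly_cost(requests, in_tokens, out_tokens,
                 in_price_per_m, out_price_per_m,
                 cached_fraction=0.0, cache_discount=0.5):
    """Estimate monthly spend. Prices are per million tokens; cached input
    tokens are billed at a discount (all figures here are assumptions)."""
    in_cost = requests * in_tokens * in_price_per_m / 1e6
    in_cost *= (1 - cached_fraction) + cached_fraction * cache_discount
    out_cost = requests * out_tokens * out_price_per_m / 1e6
    return in_cost + out_cost

# Illustrative comparison at 1M requests/month, 2K in / 500 out tokens each,
# with 60% of input tokens hitting the prompt cache.
frontier = monthly_cost(1_000_000, 2_000, 500, in_price_per_m=3.00,
                        out_price_per_m=15.00, cached_fraction=0.6)
small = monthly_cost(1_000_000, 2_000, 500, in_price_per_m=0.25,
                     out_price_per_m=1.25, cached_fraction=0.6)
print(round(frontier), round(small))  # → 11700 975, a ~12x spread
```

The point is not the specific numbers but the shape: at volume, model tier dominates, which is why the eval step above matters — if the smaller model passes the quality bar, the savings are an order of magnitude.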
Should we self-host an LLM?
Self-host when data residency, compliance, or deterministic control requires it. The pattern is mature: Ollama or vLLM on client GPUs, with Llama, Mistral, or Qwen weights. Our TWSS Commercial Credit AI runs a 3-model ensemble fully self-hosted (Qwen 2.5, Gemma 27B, Llama 3.3 70B) specifically to meet zero-external-API constraints for regulated lending.

Ramesh Thumu

Founder & President, Thoughtwave Software

Reviewed by Thoughtwave Editorial

Last updated April 22, 2026