LLM Integration Patterns That Survive Production
RAG, agents, observability, cost controls and guardrails: the architectural choices that separate a working demo from a reliable enterprise LLM system.
Most enterprise LLM projects don't fail because the model is too weak. They fail because teams ship a notebook-grade prototype straight into production, then discover that hallucinations, runaway token bills, and silent regressions are not edge cases — they are the default state.
Two years of shipping GenAI features for regulated industries point to a small set of patterns that consistently separates the systems that survive from the ones that get rolled back. Here they are.
1. RAG: stop treating it as a search problem
The naive RAG pipeline — embed everything, top-k cosine similarity, stuff into the prompt — works for demos and breaks at scale. The bottleneck is rarely the vector store; it's chunking strategy, retrieval quality, and reranking.
What actually moves the needle in 2025:
- Hybrid retrieval: combine BM25 (via OpenSearch or Elastic) with dense vectors. Pure semantic search misses exact identifiers, error codes, SKUs.
- Reranking: a cross-encoder like Cohere Rerank 3 or bge-reranker-v2-m3 on the top 50 candidates, returning the top 5. Expect a 15–30% jump in answer relevance.
- Structured chunking: parse documents with tools like Unstructured.io or LlamaParse, preserving headings, tables, and section hierarchy as metadata. Then filter retrieval by metadata, not just similarity.
- Query rewriting: use a cheap model (GPT-4o-mini, Claude Haiku) to expand or decompose the user query before retrieval.
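As an illustration of the hybrid retrieval bullet, here is a minimal sketch of one common way to merge a BM25 ranking with a dense-vector ranking: reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so RRF is an assumption here, and the document IDs and the `k=60` constant are illustrative rather than tuned values.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results: BM25 matches the exact error code, dense search does not.
bm25_hits = ["doc_err_4012", "doc_faq", "doc_setup"]
dense_hits = ["doc_faq", "doc_troubleshooting", "doc_setup"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, while the exact-identifier match that only BM25 found stays in the candidate pool for the reranker instead of being dropped.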
If you can't measure retrieval precision and recall on a labeled set, you don't have a RAG system — you have a hope.
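Measuring retrieval on a labeled set needs very little machinery. The sketch below computes precision@k and recall@k, assuming each labeled example pairs the retriever's ranked document IDs with a hand-labeled set of relevant IDs; the two examples are made up for illustration.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one labeled query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical labeled set: ranked retriever output plus human-labeled relevant IDs.
labeled_set = [
    {"retrieved": ["a", "b", "c", "d", "e"], "relevant": {"a", "c"}},
    {"retrieved": ["x", "y", "z", "w", "v"], "relevant": {"q", "y", "z", "m"}},
]

k = 5
scores = [precision_recall_at_k(ex["retrieved"], ex["relevant"], k) for ex in labeled_set]
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
print(f"precision@{k}={mean_precision:.2f} recall@{k}={mean_recall:.2f}")
```

Tracking both numbers matters: a reranker that keeps only the top 5 can raise precision while hiding a recall problem in the first-stage retriever.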
2. Agents: use them sparingly, and bound them hard
Multi-agent frameworks (LangGraph, CrewAI, AutoGen) are seductive. They are also where projects burn the most tokens for the least business value. Before adopting an agentic pattern, ask: can this be solved with a deterministic workflow plus one or two LLM calls?
When agents genuinely help — open-ended research, code generation, complex tool orchestration — enforce these constraints:
- Hard step limits (max 8–12 iterations) with circuit breakers.
- Explicit tool schemas using JSON Schema or Pydantic, validated on every call.
- State machines, not free-form loops. LangGraph's graph model beats ReAct loops for predictability.
- Human-in-the-loop checkpoints for any irreversible action (write to DB, send email, execute code).
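The constraints above can be sketched in plain Python. This is a minimal illustration, not any framework's API: `Tool`, `run_agent`, `validate_args`, and the scripted planner are hypothetical names, the planner stands in for the LLM call, and the type check stands in for a real JSON Schema or Pydantic validation. It covers the step limit, argument validation, and approval gate only; it does not model a full state machine.

```python
from dataclasses import dataclass
from typing import Any, Callable

MAX_STEPS = 8  # lower end of the 8-12 range suggested above

class StepLimitExceeded(RuntimeError):
    """Raised when the agent does not finish within its step budget."""

@dataclass
class Tool:
    name: str
    func: Callable[..., Any]
    arg_types: dict             # argument name -> expected Python type
    irreversible: bool = False  # irreversible tools need human approval

def validate_args(tool: Tool, args: dict) -> None:
    """Reject calls whose arguments do not match the declared schema."""
    mismatch = set(args) ^ set(tool.arg_types)
    if mismatch:
        raise ValueError(f"{tool.name}: unexpected or missing args {sorted(mismatch)}")
    for name, expected in tool.arg_types.items():
        if not isinstance(args[name], expected):
            raise TypeError(f"{tool.name}.{name} must be {expected.__name__}")

def run_agent(plan_next_step, tools, approve, max_steps=MAX_STEPS):
    """Run a bounded loop; plan_next_step stands in for the LLM."""
    history = []
    for _ in range(max_steps):
        action = plan_next_step(history)
        if action["type"] == "finish":
            return action["answer"], history
        tool = tools[action["tool"]]
        validate_args(tool, action["args"])
        if tool.irreversible and not approve(tool.name, action["args"]):
            history.append((tool.name, "rejected by reviewer"))
            continue
        history.append((tool.name, tool.func(**action["args"])))
    raise StepLimitExceeded(f"no final answer after {max_steps} steps")

def scripted_planner(history):
    # Hypothetical stand-in for an LLM: look up a customer, try to email, finish.
    if not history:
        return {"type": "tool", "tool": "lookup", "args": {"customer_id": 42}}
    if len(history) == 1:
        return {"type": "tool", "tool": "send_email", "args": {"to": "a@example.com"}}
    return {"type": "finish", "answer": "done"}

tools = {
    "lookup": Tool("lookup", lambda customer_id: {"id": customer_id}, {"customer_id": int}),
    "send_email": Tool("send_email", lambda to: "sent", {"to": str}, irreversible=True),
}

answer, history = run_agent(scripted_planner, tools, approve=lambda name, args: False)
print(answer, history)
```

Because the reviewer callback rejects the email, the rejection is recorded in the history and the agent moves on instead of sending anything. A planner that never returns a `finish` action hits `StepLimitExceeded` rather than looping indefinitely.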
3. Observability is non-negotiable
You cannot debug an LLM system with print() statements. You need traces, evaluations, and replay.
Minimum stack:
| Concern | Tool options |
|---|---|
| Tracing | Langfuse, LangSmith, Arize Phoenix, OpenLLMetry |
| Evals | Ragas, DeepEval, Promptfoo |
| Prompt versioning | Langfuse, PromptLayer, Git + structured YAML |
| Drift detection | Arize, WhyLabs, custom dashboards on top of traces |
Langfuse and Phoenix are both open-source and OpenTelemetry-compatible, which matters when your security team refuses to send prompts to a SaaS.
Instrument every call with:
```python
from langfuse.decorators import observe
from langfuse.openai import openai  # drop-in wrapper that traces OpenAI calls

# SYSTEM_PROMPT and retrieve() are defined elsewhere in the application.

@observe()
def answer_question(query: str, user_id: str) -> str:
    context = retrieve(query)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context: {context}\n\nQ: {query}"},
        ],
        # Extra fields recorded on the trace for later slicing.
        metadata={"user_id": user_id, "retrieval_k": len(context)},
    )
    return response.choices[0].message.content
```
The metadata fields are what let you slice cost, latency, and quality by tenant, feature, or prompt version six months later.
4. Cost control: model the unit economics on day one
A single chatty agent with GPT-4o and a 32k context can cost €0.50–€2.00 per session. Multiply by 10,000 daily users and the CFO will notice.
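The arithmetic behind that warning is worth writing down explicitly. The snippet below uses the per-session range and user count from the paragraph above; one session per user per day and a 30-day month are assumptions added for the sake of the calculation.

```python
cost_per_session_eur = (0.50, 2.00)  # low and high estimates from the text
daily_users = 10_000
sessions_per_user = 1                # assumption: one session per user per day
days_per_month = 30                  # assumption

daily = tuple(c * daily_users * sessions_per_user for c in cost_per_session_eur)
monthly = tuple(d * days_per_month for d in daily)

print(f"Daily:   €{daily[0]:,.0f} to €{daily[1]:,.0f}")
print(f"Monthly: €{monthly[0]:,.0f} to €{monthly[1]:,.0f}")
```

Under these assumptions the bill lands between €5,000 and €20,000 per day, or €150,000 to €600,000 per month.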
Practical levers, in order of impact:
- Model routing: send 70–80% of traffic to small models (GPT-4o-mini, Claude Haiku, Gemini Flash, Mistral Small). Reserve frontier models for hard queries, detected by a classifier or confidence score.
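- Prompt caching: Anthropic and OpenAI both support it natively. Cached input tokens are billed at a steep discount (up to 90% for Anthropic cache reads; OpenAI's discount depends on the model), which transforms the economics of long system prompts.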
- Semantic caching for repeated queries (GPTCache, Redis with vector similarity).
- Token budgets per request, enforced at the gateway (LiteLLM, Portkey, or a custom proxy).
- Streaming + early termination when the answer is clearly off-track.
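To make the model routing lever concrete, here is a minimal sketch. The model names are placeholders, and `estimate_difficulty` is a toy keyword heuristic standing in for the classifier or confidence score mentioned above; a production router would be trained and evaluated on real traffic.

```python
SMALL_MODEL = "small-model"        # placeholder for a cheap model
FRONTIER_MODEL = "frontier-model"  # placeholder for an expensive model

HARD_MARKERS = ("compare", "why", "step by step", "analyze")

def estimate_difficulty(query: str) -> float:
    """Toy heuristic standing in for a trained difficulty classifier."""
    score = 0.0
    if len(query.split()) > 40:
        score += 0.5  # long queries tend to need more reasoning
    if any(marker in query.lower() for marker in HARD_MARKERS):
        score += 0.5  # comparative or explanatory phrasing
    return score

def route(query: str, threshold: float = 0.5) -> str:
    """Send easy queries to the small model, hard ones to the frontier model."""
    return FRONTIER_MODEL if estimate_difficulty(query) >= threshold else SMALL_MODEL

print(route("What are your opening hours?"))
print(route("Compare the two contracts and explain why clause 4 differs."))
```

The threshold is the knob that sets the traffic split, so it should be tuned against the eval suite rather than guessed.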
Track cost per successful interaction, not cost per token. The two diverge fast.
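A small worked example shows how the two metrics diverge. All numbers below are hypothetical: a small model at €0.01 per request that succeeds 40% of the time, against a frontier model at €0.02 per request that succeeds 90% of the time.

```python
def cost_per_success(total_cost: float, successes: int) -> float:
    """Total spend divided by the number of interactions that met the success bar."""
    return total_cost / successes if successes else float("inf")

requests = 100

# Hypothetical figures for illustration only.
small = cost_per_success(total_cost=requests * 0.010, successes=40)
frontier = cost_per_success(total_cost=requests * 0.020, successes=90)

print(f"small model:    €{small:.4f} per successful interaction")
print(f"frontier model: €{frontier:.4f} per successful interaction")
```

In this made-up scenario the model that costs twice as much per request is the cheaper one per successful interaction, which is why "success" needs a definition your eval suite can score.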
5. Guardrails: defense in depth
Prompt injection, data exfiltration, PII leakage, and toxic output are not theoretical. Layer your defenses:
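- Input filtering: detect prompt injection patterns (NVIDIA NeMo Guardrails, Lakera Guard, or a self-hosted classifier such as Meta's Prompt Guard). Llama Guard 3, also self-hostable, targets unsafe content categories and complements an injection detector rather than replacing one.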
- Output validation: structured output via JSON Schema + Pydantic, plus content classifiers before returning to the user.
- PII redaction on both input and output (Microsoft Presidio is the open-source baseline).
- Least privilege on tools: an agent that can read your CRM should not be able to write to it without an additional approval step.
- Audit log of every prompt, retrieved context, and output, retained per your compliance policy (GDPR, EU AI Act Article 12).
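The layering idea can be shown with the PII redaction step applied on both sides of the model call. This is a deliberately simplified sketch: the two regular expressions are illustrative stand-ins for a real detector such as Presidio and will miss many PII formats, and `guarded_call` and its `llm` argument are hypothetical names.

```python
import re

# Simplified patterns for illustration; a real system should use a dedicated PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

def guarded_call(user_input: str, llm) -> str:
    """Redact before the model sees the input and again before the user sees the output."""
    safe_input = redact_pii(user_input)
    raw_output = llm(safe_input)
    return redact_pii(raw_output)

print(redact_pii("Contact jane.doe@example.com or +33 6 12 34 56 78."))
```

Redacting the output as well as the input matters because retrieved context or model memory can surface PII the user never typed.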
Production readiness checklist
- [ ] Eval suite with ≥100 labeled examples, run on every prompt change
- [ ] Tracing on 100% of production traffic
- [ ] Cost per request dashboard, alerted at 2x baseline
- [ ] Prompt injection tests in CI (Promptfoo or similar)
- [ ] Fallback path when the LLM provider is degraded (multi-provider via LiteLLM)
- [ ] PII redaction verified on a representative sample
- [ ] Documented rollback procedure for prompt and model changes
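The fallback item on the checklist reduces to trying providers in order and surfacing a clear error when all of them fail. The sketch below is plain Python rather than LiteLLM's API; `call_with_fallback` and the two provider functions are hypothetical, and a real setup would add timeouts and retry policies.

```python
def call_with_fallback(prompt, providers):
    """Try each (name, callable) provider in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any provider failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))

def degraded_primary(prompt):
    # Hypothetical provider that is currently failing.
    raise TimeoutError("upstream timeout")

def healthy_secondary(prompt):
    # Hypothetical backup provider.
    return f"answer to: {prompt}"

used, reply = call_with_fallback(
    "ping", [("primary", degraded_primary), ("secondary", healthy_secondary)]
)
print(used, reply)
```

Logging which provider actually served each request keeps the fallback visible in traces, so a silent quality change after failover does not go unnoticed.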
Key takeaways
- RAG quality lives or dies on retrieval, not on the LLM. Invest in hybrid search and reranking before swapping models.
- Agents are a last resort, not a default. Bound them with step limits, typed tools, and human checkpoints.
- Observability with Langfuse, Phoenix, or LangSmith is the prerequisite for everything else — evals, cost control, debugging.
- Model routing and prompt caching are the two highest-ROI cost optimizations available today.
- Guardrails must be layered: input filtering, output validation, PII redaction, and least-privilege tooling. No single control is enough.
