Building Production AI Agents: MCP, Tools & Orchestration
Autonomous agents are leaving the demo phase. Here's what actually works in production: MCP, tool use patterns, and multi-agent orchestration.
The agent hype of 2023 produced impressive demos and very few production systems. Two years later, the landscape has matured: Anthropic's Model Context Protocol (MCP) is becoming a de facto standard for tool integration, OpenAI shipped the Agents SDK and Responses API, and frameworks like LangGraph, CrewAI and Pydantic AI are running real workloads. This post is a pragmatic look at what's working in enterprise deployments — and where teams are still hitting walls.
From chatbots to agents: what actually changed
An agent is not just an LLM with a system prompt. The working definition we use at DCT:
> A system where an LLM dynamically decides the control flow, calls tools to interact with external systems, and iterates until a goal is reached or a stop condition triggers.
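In code, that definition reduces to a loop. A minimal sketch, with `llm` as a stand-in for a real function-calling API and `tools` as plain Python callables (both hypothetical):

```python
def run_agent(goal: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Minimal agent loop: the model decides the next step until done or budget hit.

    `llm` maps the transcript to either {"final": "..."} or
    {"tool": name, "args": {...}} -- a deliberately simplified stand-in
    for a real function-calling API.
    """
    transcript = [{"role": "user", "content": goal}]
    for _ in range(max_steps):                         # stop condition: step budget
        step = llm(transcript)                         # LLM decides the control flow
        if "final" in step:
            return step["final"]                       # goal reached
        result = tools[step["tool"]](**step["args"])   # act on an external system
        transcript.append({"role": "tool", "content": str(result)})
    return "stopped: step limit reached"               # never loop forever
```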
Three shifts made this viable in 2024-2025:
- Reliable tool calling — Claude 3.5+, GPT-4.1 and Gemini 2.0 hit >95% accuracy on structured function calls.
- Long context with caching — 200K+ token windows with prompt caching cut costs by 5-10x for agent loops (example after this list).
- MCP standardization — instead of writing custom adapters for every system, you expose capabilities through a single protocol.
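To make the caching point concrete, here is a sketch using Anthropic's prompt caching: the stable prefix (system prompt, tool docs) is marked cacheable so each agent iteration reuses it instead of re-paying for it. `LONG_SYSTEM_PROMPT` is a placeholder, and the model alias should be checked against current IDs:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

LONG_SYSTEM_PROMPT = "..."  # placeholder: instructions + tool docs, thousands of tokens

response = client.messages.create(
    model="claude-sonnet-4-5",  # alias; check current model IDs
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,              # stable across agent iterations
        "cache_control": {"type": "ephemeral"},  # cache this prefix between calls
    }],
    messages=[{"role": "user", "content": "Process invoice INV-1042."}],
)
print(response.usage)  # cache_read_input_tokens shows the savings on later calls
```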
MCP: the USB-C of tool integration
Before MCP, every agent framework had its own tool format. Want your agent to query Postgres, call Jira, and read SharePoint? Three custom integrations, three sets of auth, three maintenance burdens.
MCP defines a client-server protocol where servers expose tools, resources and prompts. The same MCP server works with Claude Desktop, Cursor, your custom LangGraph agent, or anything else that speaks the protocol.
A minimal MCP server in Python:
```python
import os

import asyncpg
from mcp.server.fastmcp import FastMCP

# Connection string for the orders database, e.g. exported as DATABASE_URL.
DSN = os.environ["DATABASE_URL"]

mcp = FastMCP("company-data")

@mcp.tool()
async def query_orders(customer_id: str, limit: int = 10) -> list[dict]:
    """Fetch recent orders for a given customer."""
    conn = await asyncpg.connect(DSN)
    try:
        # Parameterized query: the agent never injects raw SQL.
        rows = await conn.fetch(
            "SELECT id, total, status, created_at FROM orders "
            "WHERE customer_id = $1 ORDER BY created_at DESC LIMIT $2",
            customer_id, limit,
        )
    finally:
        await conn.close()
    return [dict(r) for r in rows]

if __name__ == "__main__":
    mcp.run(transport="stdio")
```
That's the entire integration. Any MCP-compatible client can now discover and call query_orders, with schema validation handled by the protocol and with auth boundaries and audit logging enforced in one place: the server.
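On the client side, any MCP-speaking host can launch that server and invoke the tool. A sketch with the official mcp Python SDK, assuming the server above is saved as server.py (the customer id is illustrative):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server as a subprocess and talk to it over stdio.
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()    # discover exposed tools
            print([t.name for t in tools.tools])  # -> ['query_orders']
            result = await session.call_tool(
                "query_orders", {"customer_id": "C-1042", "limit": 5}
            )
            print(result.content)

asyncio.run(main())
```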
Single agent vs. multi-agent: choose carefully
Multi-agent architectures are oversold. Most enterprise problems are solved better by one well-designed agent with good tools than by five agents passing messages around.
| Pattern | When to use | Risk |
|---|---|---|
| Single agent + tools | Linear or branching tasks, one domain | Tool count > 30 hurts accuracy |
| Supervisor + workers | Parallelizable subtasks, distinct skills | Coordination overhead, token cost 3-5x |
| Sequential pipeline | Deterministic stages (extract → validate → write) | Often a workflow, not an agent |
| Swarm / handoff | Specialist routing (sales → support → billing) | Hard to debug, prone to loops |
A good heuristic from Anthropic's engineering team: start with the simplest workflow that works, add agentic behavior only where the task genuinely requires dynamic decisions.
A real case: invoice processing at a mid-market manufacturer
One of our clients processes ~8,000 supplier invoices per month. The previous OCR + rules pipeline achieved 72% straight-through processing (STP). The agent-based replacement, built on LangGraph with three MCP servers (ERP, document store, vendor master), reached 91% STP after six weeks of tuning.
Key design choices:
- One agent, not many. A supervisor pattern was tested first and produced more loops without accuracy gains.
- Bounded tool set (11 tools): read invoice, match PO, check vendor, post to ERP, escalate to human, etc.
- Hard step limit of 15 iterations with a fallback to human review.
- Structured output at the boundary — the agent's final state is a Pydantic model, not free text (see the sketch after this list).
- Eval harness with 400 labeled invoices, run on every prompt or model change.
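A condensed sketch of how the step limit and the structured boundary fit together: a hard iteration budget wrapped around an opaque `step` function (a stand-in for the LangGraph graph), with a Pydantic model as the only exit type. Field names and helpers are hypothetical, not the client's actual schema:

```python
from typing import Callable, Literal

from pydantic import BaseModel, Field

class InvoiceResult(BaseModel):
    """Final agent state at the boundary; fields are illustrative."""
    invoice_id: str
    action: Literal["posted", "escalated"]
    confidence: float = Field(ge=0.0, le=1.0)
    notes: str = ""

MAX_STEPS = 15  # the hard iteration limit from the design above

def run_invoice_agent(invoice: dict, step: Callable[[dict], dict]) -> InvoiceResult:
    """`step` advances the graph one node at a time and sets
    state["done"] / state["result"] when the agent finishes."""
    state = {"invoice": invoice, "done": False}
    for _ in range(MAX_STEPS):
        state = step(state)
        if state["done"]:
            # Validation fails loudly if the model emitted a malformed result.
            return InvoiceResult.model_validate(state["result"])
    # Step budget exhausted: route to human review instead of guessing.
    return InvoiceResult(invoice_id=invoice["id"], action="escalated",
                         confidence=0.0, notes="step limit reached")
```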
Cost: ~$0.04 per invoice on Claude Sonnet 4.5 with prompt caching. ROI was positive in month two.
Production checklist
Before you put an agent in front of real users or real money, verify:
- [ ] Eval set with at least 100 representative cases, scored automatically
- [ ] Step and budget limits (max iterations, max tokens, max tool calls)
- [ ] Idempotent or reversible tool calls — or human approval before destructive actions
- [ ] Structured tracing (LangSmith, Langfuse, Arize) — you cannot debug agents without it
- [ ] Auth scoping — the agent has only the permissions a junior employee on that task would have
- [ ] Fallback path to human or deterministic logic when confidence is low
- [ ] Cost ceiling per session, enforced at the orchestration layer (see the sketch after this checklist)
- [ ] Prompt and model version pinning in the deployment manifest
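For the cost ceiling, a minimal sketch of a per-session budget object enforced at the orchestration layer; the ceilings are illustrative defaults, not recommendations:

```python
from dataclasses import dataclass

class BudgetExceeded(RuntimeError):
    """Raised by the orchestration layer to halt the session cleanly."""

@dataclass
class SessionBudget:
    # Ceilings are illustrative; tune them per task and per model.
    max_tool_calls: int = 25
    max_tokens: int = 150_000
    max_cost_usd: float = 0.50
    tool_calls: int = 0
    tokens: int = 0
    cost_usd: float = 0.0

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Call after every model/tool round trip; raises once any ceiling is hit."""
        self.tool_calls += 1
        self.tokens += tokens
        self.cost_usd += cost_usd
        if (self.tool_calls > self.max_tool_calls
                or self.tokens > self.max_tokens
                or self.cost_usd > self.max_cost_usd):
            raise BudgetExceeded(f"session budget exceeded: {self}")
```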
Where agents still fail
Be honest with stakeholders about the failure modes that haven't been solved:
- Long-horizon planning beyond 20-30 steps degrades sharply.
- Self-correction loops — agents often repeat the same wrong action with slight variations.
- Multi-modal grounding — reading complex tables and diagrams remains noisy.
- Cost predictability — a single bad reasoning chain can 10x the cost of a request.
These are addressed with engineering (limits, evals, human-in-the-loop), not by waiting for a smarter model.
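As one example of that engineering, a small guard against the repeated-action loop: fingerprint each tool call and escalate once the same action recurs too often. This catches exact repeats; fuzzier matching (e.g. ignoring volatile argument fields) is a natural extension. Thresholds and the hashing scheme are illustrative:

```python
import hashlib
import json
from collections import Counter

class LoopGuard:
    """Breaks the 'same wrong action, slight variation' cycle by counting
    exact repeats of (tool, canonicalized args)."""

    def __init__(self, max_repeats: int = 3):
        self.counts: Counter[str] = Counter()
        self.max_repeats = max_repeats

    def allow(self, tool: str, args: dict) -> bool:
        """Return False once the same call has been attempted too often;
        the orchestrator should then escalate to a human instead of retrying."""
        fp = hashlib.sha1(
            f"{tool}:{json.dumps(args, sort_keys=True)}".encode()
        ).hexdigest()
        self.counts[fp] += 1
        return self.counts[fp] <= self.max_repeats
```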
Key takeaways
- MCP is worth adopting now — it decouples your tool integrations from any specific agent framework or model vendor.
- Default to a single agent with a well-designed tool set. Reach for multi-agent only when subtasks are genuinely parallel or require distinct expertise.
- Evals and tracing are non-negotiable. Without them, you're shipping a system you cannot debug or improve.
- Bound everything — steps, tokens, cost, blast radius of tool calls. Agents will surprise you.
- Measure ROI on real workloads, not demos. The 70-to-90% accuracy jump is where the business case lives, and it requires weeks of tuning, not a weekend hackathon.