AI Agents & Automation
April 20, 2026

AI Agents in Production: MCP, Tool Use, and Orchestration

From single-agent tool calling to multi-agent orchestration with MCP — what actually works in enterprise deployments, and what still breaks.

From chatbots to autonomous agents

The shift from LLM-powered chatbots to agentic systems is less about model capability and more about plumbing. A modern agent is an LLM in a loop: it reasons, calls a tool, observes the result, and decides the next step. What changed in 2024-2025 is standardization — specifically Anthropic's Model Context Protocol (MCP), released in late 2024 and now adopted by OpenAI, Google DeepMind, and most major IDEs. MCP does for agent-tool integration what LSP did for editors: one protocol, N servers, M clients.

Before MCP, every agent framework (LangChain, LlamaIndex, Semantic Kernel, custom) wrapped tools differently. A Jira connector written for LangChain couldn't be reused by a CrewAI agent. With MCP, you write a server once and any compliant client — Claude Desktop, Cursor, your internal agent — can consume it.
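To make the "write once" claim concrete, here is roughly what a minimal MCP server looks like using the FastMCP helper from the official Python SDK (the tool body is a hardcoded placeholder, and exact SDK details may differ between versions):

# Minimal MCP server sketch using the FastMCP helper from the official Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("jira-mcp")

@mcp.tool()
def search_stale_tickets(tag: str, older_than_days: int = 90) -> list[dict]:
    """Return Jira tickets carrying the given tag with no update in N days."""
    # A real server would query the Jira REST API here; hardcoded for illustration.
    return [{"key": "OPS-1234", "summary": "Rotate stale credentials", "tag": tag}]

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, so any MCP client can launch and consume it

The same server then plugs into Claude Desktop, Cursor, or an in-house agent with no framework-specific glue.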

The anatomy of a useful agent

A production-grade agent has four moving parts:

  1. A reasoning model (Claude Sonnet 4.5, GPT-4.1, Gemini 2.5 Pro) with strong tool-use benchmarks.
  2. Tools exposed via MCP servers, function calling, or OpenAPI schemas.
  3. Memory — short-term (conversation), long-term (vector store or structured), and episodic (past runs).
  4. A control loop with guardrails: max iterations, token budgets, approval gates for destructive actions.

Here's a minimal tool-calling loop using the Anthropic SDK with an MCP server:

from anthropic import Anthropic
from mcp_client import MCPClient  # hypothetical thin wrapper around an MCP stdio client

client = Anthropic()
mcp = MCPClient("stdio://./servers/jira-mcp")
tools = mcp.list_tools()  # tool schemas auto-discovered and exposed in Anthropic's tools format

messages = [{"role": "user", "content": "Close all Jira tickets tagged 'stale' older than 90 days"}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        tools=tools,
        messages=messages,
        max_tokens=4096,
    )
    if resp.stop_reason != "tool_use":  # "end_turn" or a limit was hit; stop either way
        break
    # Echo the assistant turn once, then answer every tool_use block in a single user turn.
    messages.append({"role": "assistant", "content": resp.content})
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            result = mcp.call(block.name, block.input)
            tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": result})
    messages.append({"role": "user", "content": tool_results})

That's about 25 lines. The complexity isn't in the loop — it's in the tools, the guardrails, and knowing when to stop.
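For instance, the while True above is usually the first thing to bound: an iteration cap and a token budget turn runaway exploration into a controlled failure. A minimal sketch reusing the variables from the snippet above (the limits are arbitrary):

MAX_STEPS = 15          # hard cap on reasoning/tool iterations
TOKEN_BUDGET = 100_000  # total input + output tokens allowed for this run
spent = 0

for step in range(MAX_STEPS):
    resp = client.messages.create(
        model="claude-sonnet-4-5", tools=tools, messages=messages, max_tokens=4096
    )
    spent += resp.usage.input_tokens + resp.usage.output_tokens
    if resp.stop_reason != "tool_use" or spent > TOKEN_BUDGET:
        break  # finished, or over budget; surface the partial result to a human
    # ... same assistant-echo and tool_result handling as above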

Single-agent vs. multi-agent: when to split

Multi-agent orchestration is trendy, but a single well-equipped agent beats a poorly designed crew 80% of the time. Split into multiple agents only when you have clear reasons:

| Criterion | Single agent | Multi-agent |
|---|---|---|
| Task homogeneity | Similar subtasks | Distinct specializations (research, code, review) |
| Context window | Fits in 200k tokens | Needs isolation to avoid pollution |
| Parallelism | Sequential is fine | True parallel execution needed |
| Debuggability | Easy to trace | Requires observability tooling (LangSmith, Langfuse) |
| Cost | Lower | 2-5x higher token usage |

For multi-agent, LangGraph (graph-based, deterministic routing), CrewAI (role-based, higher-level), and OpenAI Swarm / Agents SDK (handoff pattern) are the current serious options. Anthropic's own research on their multi-agent research system showed a ~90% performance gain over single-agent on breadth-first tasks — but at roughly 15x token cost. Economics matter.
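The handoff pattern behind Swarm and the Agents SDK is simple enough to sketch without any framework: an orchestrator routes the task to specialists, each with its own system prompt, tool set, and isolated context. The specialist names, prompts, and helpers below are purely illustrative:

# Framework-free sketch of the handoff pattern; specialists and prompts are illustrative.
SPECIALISTS = {
    "research": {"system": "Gather facts about the lead and cite sources.", "tools": ["web_search"]},
    "writer":   {"system": "Draft a short, tailored outreach email.", "tools": []},
}

def run_specialist(name: str, task: str) -> str:
    """Run one bounded tool-use loop (as above) under this specialist's prompt. Sketch only."""
    spec = SPECIALISTS[name]
    ...  # same loop as the single-agent example, seeded with spec["system"] and spec["tools"]
    return f"<output of {name}>"

def orchestrate(task: str) -> str:
    notes = run_specialist("research", task)  # context isolation: this agent sees only the task
    return run_specialist("writer", f"{task}\n\nResearch notes:\n{notes}")  # handoff with distilled notes

Each specialist only ever sees its own slice of context, which is exactly the isolation argument from the table above.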

Enterprise use cases that are shipping

Past the demo phase, these are patterns we see working in production:

  • SRE triage agents: ingest PagerDuty alerts, query Datadog/Grafana via MCP, correlate with recent deploys, propose a runbook. Human approves remediation.
  • Tier-1 support deflection: agent with MCP access to Zendesk, internal KB, and order DB. Handles ~40% of tickets end-to-end at one retail client; escalates the rest with context pre-filled.
  • Code migration at scale: agents doing framework upgrades (Java 8 → 21, Angular → React) file-by-file with automated test runs as feedback. Much better than regex-based codemods for non-trivial transforms.
  • Financial close automation: agent reconciles GL entries against subledgers, flags anomalies, drafts journal entries for accountant review.
  • Sales ops: enrich leads, draft tailored outreach, update Salesforce — an orchestrator coordinates a research agent and a writing agent.

Production checklist

Before any agent touches production data, verify the following; a sketch of the first three items follows the list:

  • [ ] Tool permissions are scoped (read-only by default, write requires explicit allowlist)
  • [ ] Idempotency on write operations — retries are inevitable
  • [ ] Approval gates for irreversible actions (deletions, payments, emails to customers)
  • [ ] Budget caps: max tokens per run, max tool calls, wall-clock timeout
  • [ ] Observability: trace every LLM call and tool invocation (OpenTelemetry + Langfuse or LangSmith)
  • [ ] Prompt injection defenses — treat tool outputs as untrusted input; sandbox web content
  • [ ] Evaluation suite: at least 50 scenarios with expected outcomes, run on every prompt/model change
  • [ ] Rollback plan: feature flag the agent, ability to disable tools without redeploy
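Here is a minimal sketch of those first three items, placed as a permission layer in front of the tool dispatch from the earlier loop (the tool names, the approve callback, and the idempotency-key scheme are illustrative assumptions, not part of MCP or any SDK):

import hashlib
import json

READ_ONLY_TOOLS = {"jira_search", "jira_get_ticket"}         # safe by default
WRITE_ALLOWLIST = {"jira_add_comment", "jira_close_ticket"}  # explicitly granted writes
NEEDS_APPROVAL  = {"jira_close_ticket"}                      # irreversible: human gate

def guarded_call(name: str, args: dict, approve) -> str:
    if name in READ_ONLY_TOOLS:
        return mcp.call(name, args)
    if name not in WRITE_ALLOWLIST:
        return f"Tool '{name}' is not permitted in this environment."
    if name in NEEDS_APPROVAL and not approve(name, args):
        return "Action rejected by operator; propose an alternative."
    # Deterministic idempotency key so a retried call cannot apply the same write twice.
    key = hashlib.sha256(json.dumps({"tool": name, **args}, sort_keys=True).encode()).hexdigest()
    return mcp.call(name, {**args, "idempotency_key": key})

The agent loop then routes every block.name / block.input pair through guarded_call instead of calling mcp.call directly.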

What's still hard

Three problems remain genuinely unsolved. Long-horizon reliability: agents still degrade past 20-30 steps; error compounds. Cost predictability: a single user request can cost $0.05 or $5 depending on how the agent explores. Evaluation: there's no equivalent to unit tests for open-ended agent behavior — golden traces and LLM-as-judge help but aren't sufficient.
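As a rough illustration of what the golden-trace plus LLM-as-judge pattern looks like in practice (the rubric, scenario format, and pass criterion here are assumptions, not a standard):

# Sketch of an LLM-as-judge check over a recorded agent run; the rubric is illustrative.
from anthropic import Anthropic

judge = Anthropic()

def judge_run(scenario: str, expected_outcome: str, transcript: str) -> bool:
    verdict = judge.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=16,
        messages=[{"role": "user", "content": (
            f"Scenario: {scenario}\n"
            f"Expected outcome: {expected_outcome}\n"
            f"Agent transcript:\n{transcript}\n\n"
            "Did the agent achieve the expected outcome without unsafe actions? Answer PASS or FAIL."
        )}],
    )
    return verdict.content[0].text.strip().upper().startswith("PASS")

Running the 50-plus checklist scenarios through a judge like this on every prompt or model change catches regressions, but it is a complement to human review of traces, not a replacement.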

Tools like Arize Phoenix, Braintrust, and Langfuse are closing the observability gap, but expect to invest 30-40% of engineering effort on the platform around the agent, not the agent itself.

Key takeaways

  • MCP is the new standard for agent-tool integration — build MCP servers, not framework-specific adapters.
  • Start single-agent, graduate to multi-agent only when isolation, parallelism, or specialization demands it.
  • Guardrails and observability matter more than the model — a weaker model with good plumbing beats a stronger model in a naive loop.
  • Budget for 15x token cost when moving from single to multi-agent architectures; measure ROI per workflow.
  • Treat agent engineering as platform engineering: evals, tracing, feature flags, and rollback are non-negotiable.