AI Agents in Production: MCP, Tool Use, and Orchestration
From single-agent tool calling to multi-agent orchestration with MCP — what actually works in enterprise deployments, and what still breaks.
From chatbots to autonomous agents
The shift from LLM-powered chatbots to agentic systems is less about model capability and more about plumbing. A modern agent is an LLM in a loop: it reasons, calls a tool, observes the result, and decides the next step. What changed in 2024-2025 is standardization — specifically Anthropic's Model Context Protocol (MCP), released in late 2024 and now adopted by OpenAI, Google DeepMind, and most major IDEs. MCP does for agent-tool integration what LSP did for editors: one protocol, N servers, M clients.
Before MCP, every agent framework (LangChain, LlamaIndex, Semantic Kernel, custom) wrapped tools differently. A Jira connector written for LangChain couldn't be reused by a CrewAI agent. With MCP, you write a server once and any compliant client — Claude Desktop, Cursor, your internal agent — can consume it.
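Concretely, what an MCP client discovers from a server is a list of tool descriptors: a name, a description, and a JSON Schema for the inputs. The sketch below shows that shape as a plain Python dict (the Jira-specific names are illustrative, not a real server's output), plus a minimal check that a proposed call satisfies the schema's required fields:

```python
# Sketch of an MCP tool descriptor as a plain dict. The field names
# (name, description, inputSchema) follow the shape of an MCP tools/list
# response; the Jira-specific tool and fields are illustrative.
close_ticket_tool = {
    "name": "jira_close_ticket",
    "description": "Close a Jira ticket by key, with an optional resolution comment.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "ticket_key": {"type": "string", "description": "e.g. 'OPS-1234'"},
            "comment": {"type": "string"},
        },
        "required": ["ticket_key"],
    },
}

def has_required_args(tool: dict, args: dict) -> bool:
    """Check that every field the tool's schema marks as required is present."""
    required = tool["inputSchema"].get("required", [])
    return all(field in args for field in required)
```

Because the descriptor is just data, any compliant client can render it into its own function-calling format — that's the whole trick.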
The anatomy of a useful agent
A production-grade agent has four moving parts:
- A reasoning model (Claude Sonnet 4.5, GPT-4.1, Gemini 2.5 Pro) with strong tool-use benchmarks.
- Tools exposed via MCP servers, function calling, or OpenAPI schemas.
- Memory — short-term (conversation), long-term (vector store or structured), and episodic (past runs).
- A control loop with guardrails: max iterations, token budgets, approval gates for destructive actions.
Here's a minimal tool-calling loop using the Anthropic SDK with an MCP server (`MCPClient` stands in for your MCP client wrapper, such as the official `mcp` SDK; the stdio URL points at a local Jira MCP server):

```python
from anthropic import Anthropic
from mcp_client import MCPClient  # placeholder for your MCP client wrapper

client = Anthropic()
mcp = MCPClient("stdio://./servers/jira-mcp")
tools = mcp.list_tools()  # tool schemas auto-discovered from the server

messages = [{"role": "user", "content": "Close all Jira tickets tagged 'stale' older than 90 days"}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        tools=tools,
        messages=messages,
        max_tokens=4096,
    )
    if resp.stop_reason != "tool_use":
        break  # end_turn (or max_tokens): the model has no more tools to call

    # Append the assistant turn once, then answer every tool_use block
    # in a single user turn — one tool_result per tool_use_id.
    messages.append({"role": "assistant", "content": resp.content})
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            result = mcp.call(block.name, block.input)
            tool_results.append(
                {"type": "tool_result", "tool_use_id": block.id, "content": result}
            )
    messages.append({"role": "user", "content": tool_results})
```
That's roughly two dozen lines. The complexity isn't in the loop — it's in the tools, the guardrails, and knowing when to stop.
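The guardrails mentioned above (max iterations, budgets, write allowlists) can be factored into a small state object that the loop consults on every step. This is a sketch under assumed policies, not a library API — the class and thresholds are illustrative:

```python
class BudgetExceeded(Exception):
    """Raised when the agent run exhausts an iteration or tool-call budget."""

class Guardrails:
    """Illustrative guardrail state for an agent loop: caps on iterations
    and tool calls, plus an explicit allowlist for write-capable tools."""

    def __init__(self, max_iterations=10, max_tool_calls=25, write_allowlist=()):
        self.max_iterations = max_iterations
        self.max_tool_calls = max_tool_calls
        self.write_allowlist = set(write_allowlist)
        self.iterations = 0
        self.tool_calls = 0

    def check_iteration(self):
        """Call once per loop turn, before hitting the model."""
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise BudgetExceeded(f"exceeded {self.max_iterations} iterations")

    def check_tool_call(self, tool_name: str, is_write: bool):
        """Call before executing a tool; writes must be allowlisted."""
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"exceeded {self.max_tool_calls} tool calls")
        if is_write and tool_name not in self.write_allowlist:
            raise PermissionError(f"write tool '{tool_name}' not allowlisted")
```

In practice you'd also track spent tokens from each API response and enforce a wall-clock timeout, but the pattern is the same: fail closed, loudly, before the side effect.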
Single-agent vs. multi-agent: when to split
Multi-agent orchestration is trendy, but a single well-equipped agent beats a poorly designed crew 80% of the time. Split into multiple agents only when you have clear reasons:
| Criterion | Single agent | Multi-agent |
|---|---|---|
| Task homogeneity | Similar subtasks | Distinct specializations (research, code, review) |
| Context window | Fits in 200k tokens | Needs isolation to avoid pollution |
| Parallelism | Sequential is fine | True parallel execution needed |
| Debuggability | Easy to trace | Requires observability tooling (LangSmith, Langfuse) |
| Cost | Lower | 2-5x higher token usage |
For multi-agent, LangGraph (graph-based, deterministic routing), CrewAI (role-based, higher-level), and OpenAI Swarm / Agents SDK (handoff pattern) are the current serious options. Anthropic's own research on their multi-agent research system showed a ~90% performance gain over single-agent on breadth-first tasks — but at roughly 15x token cost. Economics matter.
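The graph-based, deterministic-routing idea behind frameworks like LangGraph can be sketched in plain Python (this is the pattern, not LangGraph's actual API): each node is a function that mutates shared state and names the next node, so every transition is explicit and traceable.

```python
# Deterministic graph routing, sketched without any framework.
# Each node returns the name of the next node, or None to terminate.
def research(state):
    state["notes"] = f"findings on {state['topic']}"
    return "write"

def write(state):
    state["draft"] = f"report based on {state['notes']}"
    return "review"

def review(state):
    state["approved"] = "findings" in state["draft"]
    return None  # terminal node

GRAPH = {"research": research, "write": write, "review": review}

def run(state, entry="research", max_steps=10):
    """Walk the graph from the entry node, with a hard step budget."""
    node = entry
    for _ in range(max_steps):
        if node is None:
            return state
        node = GRAPH[node](state)
    raise RuntimeError("step budget exhausted")
```

The payoff of making routing data (`GRAPH`) rather than emergent LLM behavior is debuggability: a trace is just the sequence of node names, which is exactly what the table's "Debuggability" row is about.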
Enterprise use cases that are shipping
Past the demo phase, these are patterns we see working in production:
- SRE triage agents: ingest PagerDuty alerts, query Datadog/Grafana via MCP, correlate with recent deploys, propose a runbook. Human approves remediation.
- Tier-1 support deflection: agent with MCP access to Zendesk, internal KB, and order DB. Handles ~40% of tickets end-to-end at one retail client; escalates the rest with context pre-filled.
- Code migration at scale: agents doing framework upgrades (Java 8 → 21, Angular → React) file-by-file with automated test runs as feedback. Much better than regex-based codemods for non-trivial transforms.
- Financial close automation: agent reconciles GL entries against subledgers, flags anomalies, drafts journal entries for accountant review.
- Sales ops: enrich leads, draft tailored outreach, update Salesforce — an orchestrator coordinates a research agent and a writing agent.
Production checklist
Before any agent touches production data, verify:
- [ ] Tool permissions are scoped (read-only by default, write requires explicit allowlist)
- [ ] Idempotency on write operations — retries are inevitable
- [ ] Approval gates for irreversible actions (deletions, payments, emails to customers)
- [ ] Budget caps: max tokens per run, max tool calls, wall-clock timeout
- [ ] Observability: trace every LLM call and tool invocation (OpenTelemetry + Langfuse or LangSmith)
- [ ] Prompt injection defenses — treat tool outputs as untrusted input; sandbox web content
- [ ] Evaluation suite: at least 50 scenarios with expected outcomes, run on every prompt/model change
- [ ] Rollback plan: feature flag the agent, ability to disable tools without redeploy
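The idempotency item deserves a concrete shape, since agent retries are the norm rather than the exception. A common pattern is to derive an idempotency key from the logical write (run id + operation + arguments) and execute the side effect at most once per key. The sketch below uses an in-memory store; production would back this with something durable (a Redis set or a DB table):

```python
import hashlib
import json

class IdempotentWriter:
    """Sketch of idempotent writes for agent tools: the same logical write
    (same run id, operation, and arguments) executes at most once; replays
    return the cached result instead of repeating the side effect."""

    def __init__(self):
        self._seen = {}
        self.executions = 0  # counts real side effects, for illustration

    def write(self, run_id: str, operation: str, args: dict):
        key = hashlib.sha256(
            json.dumps([run_id, operation, args], sort_keys=True).encode()
        ).hexdigest()
        if key in self._seen:
            return self._seen[key]  # replay: no second side effect
        self.executions += 1        # the actual write would happen here
        result = {"status": "ok", "operation": operation}
        self._seen[key] = result
        return result
```

Note that the run id is part of the key on purpose: "close OPS-1" issued twice within one run is a retry, but issued in a later run it is a new intent.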
What's still hard
Three problems remain genuinely unsolved. Long-horizon reliability: agents still degrade past 20-30 steps; errors compound. Cost predictability: a single user request can cost $0.05 or $5 depending on how the agent explores. Evaluation: there's no equivalent to unit tests for open-ended agent behavior — golden traces and LLM-as-judge help but aren't sufficient.
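Golden traces are still worth having even though they're insufficient. A minimal version scores an agent run by how many of the expected tool calls (name plus key arguments) appear, in order, in the actual trace. The function and trace format below are an illustrative sketch, not a standard:

```python
def score_trace(golden: list, actual: list) -> float:
    """Fraction of expected tool calls matched in order in the actual trace.
    A golden call matches an actual call when names are equal and the golden
    args are a subset of the actual args. Crude, but it catches regressions
    when prompts or models change."""
    i = 0
    for call in actual:
        if (
            i < len(golden)
            and call["name"] == golden[i]["name"]
            and golden[i].get("args", {}).items() <= call.get("args", {}).items()
        ):
            i += 1
    return i / len(golden) if golden else 1.0
```

Run this over your 50+ scenarios on every prompt or model change; a dropping score won't tell you the answer got worse, but it reliably tells you the behavior changed.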
Tools like Arize Phoenix, Braintrust, and Langfuse are closing the observability gap, but expect to invest 30-40% of engineering effort on the platform around the agent, not the agent itself.
Key takeaways
- MCP is the new standard for agent-tool integration — build MCP servers, not framework-specific adapters.
- Start single-agent, graduate to multi-agent only when isolation, parallelism, or specialization demands it.
- Guardrails and observability matter more than the model — a weaker model with good plumbing beats a stronger model in a naive loop.
- Budget for 15x token cost when moving from single to multi-agent architectures; measure ROI per workflow.
- Treat agent engineering as platform engineering: evals, tracing, feature flags, and rollback are non-negotiable.