
Every week, someone publishes a demo of an AI agent that browses the web, writes code, manages a calendar, and books flights. The demo takes four minutes. The real question – can this run reliably for thousands of users in production? – is never answered, because the answer is usually no. Not yet.
The gap between a working agent demo and a production agent system is enormous. It is not a matter of polish. It is a different category of engineering entirely. Demos operate in controlled environments with happy-path inputs. Production operates in the real world, where users do unexpected things, APIs fail, models hallucinate, costs spiral, and the agent needs to recover gracefully from all of it.
At CONFLICT, we have built agentic systems for clients across domains, from automated customer support workflows to infrastructure management tools. The engineering challenges are consistent regardless of the domain. This article covers the ones that matter most and the patterns we have developed to address them.
A simple LLM call is stateless: input goes in, output comes out. An agent is stateful by definition. It maintains a plan, tracks progress, remembers previous tool calls, and adapts based on intermediate results. Managing that state reliably is the first production challenge.
In a demo, state lives in memory. In production, state needs to survive process restarts, handle concurrent executions, and be inspectable for debugging. This means persisting state to a durable store at every step.
The naive approach is to serialize the entire agent state to a database after each action. This works until your state object gets large (long conversation histories, accumulated tool results) and serialization/deserialization becomes a bottleneck. The better approach is a state machine with explicit transitions:
class AgentState:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.status = "initialized"
        self.plan = []
        self.completed_steps = []
        self.current_step = None
        self.context = {}
        self.error_history = []

    def transition(self, new_status: str, step_result: dict = None):
        self.status = new_status
        if step_result:
            self.completed_steps.append(step_result)
        self._persist()

    def _persist(self):
        # Write to a durable store -- database, Redis, etc.
        store.save(self.task_id, self.serialize())
Each state transition is an explicit, logged event. If the process crashes between steps, you know exactly where it was. If a step needs to be retried, the state machine knows what has already been completed.
The state machine also gives you visibility. When a user asks “what is my agent doing?” you can answer with the current state and step history. When a support engineer needs to debug a stuck agent, they can inspect the state rather than reading through logs.
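Crash recovery then becomes a matter of reloading the last persisted snapshot and continuing from the first incomplete step. A minimal sketch, assuming the durable store exposes a load counterpart to the save call above, that AgentState has a matching deserialize method, that plan entries and recorded results carry a step name, and that run_step is your own step runner:

def resume_task(task_id: str) -> AgentState:
    # Reload the last persisted snapshot; store.load and AgentState.deserialize
    # are assumed counterparts to the store.save / serialize calls above.
    raw = store.load(task_id)
    state = AgentState.deserialize(raw)

    # Steps already recorded in completed_steps are skipped on resume,
    # so a crash between steps never repeats finished work.
    done = {step["name"] for step in state.completed_steps}
    remaining = [step for step in state.plan if step["name"] not in done]

    for step in remaining:
        state.current_step = step
        state.transition("running")
        result = run_step(step, state)    # hypothetical step runner
        state.transition("step_complete", step_result=result)

    state.transition("completed")
    return state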
Agents do things by calling tools – APIs, databases, file systems, other services. In a demo, tool calls work because the demo data is clean and the APIs are up. In production, tool calls fail for dozens of reasons: authentication errors, rate limits, timeouts, malformed inputs, unexpected response formats, service outages.
Every tool call in a production agent needs:
Input validation. Before calling a tool, validate that the agent’s proposed inputs are well-formed. Models sometimes generate malformed tool calls – missing required parameters, wrong data types, values that violate constraints. Catching these before the call is cheaper than handling the downstream error.
Timeout management. Every tool call needs a timeout. An agent that waits indefinitely for a hung API call is an agent that is stuck forever. Set timeouts per tool based on expected response times, and handle timeout errors as a normal part of the retry logic.
Retry with backoff. Transient failures should be retried automatically. Implement exponential backoff with jitter to avoid thundering herd problems when a service recovers from an outage.
Result validation. After a tool call returns, validate the result. Did the API return the expected structure? Are the values within expected ranges? An agent that blindly trusts tool results will make decisions based on garbage data.
Idempotency. If a tool call might be retried (because of a timeout, a process restart, or an error), the call needs to be idempotent – calling it twice should produce the same result as calling it once. For read operations, this is trivial. For write operations (creating records, sending messages, making payments), it requires careful design with idempotency keys, as sketched below.
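For writes, one approach is to derive the idempotency key from the task, the step, and the arguments, so a retried call is deduplicated server-side. A minimal sketch, assuming a hypothetical payments_api client that accepts an idempotency_key parameter:

import hashlib

def create_refund(agent_state: AgentState, order_id: str, amount: float):
    # Derive a stable key from task + step + arguments: a retry of the same
    # step produces the same key, so the backend can deduplicate the write.
    key_material = f"{agent_state.task_id}:create_refund:{order_id}:{amount}"
    idempotency_key = hashlib.sha256(key_material.encode()).hexdigest()

    # payments_api is a stand-in for whatever write-capable tool the agent uses;
    # most payment and messaging APIs accept an idempotency key like this.
    return payments_api.refunds.create(
        order_id=order_id,
        amount=amount,
        idempotency_key=idempotency_key,
    )

With validation, timeouts, retries, and idempotency in place, the pieces come together in a tool executor: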
from time import sleep

class ToolExecutor:
    def execute(self, tool_name: str, params: dict, agent_state: AgentState) -> dict:
        tool = self.registry.get(tool_name)
        if tool is None:
            return {"status": "error", "reason": f"Unknown tool: {tool_name}"}

        # Input validation: reject malformed tool calls before spending an API call
        validated = tool.validate_input(params)
        if not validated.is_valid:
            return {"status": "error", "reason": validated.errors}

        # Execute with retry
        for attempt in range(tool.max_retries):
            try:
                result = tool.call(validated.params, timeout=tool.timeout_seconds)
                # Result validation: never trust a raw tool response
                if tool.validate_output(result):
                    return {"status": "success", "data": result}
                else:
                    return {"status": "error", "reason": "Invalid tool output"}
            except TimeoutError:
                if attempt < tool.max_retries - 1:
                    sleep(backoff(attempt))  # exponential backoff with jitter, shown below
                    continue
                return {"status": "timeout", "attempts": attempt + 1}
            except RateLimitError:
                sleep(backoff(attempt, base=30))
                continue
            except Exception as e:
                # Non-retryable failure: record it in the agent's error history
                agent_state.error_history.append({
                    "tool": tool_name, "error": str(e), "attempt": attempt
                })
                return {"status": "error", "reason": str(e)}

        # All attempts consumed (e.g. repeated rate limiting) without a result
        return {"status": "error", "reason": "Retries exhausted"}
How an agent handles failure separates production systems from demos. Demos do not fail because they are scripted. Production agents fail constantly, and the question is whether they recover intelligently.
Error recovery in agents happens at three levels:
Tool-level recovery. A single tool call fails. The retry logic handles transient errors. For permanent errors (invalid input, missing resource), the agent needs to adapt its plan. If the “get_customer_record” tool returns a 404, the agent should not retry – it should recognize that the customer does not exist and adjust its approach.
Step-level recovery. An agent step fails after all retries are exhausted. The agent needs to decide: skip this step and continue? Try an alternative approach? Escalate to a human? This decision should not be left to the LLM alone. Encode recovery strategies in your step definitions:
step_config = {
    "name": "fetch_customer_data",
    "tool": "crm_api",
    "on_failure": {
        "strategy": "alternative",
        "alternatives": ["cache_lookup", "manual_input_request"],
        "max_alternatives": 2,
        "escalate_after_exhaustion": True
    }
}
Plan-level recovery. The entire plan becomes invalid because of changed conditions. The customer canceled their order while the agent was processing it. The API the agent depends on has been deprecated. At this level, the agent needs to re-plan from its current state, not restart from scratch. This is where checkpointing becomes critical – the agent saves enough state that re-planning starts from progress already made, not from zero.
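Re-planning from a checkpoint can be as simple as handing the planner everything the agent has already accomplished. A sketch, where planner.plan stands in for your own LLM planning call:

def replan(state: AgentState, planner) -> None:
    # Summarize progress so the planner starts from the checkpoint, not from zero.
    progress = {
        "completed_steps": state.completed_steps,
        "context": state.context,
        "errors": state.error_history,
    }
    # planner.plan is a stand-in for your LLM planning call; it receives the
    # original task plus everything already accomplished.
    state.plan = planner.plan(task_id=state.task_id, progress=progress)
    state.transition("replanned")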
The key design principle is that error handling should be deterministic, not probabilistic. Do not ask the LLM to decide how to handle errors at runtime. Define error recovery strategies in configuration, test them, and let the agent follow the defined paths. Use the LLM for reasoning about the task, not for reasoning about error handling.
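In practice that means the failure path is ordinary code that reads the step configuration, not a prompt. A sketch of how the on_failure block above might be interpreted, with run_tool and escalate_to_human as hypothetical stand-ins for your own tool runner and escalation hook:

def handle_step_failure(step: dict, state: AgentState) -> dict:
    policy = step.get("on_failure", {})

    if policy.get("strategy") == "alternative":
        # Try each configured fallback tool in order, up to max_alternatives.
        alternatives = policy.get("alternatives", [])
        limit = policy.get("max_alternatives", len(alternatives))
        for alt_tool in alternatives[:limit]:
            result = run_tool(alt_tool, step, state)    # hypothetical tool runner
            if result["status"] == "success":
                return result

    if policy.get("escalate_after_exhaustion"):
        # Deterministic hand-off: no LLM involved in deciding to escalate.
        return escalate_to_human(step, state)           # hypothetical escalation hook

    return {"status": "failed", "step": step["name"]}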
An autonomous agent making tool calls can burn through API budgets fast. Each “thinking” step is an LLM call. Each tool use might involve another LLM call for parameter extraction. Multi-step reasoning on a complex task can easily hit 20-30 LLM calls, each with its own cost.
Production cost management for agents requires:
Per-task budgets. Set a maximum cost (in dollars, tokens, or API calls) per task. When the budget is exhausted, the agent stops and escalates rather than continuing to burn resources. This catches runaway agents – agents stuck in loops, agents that keep retrying failed operations, agents that generate overly complex plans.
Token tracking. Track input and output tokens separately. Long conversation histories inflate input tokens at every step. Implement context window management that summarizes or truncates history to keep input costs manageable without losing critical context.
Model tiering. Not every agent step needs the most powerful model. Use cheaper, faster models for simple decisions (tool parameter extraction, classification, validation) and reserve expensive models for complex reasoning. This is the same multi-model routing pattern we use in PlanOpticon, applied at the agent level. A small routing sketch follows this list.
Batching where possible. If an agent needs to process 100 items, do not make 100 sequential LLM calls. Batch them where the task allows it. Process items in parallel. Use structured outputs to handle multiple items in a single call where appropriate.
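A minimal sketch of the routing decision; the tier table and model names here are illustrative placeholders for whatever models and step types you actually use:

# Hypothetical tier table: cheap model for mechanical steps, strong model for reasoning.
MODEL_TIERS = {
    "extract_parameters": "small-fast-model",
    "classify_intent":    "small-fast-model",
    "validate_output":    "small-fast-model",
    "plan":               "large-reasoning-model",
    "replan_after_error": "large-reasoning-model",
}

def pick_model(step_type: str) -> str:
    # Default to the expensive model only when the step is not known to be simple.
    return MODEL_TIERS.get(step_type, "large-reasoning-model")

Budget enforcement itself can be a small guard object that every call site consults first: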
class BudgetExceededError(Exception):
    pass

class CallLimitExceededError(Exception):
    pass

class CostGuard:
    def __init__(self, max_cost_usd: float, max_calls: int):
        self.max_cost = max_cost_usd
        self.max_calls = max_calls
        self.total_cost = 0.0
        self.total_calls = 0

    def check(self, estimated_cost: float) -> bool:
        # Called before every LLM or tool invocation
        if self.total_cost + estimated_cost > self.max_cost:
            raise BudgetExceededError(
                f"Task budget ${self.max_cost} would be exceeded. "
                f"Current spend: ${self.total_cost:.4f}"
            )
        if self.total_calls >= self.max_calls:
            raise CallLimitExceededError(
                f"Maximum {self.max_calls} API calls exceeded."
            )
        return True

    def record(self, actual_cost: float):
        self.total_cost += actual_cost
        self.total_calls += 1
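Wired into the agent loop, the guard is checked before each call and updated with the billed cost afterwards. A short usage sketch, where estimate_cost and llm_call are placeholders for your own client code:

guard = CostGuard(max_cost_usd=2.50, max_calls=40)

for step in plan:
    estimated = estimate_cost(step)           # placeholder: rough per-call cost estimate
    guard.check(estimated)                    # raises before the budget is blown
    response, actual_cost = llm_call(step)    # placeholder: model client reporting billed cost
    guard.record(actual_cost)

When the guard raises, the agent persists its state and escalates rather than silently continuing to spend.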
Debugging a traditional application means reading logs and stack traces. Debugging an agent means understanding why the LLM made a particular decision at step 7 of a 15-step plan, given the accumulated context of the previous 6 steps. This is a fundamentally harder problem.
Production agent systems need observability at multiple levels:
Decision traces. Log not just what the agent did, but why. Capture the LLM input (full prompt including context), the LLM output (including reasoning if using chain-of-thought), and the resulting action. This is your primary debugging tool when an agent behaves unexpectedly. A minimal logging sketch follows this list.
Step-level metrics. For each step: duration, cost, number of tool calls, number of retries, success/failure. These metrics tell you where your agent is spending time and money.
Plan-level metrics. For each task: total steps, total cost, total duration, completion rate, escalation rate. These metrics tell you whether your agent is effective at its job.
Anomaly detection. Flag agent runs that are statistical outliers – unusually long, unusually expensive, unusually many retries. These are the runs that reveal bugs, edge cases, and systemic issues.
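A decision trace does not need to be elaborate; one structured record per LLM call is enough to replay a run. A sketch, with illustrative field names and trace_log standing in for whatever log sink you use:

import json
import time
import uuid

def log_decision(state: AgentState, prompt: str, completion: str, action: dict):
    # One structured record per LLM call: enough to answer "why did it do that?"
    record = {
        "trace_id": str(uuid.uuid4()),
        "task_id": state.task_id,
        "step": state.current_step,
        "timestamp": time.time(),
        "prompt": prompt,            # full input, including accumulated context
        "completion": completion,    # raw model output, including any reasoning
        "action": action,            # the tool call or state transition that resulted
    }
    trace_log.write(json.dumps(record, default=str) + "\n")  # trace_log is a stand-in sink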
We have learned that investing in observability early pays for itself many times over. The alternative is manually investigating user complaints by reading raw logs, which does not scale and does not catch problems proactively.
If you have a working agent prototype and you want to take it to production, here is the engineering work ahead of you:
State persistence. Move state from in-memory to durable storage. Implement state machine transitions with explicit logging. Test crash recovery by killing the process mid-execution and resuming.
Tool hardening. Add input validation, output validation, timeouts, retries, and idempotency to every tool. Test each tool with malformed inputs and unavailable backends.
Error recovery. Define and test recovery strategies for every failure mode: tool failures, plan failures, budget exhaustion, model errors. Do not rely on the LLM to improvise error handling.
Cost controls. Implement per-task budgets, token tracking, and cost alerting. Test that cost guards actually stop execution when limits are reached.
Observability. Instrument every LLM call, tool call, and state transition. Build dashboards for agent performance metrics. Set up alerts for anomalies.
Human escalation. Define the escalation paths for when the agent cannot complete a task. Build the interfaces for human reviewers. Test the end-to-end escalation flow.
Load testing. Run multiple agents concurrently. Verify that state isolation works, that database connections scale, that rate limits are managed across all concurrent agents.
Security review. Audit what tools the agent can access and what data it can read. Implement least-privilege access. Verify that prompt injection attacks cannot cause the agent to call tools it should not.
This list is not exhaustive, but it covers the gaps that cause the most production failures. The work is substantial, which is why most agent demos never become agent products.
AI agents will become a standard part of production systems. The models are improving, the tooling is maturing, and the patterns are becoming well-understood. But the gap between demo and production is real and it is engineering, not magic.
The teams that will succeed are the ones that treat agent development as a systems engineering discipline: rigorous state management, defensive tool integration, explicit error handling, strict cost controls, and comprehensive observability. The LLM is the reasoning engine. Everything around it is engineering.
We build these systems because we have seen what happens when the engineering is right – agents that handle thousands of tasks daily with predictable cost and quality. And we have seen what happens when it is wrong – expensive, unreliable systems that erode trust in AI.
The difference is not the model. It is the engineering around the model.