Why AI Agents Fail in Production (And What to Do About It)

I’ve spent the last few months building agentic systems — LLM-powered workflows that take a goal, break it into steps, call tools, and try to complete tasks autonomously. Some of them work well. A lot of them failed in ways I didn’t expect.

The hype around agents right now is enormous. Every framework promises you can “just describe what you want” and an agent will figure out the rest. In demos, this looks incredible. In production, it falls apart in ways that are deeply boring and deeply frustrating.

This article is about the failure modes I’ve actually encountered, why they happen, and the architectural decisions that made the difference between a system I could trust and one I couldn’t.

The fundamental problem with agents

A traditional software system is, for the most part, deterministic. Given the same input and the same state, you get the same output. You can test it, profile it, reason about its failure modes, and set up alerts when something goes wrong.

An agent is not like this. It’s a control loop with an LLM at the center. The LLM decides what to do next based on context that accumulates over multiple steps. Every decision introduces variance. Errors don’t just fail — they compound. A wrong tool call in step 3 corrupts the context for step 4, which produces a bad result in step 5, and by the time the agent finishes you have output that is wrong in a way that’s hard to trace back to its source.

This is the core engineering problem. You’re building a system where the control flow is partially non-deterministic, errors propagate silently, and the failure mode is “wrong answer” rather than “exception thrown.”
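To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which real agents won’t satisfy exactly, and the 95% figure is purely illustrative rather than something I measured.

```python
# Illustrative only: assumes each step succeeds independently with the same
# probability. Real agent steps are correlated, so treat this as intuition.
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every one of `steps` consecutive steps succeeds."""
    return per_step_success ** steps


for n in (5, 10, 20):
    print(n, round(run_success_probability(0.95, n), 2))
```

Under those assumptions, a 20-step run at 95% per step finishes without a single bad step only about 36% of the time.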

Failure mode 1: The agent runs forever

This one hits you the first time and you never forget it.

An agent gets stuck in a loop. It calls a tool, the tool returns an unexpected result, the agent interprets that result incorrectly, decides to call a different tool to resolve the confusion, that call also returns something unexpected, and now it’s going in circles. Each loop costs tokens. Left unchecked, it runs until you hit your API spending limit or someone notices the CPU has been pegged for six hours.

The fix is hard limits, not soft ones. Set a maximum number of steps. Set a maximum execution time. Set a maximum token budget. When any limit is hit, the agent stops and returns whatever partial result it has along with a clear indication that it hit a limit. The agent can still signal that it’s finished, but don’t let that be the only exit: impose a ceiling from outside.

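import time
from dataclasses import dataclass

MAX_STEPS = 25
MAX_TOKENS = 50_000
MAX_SECONDS = 120

@dataclass
class AgentResult:
    status: str  # "ok", or the name of the limit that stopped the run
    output: str

class AgentRunner:
    # Subclasses supply step() (one LLM/tool iteration) and partial_output().
    def run(self, goal: str) -> AgentResult:
        steps = 0
        tokens_used = 0
        start = time.monotonic()  # monotonic clock: immune to wall-clock adjustments

        while True:
            # Limits are checked between steps, so a single step can still overrun.
            if steps >= MAX_STEPS:
                return AgentResult(status="limit_steps", output=self.partial_output())
            if tokens_used >= MAX_TOKENS:
                return AgentResult(status="limit_tokens", output=self.partial_output())
            if time.monotonic() - start > MAX_SECONDS:
                return AgentResult(status="limit_time", output=self.partial_output())

            result = self.step()
            steps += 1
            tokens_used += result.tokens

            if result.done:
                return AgentResult(status="ok", output=result.output)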

Simple. Boring. Absolutely necessary.

Failure mode 2: Tools that can’t fail safely

Agents use tools — functions they can call to interact with the world. Read a file. Search the web. Call an API. Write to a database.

The problem is that tool calls are side effects, and the agent has no inherent understanding of which side effects are reversible. If you give an agent a tool to send an email and it calls that tool three times because it wasn’t sure the first call succeeded, you’ve sent three emails. If you give it a tool to delete a record and the agent misidentifies which record to delete, that record is gone.

The principle I now apply to every tool I give an agent: read operations are free, write operations need a confirmation step, and destructive operations don’t exist in the agent’s toolset.

For destructive or high-consequence writes, the agent doesn’t execute them — it proposes them. The actual execution happens in a separate layer that a human (or a separate validation system) approves. The agent’s job is to figure out what should happen, not to make it happen unilaterally.

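from datetime import datetime, timezone

# Proposals wait here until a human or a validation system reviews them.
pending_actions: list[dict] = []

# The agent can call this
def propose_deletion(record_id: str, reason: str) -> str:
    pending_actions.append({
        "type": "delete",
        "record_id": record_id,
        "reason": reason,
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "proposed_at": datetime.now(timezone.utc).isoformat(),
    })
    return f"Deletion of {record_id} queued for review. Reason: {reason}"

# This never goes in the agent's toolset.
# `db` is the application's database handle, defined elsewhere.
def execute_deletion(record_id: str) -> None:
    db.delete(record_id)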

It’s more friction. It’s also the only way I’ve found to sleep soundly when an agent is running against real data.
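For the repeated-email case specifically, one option is to make the write tool idempotent, so that a retry becomes a no-op. This is a minimal sketch, not a full solution: `send_email_once`, `idempotency_key`, the in-memory key set, and the `outbox` list standing in for a real mail client are all hypothetical names I’m using for illustration.

```python
import hashlib

# Hypothetical in-memory store. A real system would need something durable,
# for example a database table with a unique constraint on the key.
_sent_keys: set[str] = set()
outbox: list[dict] = []  # stand-in for a real mail client


def idempotency_key(to: str, subject: str, body: str) -> str:
    """Derive a stable key from the content of the message."""
    payload = "\x00".join((to, subject, body)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def send_email_once(to: str, subject: str, body: str) -> str:
    """Tool exposed to the agent: sends at most once per distinct message."""
    key = idempotency_key(to, subject, body)
    if key in _sent_keys:
        return "Already sent; no action taken."
    _sent_keys.add(key)
    outbox.append({"to": to, "subject": subject, "body": body})
    return "Email sent."
```

A content hash only catches exact repeats. If the agent rewords the message on a retry, the key changes, so keying on something like a task or step identifier may be a better fit depending on the workflow.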

Failure mode 3: Context poisoning

An agent’s context window accumulates over the course of a run. Early steps, tool outputs, reasoning traces: it all piles up. And LLMs are not neutral consumers of context. They tend to weight some positions in the window more heavily than others (often the most recent tokens), they can be confused by contradictory information in the same window, and a badly formatted tool output from step 2 can subtly distort every decision made after it.

I call this context poisoning. The agent isn’t making bad decisions because the model is bad. It’s making bad decisions because the information it’s reasoning over has become noisy or contradictory.

The practical mitigations:

Summarize rather than accumulate. After every N steps, have the agent write a brief structured summary of what it has established so far. Replace the raw step history with the summary. This compresses context and forces the model to commit to a coherent understanding of what it knows.

Validate tool output before it enters context. Don’t pass raw API responses straight into the context window. Parse them, extract the relevant fields, and pass a clean, structured representation. A 4,000-token JSON blob containing three useful fields and 200 irrelevant ones is poisoning your context.

Be explicit about uncertainty. When a tool returns an ambiguous result, your prompt should instruct the agent to flag uncertainty explicitly rather than proceeding. “I’m not sure whether X or Y is the case” is better context than the agent guessing and propagating that guess downstream.
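The first two mitigations can be sketched roughly as follows. Both function names are hypothetical, `summarize` stands in for whatever LLM call produces the summary, and the sketch assumes the tool returns a JSON object at the top level.

```python
import json
from typing import Callable


def clean_tool_output(raw_json: str, wanted: tuple[str, ...]) -> str:
    """Keep only the fields the agent needs from a raw JSON response."""
    data = json.loads(raw_json)  # assumes a JSON object at the top level
    slim = {key: data[key] for key in wanted if key in data}
    missing = [key for key in wanted if key not in data]
    if missing:
        # Surface gaps explicitly instead of leaving the model to guess.
        slim["_missing_fields"] = missing
    return json.dumps(slim, sort_keys=True)


def compact_history(
    history: list[str],
    summarize: Callable[[list[str]], str],
    every_n: int = 5,
) -> list[str]:
    """Swap the raw step history for one summary once it reaches every_n entries."""
    if len(history) < every_n:
        return history
    return ["SUMMARY: " + summarize(history)]
```

The `_missing_fields` entry ties into the third mitigation: an absent field is reported to the model as absent, which gives it something concrete to flag rather than fill in.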

Failure mode 4: The agent that doesn’t know what it doesn’t know

This is the failure mode that produces the most confidently wrong output.

When an agent hits a situation where it lacks the information to proceed correctly, it has a few options: ask for clarification, flag the gap, or make its best guess and continue. Without explicit instruction, LLMs tend toward the third option. They’re trained to be helpful and complete, and “I don’t know” is not a satisfying completion.

The result is an agent that produces detailed, well-formatted, completely wrong output because it encountered an unknown in step 4 and quietly filled it in with a plausible guess.

You can push back against this at the prompt level. Being explicit in your system prompt that the agent should stop and surface unknowns rather than infer through them does help. But the more reliable fix is structural: build checkpoints into your workflow where the agent must produce a structured assessment of what it knows, what it doesn’t know, and what assumptions it’s making. If that assessment shows gaps, the workflow pauses rather than continuing.

CHECKPOINT_PROMPT = """
Before proceeding, produce a structured assessment:

ESTABLISHED: What facts have been confirmed by tool results?
ASSUMED: What are you treating as true without confirmation?
UNKNOWN: What information would you need that you don't have?

If ASSUMED or UNKNOWN contains anything critical to the task,
stop here and report those gaps rather than continuing.
"""

It adds latency. It also saves you from shipping confidently wrong output to users.
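The pause itself has to be enforced in code, not left to the model. Here is a rough sketch of that gate, assuming the model answers with the three labels at the start of a line. `parse_assessment` and `should_pause` are hypothetical names, and the sketch is stricter than the prompt: it pauses on any reported assumption or unknown instead of judging which ones are critical.

```python
import re

SECTIONS = ("ESTABLISHED", "ASSUMED", "UNKNOWN")
_LABEL = re.compile(r"^(ESTABLISHED|ASSUMED|UNKNOWN):", re.MULTILINE)
_EMPTY = {"", "none", "nothing", "n/a", "-"}  # answers treated as "no gaps"


def parse_assessment(text: str) -> dict[str, str]:
    """Split a checkpoint response into the labelled sections it contains."""
    matches = list(_LABEL.finditer(text))
    parsed: dict[str, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parsed[match.group(1)] = text[match.end():end].strip()
    return parsed


def should_pause(text: str) -> bool:
    """Pause when a section is missing or any assumption/unknown is reported."""
    parsed = parse_assessment(text)
    if any(name not in parsed for name in SECTIONS):
        return True  # fail closed: the model did not follow the format
    return any(
        parsed[name].lower().rstrip(".") not in _EMPTY
        for name in ("ASSUMED", "UNKNOWN")
    )
```

Failing closed on a malformed response is a deliberate choice here: a model that ignores the checkpoint format is exactly the case where continuing is riskiest.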

Failure mode 5: No observability

This one is operational rather than architectural, but it compounds everything else.

When a traditional service behaves unexpectedly, you look at logs, traces, and metrics. You can see what happened at each step and understand why.

When an agent behaves unexpectedly, what do you look at? If you haven’t built for it, the answer is nothing. You have an input, an output, and no visibility into the 23 steps and 14 tool calls in between.

At minimum, log every step: the model’s reasoning, the tool called, the arguments, the result, the tokens used, and the time taken. Store these structured logs somewhere you can query them. When an agent produces a bad result, you should be able to reconstruct exactly what happened and at which step it went wrong.
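A minimal sketch of what one such log entry could look like, written as JSON lines so it can be loaded into whatever store you query from. The `StepRecord` fields mirror the list above; the names, the `run_id` field, and the example values are my own assumptions, not any particular framework’s schema.

```python
import io
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, TextIO


@dataclass
class StepRecord:
    run_id: str
    step: int
    reasoning: str
    tool: str
    arguments: dict[str, Any]
    result: str
    tokens: int
    seconds: float


def log_step(record: StepRecord, sink: TextIO) -> None:
    """Append one step as a single JSON line so runs can be queried later."""
    entry = asdict(record)
    entry["logged_at"] = time.time()  # Unix timestamp of when the line was written
    # default=str keeps logging from failing on non-JSON values in arguments.
    sink.write(json.dumps(entry, default=str) + "\n")


# Example: write to an in-memory buffer; in practice the sink would be a file
# or a log shipper.
buffer = io.StringIO()
log_step(
    StepRecord("run-1", 3, "Need the order status", "get_order",
               {"id": 42}, "status=open", tokens=118, seconds=0.4),
    buffer,
)
```

Keeping `run_id` and `step` on every line is what makes it possible to pull a single run back out and replay it in order.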

Better than logs alone: traces. Something like Langfuse, Weights & Biases, or a custom tracing layer that lets you visualize the full execution tree. Being able to see an agent run as a graph — each node a step, edges showing what informed what — makes debugging dramatically faster.

The overhead of good observability is real. The overhead of debugging a production agent failure without it is catastrophic.

What actually works

After all of this: what does a reliable agent look like in practice?

It has hard resource limits applied externally. Its tools are designed so that reads are always safe, writes are cautious, and destruction is handled outside the agent’s control. It summarizes rather than accumulates context. It has structured checkpoints where it must account for what it knows and doesn’t know. Every step is logged and traceable.

It’s also scoped narrowly. The most reliable agents I’ve built are not general-purpose reasoners. They do one specific class of task, with a well-defined set of tools, against a well-understood set of inputs. The broader the range of tasks you ask an agent to handle, the less you can trust it on any one of them.

The best framing I’ve found: think of an agent not as a replacement for a human but as a very fast, very tireless junior who needs a clearly defined process, explicit guardrails, and a supervisor who reviews consequential decisions. Give it that structure and it’s remarkably useful. Give it open-ended autonomy over your production systems and you’ll find out about every assumption you forgot to validate.

Agents are not magic. They’re software with an LLM in the control loop, and they need to be engineered accordingly. The teams shipping reliable agentic systems aren’t treating the LLM as infallible — they’re treating it as a powerful but fallible component and building everything around that reality.

That reframe is, in my experience, the difference between a demo and a product.