Why Your AI Agent Passes All Tests and Fails at 2am

Every AI demo works.

The video is perfect. The investors nod. The Product Hunt launch goes well. Then you ship it, and three weeks later you get a Slack message at 2am: the agent is stuck in a loop, burning tokens, and your largest customer's workflow has been broken for six hours.

This is not a bug. It's a structural problem. And I've seen it happen to almost every team that ships LLM-based systems for the first time.

Here's what's actually going on.

Your tests can't predict a language model

Unit tests verify that your code does what you think it does. But when your system calls an LLM, you're not calling deterministic code; you're sampling from a probability distribution. The same input can produce different outputs across runs. Edge cases emerge from combinations of tokens you never tested. The model gets updated upstream and starts behaving slightly differently in ways your existing tests would not catch.

The teams that survive this build LLM-judge evaluation frameworks: automated pipelines that run real queries through the system and have a second model grade the quality of the output. It's not perfect, because the judge is itself a model and can be wrong, but it shifts your testing from "does this code path run?" to "does this system produce the right result?"
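Here is a minimal sketch of what such a pipeline can look like. Everything in it is an assumption for illustration: `agent` and `judge` are plain callables you would wire to your own system and to a second model, and the names `EvalCase`, `JUDGE_PROMPT`, `parse_score`, and `run_eval`, the 1 to 5 scale, and the pass threshold are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    query: str   # a real query taken from production traffic
    rubric: str  # what a good answer must contain or do

# Hypothetical grading prompt; tune the wording for your own judge model.
JUDGE_PROMPT = (
    "You are grading an AI agent's answer.\n"
    "Query: {query}\nRubric: {rubric}\nAnswer: {answer}\n"
    "Reply with a single integer score from 1 (bad) to 5 (good)."
)

def parse_score(raw: str) -> int:
    """Return the first standalone digit 1-5 in the judge's reply, or 0 if none."""
    match = re.search(r"\b([1-5])\b", raw)
    return int(match.group(1)) if match else 0

def run_eval(
    cases: List[EvalCase],
    agent: Callable[[str], str],
    judge: Callable[[str], str],
    pass_score: int = 4,
) -> float:
    """Run every case through the agent, grade it with the judge, return the pass rate."""
    passed = 0
    for case in cases:
        answer = agent(case.query)
        raw = judge(JUDGE_PROMPT.format(
            query=case.query, rubric=case.rubric, answer=answer))
        # An unparseable judge reply scores 0 and counts as a failure.
        if parse_score(raw) >= pass_score:
            passed += 1
    return passed / len(cases) if cases else 0.0
```

Run something like this on every prompt change and on a schedule, and track the pass rate over time. A drop with no code change on your side is a hint that the upstream model moved.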

Retry logic is not optional

Your agent calls three external services: the LLM, your vector database, and your application API. Each of those will occasionally return a 429, a 503, or just time out.

If your agent doesn't have explicit retry logic with exponential backoff, a single transient failure breaks the entire workflow. And because agents are stateful — they're in the middle of something — a broken workflow at 2am means a human has to manually untangle whatever state got corrupted.

Every external call in your agent needs:

Retry with backoff (3 attempts minimum)
Timeout enforcement
Circuit breaking for sustained failures
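The three requirements above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a drop-in library: `TransientError` is a hypothetical exception your client wrappers would raise for 429s, 503s, and timeouts, `fn` is assumed to accept a timeout in seconds and enforce it itself (for example by passing it to its HTTP client), and the thresholds are placeholder values.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for retryable failures (429, 503, timeout)."""

class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, allow trial calls through again.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

def call_with_retry(fn, breaker, attempts=3, base_delay=0.5,
                    timeout=10.0, sleep=time.sleep):
    """Call fn(timeout) with retries, exponential backoff, and circuit breaking."""
    for attempt in range(attempts):
        if not breaker.allow():
            raise CircuitOpenError("dependency is failing; skipping call")
        try:
            result = fn(timeout)
        except TransientError:
            breaker.record_failure()
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff with full jitter: 0..base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
        else:
            breaker.record_success()
            return result
```

Use one breaker per dependency (LLM, vector database, application API) so that a sustained outage in one does not block calls to the others. Only retry operations that are safe to repeat; a non-idempotent write needs an idempotency key or its own handling.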

You don't have an audit trail

When something goes wrong at 2am, the first question is: what did the agent actually do?

Most teams can't answer this. The agent ran, it failed, and now there's no record of what prompts it sent, what it received, or which decision branches it took.

Prompt versioning is the foundation of this. Every prompt that goes to the LLM should be tagged with a version identifier that gets logged alongside the request and response. When you change a prompt, that change is tracked. When the agent fails, you can replay the exact sequence that led to the failure.
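As a rough sketch of the idea, assuming prompts live in an in-code registry keyed by name and version: the registry contents, the record fields, and the function names below are all hypothetical, `llm` stands in for your real model client, and `emit` is whatever writes a line to your log store.

```python
import hashlib
import json
import time
import uuid
from typing import Callable, Dict

# Hypothetical prompt registry: (name, version) -> template.
# Changing a prompt means adding a new version, never editing an old one.
PROMPTS = {
    ("summarize_ticket", "v1"): "Summarize this ticket:\n{ticket}",
    ("summarize_ticket", "v2"): "Summarize this support ticket in two sentences:\n{ticket}",
}

def new_run_id() -> str:
    """One id per agent run, so all of its steps can be pulled up together."""
    return uuid.uuid4().hex

def template_hash(name: str, version: str) -> str:
    """Short content hash; catches a template edited without a version bump."""
    return hashlib.sha256(PROMPTS[(name, version)].encode("utf-8")).hexdigest()[:12]

def logged_llm_call(
    llm: Callable[[str], str],
    emit: Callable[[str], None],
    run_id: str,
    step: int,
    name: str,
    version: str,
    variables: Dict[str, str],
) -> str:
    """Render a versioned prompt, call the model, and log one JSON record."""
    prompt = PROMPTS[(name, version)].format(**variables)
    record = {
        "run_id": run_id,
        "step": step,
        "ts": time.time(),
        "prompt_name": name,
        "prompt_version": version,
        "template_sha": template_hash(name, version),
        "prompt": prompt,
        "response": None,
        "error": None,
    }
    try:
        record["response"] = llm(prompt)
        return record["response"]
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        # Log on success and on failure; the failed calls are the ones you need at 2am.
        emit(json.dumps(record))
```

Filtering the log by `run_id` and sorting by `step` gives you the sequence of prompts and responses for a failed run, which is what you need in order to replay it.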

This sounds like operational overhead. It's actually the only way to debug LLM systems at scale.

Token costs are not fixed

Your agent costs $0.02 per query in development. You ship it. Usage scales. Three months later you're spending $1,200/month you didn't plan for, and you don't know why.

Token cost in production is a function of conversation length, model routing, and how often your agent triggers tool calls. None of these are fixed. Without instrumentation on token usage per request, you can't see the cost growing until it's already a problem.

The fix is straightforward: log input and output token counts on every request, group them by workflow type, and set budget alerts at 20% over each workflow's baseline. It takes about a day to instrument, and it surfaces cost growth while it is still cheap to fix.
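A minimal in-memory version of that instrumentation might look like the following. It assumes the baseline is an average token count per request for each workflow, which is one reasonable reading of "baseline" but not the only one; `TokenTracker` and the workflow names are hypothetical, and in production you would send these numbers to your metrics system rather than hold them in a dict.

```python
from collections import defaultdict
from typing import Dict, List

class TokenTracker:
    def __init__(self, baseline: Dict[str, float], alert_ratio: float = 1.20):
        # baseline: expected average tokens per request, keyed by workflow type
        self.baseline = baseline
        self.alert_ratio = alert_ratio  # 1.20 means alert at 20% over baseline
        self.totals = defaultdict(lambda: {"requests": 0, "input": 0, "output": 0})

    def record(self, workflow: str, input_tokens: int, output_tokens: int) -> None:
        """Call once per LLM request with the token counts the provider reports."""
        t = self.totals[workflow]
        t["requests"] += 1
        t["input"] += input_tokens
        t["output"] += output_tokens

    def average(self, workflow: str) -> float:
        """Average total tokens (input + output) per request for one workflow."""
        t = self.totals[workflow]
        if t["requests"] == 0:
            return 0.0
        return (t["input"] + t["output"]) / t["requests"]

    def over_budget(self) -> List[str]:
        """Workflows whose average exceeds baseline * alert_ratio, sorted by name."""
        return sorted(
            w for w, base in self.baseline.items()
            if self.average(w) > base * self.alert_ratio
        )

tracker = TokenTracker({"triage": 1000, "summary": 500})
tracker.record("triage", input_tokens=900, output_tokens=200)
tracker.record("triage", input_tokens=1000, output_tokens=400)
tracker.record("summary", input_tokens=400, output_tokens=150)
print(tracker.over_budget())  # prints ['triage']
```

Keeping input and output counts separate matters because most providers price them differently, and because growth in input tokens usually points at conversation length while growth in output tokens points at the prompts or the model.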

What to actually do

If you're building an AI agent that will go to production, these are not optional:

LLM-judge evaluation — don't just test code paths, evaluate outputs
Retry + timeout on every external call — including the LLM itself
Prompt versioning — every prompt tagged and logged with requests
Token instrumentation — cost per request, grouped by workflow type
Alerting on anomalies — token spikes, failure rates, latency p99

None of this is glamorous. It's the operational foundation that makes the difference between an agent that works in the demo and one that works at 2am on a Tuesday.

If you're building something with LLMs and want a second opinion on the architecture before you ship — book a 30-minute call. No pitch, just an honest technical conversation.