I Let AI Agents Open Pull Requests Against Production Code

Six months ago I wrote that AI agents break at 2am.

This year I shipped one that fixes the bug before anyone wakes up — and opens the pull request itself.

I want to be precise about what that sentence means, because "AI writes code" has become marketing noise. I'm not talking about autocomplete. I'm not talking about a chatbot that suggests a fix in a side panel. I'm talking about an agent that watches production, catches a real error, works out the cause, and submits a pull request that a human reviews and merges.

For one of the teams I architect for, the number of pull requests opened this way became their single most-watched engineering metric. Here's how that happened — and the guardrails that made it something I'd actually put my name on.

A production error is the perfect trigger

Most "agentic coding" demos start from a prompt. Someone types "build me a login page" and the agent goes. That's the wrong starting point for production, because the hard part was never generating code — it's knowing what to change and why.

A production error already answers both.

When an exception fires in production, you get a stack trace, the exact line, the input that broke it, the frequency, and the blast radius. That's a far richer and far more bounded brief than anything a human types into a prompt box. The scope is small. The intent is unambiguous. The success criterion is concrete: the error stops.

So the trigger isn't a person asking for a feature. It's the system raising its own hand. An error lands in your monitoring, and that becomes the work order.

The pipeline, in one breath

The flow is mechanical once you see it:

An error is captured in production with full context.
An agent triages it — is this real, is it new, is it worth fixing now?
A second pass does root-cause analysis against the actual codebase.
The agent drafts the fix, writes a test, and opens a pull request.
A human reviews and merges. Always.

Every one of those steps is a place it can go wrong. Which is the entire point of the next section.

The guardrails are the product

Code generation was never the hard part. Models are good at writing a five-line fix for a clear bug. The engineering — the part that took the real work — was making it safe to let it run unattended.

A triage gate before anything else. Not every error deserves a fix. Some are noise, some duplicate an open ticket, some are a third-party outage you can't fix in your own repo. The agent has to decide not to act far more often than it acts. Getting the "do nothing" path right is what keeps the pipeline from burying the team in low-value PRs.

Deduplication against existing work. The same error fires a thousand times. If you open a thousand pull requests, you've built a denial-of-service attack on your own reviewers. One error class, one PR.

A human-filed ticket always wins. When a person has already triaged something, the agent defers to that judgment instead of overriding it. The machine's opinion is an input, not a verdict.

Evaluation before merge — not just tests. This is where the LLM-judge work I wrote about earlier comes back. A passing test proves the code runs; it doesn't prove the fix is right, or that it's the change a senior engineer would have made. So a separate evaluation pass scores the diff against the original error and the codebase's conventions before it's ever offered to a human. Most of what gets generated never reaches a PR.

The merge button stays human. No agent merges to production. The whole system is built to produce a reviewable artifact, not to close the loop autonomously. That one rule is what makes everything upstream of it acceptable to ship.

What actually happened

Honest version: the first weeks produced a lot of confidently wrong fixes. The agent would address the symptom in the stack trace and miss the cause one call up. It would "fix" an error by quietly swallowing it. It would invent a requirement nobody asked for — the same failure mode I'd seen in LLM judges.

Each of those was a guardrail I didn't have yet. The triage gate, the dedup, the eval pass, the convention-checking — none of them existed in version one. They exist because version one embarrassed me.

What it looks like now: a meaningful share of routine production bugs arrive each morning as drafted, tested, reviewable pull requests. Engineers spend their attention reviewing and merging instead of triaging and context-switching at 2am. And because "pull requests opened by the agent" is a number you can watch climb, it became the metric the team rallied around — not as vanity, but as a direct proxy for how much reliability-draining busywork came off humans that week.

What to actually do

If you want to build something like this — and the bar is lower than it looks — these are the non-negotiables:

Trigger from real signal, not prompts. A production error is a better brief than anything a human types.
Spend your engineering on the "don't act" path. Triage and dedup matter more than generation.
Evaluate the diff, don't just run the tests. A green suite isn't a correct fix.
Defer to humans who've already decided. The agent's judgment is an input.
Keep the merge human. Produce an artifact a person approves. Always.

None of this is the agent writing your software for you. It's the agent doing the part that never gets done well at 3am — and handing a human the cleanest possible decision.

If you're building with AI agents and trying to work out where they actually belong in your stack — book a 30-minute call. No pitch, just an honest technical conversation.