An AI Agent Deleted Our CI System While I Was on the Runway at Heathrow
Jun 10, 2026
I gave an AI agent kubectl access to production. It deleted our CI. Now agents open PRs instead — and the first one just fixed a volume alert at 4am.
I gave Claude Code kubectl access to our production Kubernetes cluster while sitting on the runway at Heathrow, waiting to take off for San Francisco. By the time I was at 10,000 feet and the unbearably slow satellite internet kicked in, the agent had deleted our CI system.
I spent half the flight rebuilding it — asking colleagues to WhatsApp me secrets because I couldn't load web pages over United's glacial satellite wifi. (Seriously, United, get Starlink.)
The agent wasn't malicious. It was trying to help. It just had the permissions to act on a bad judgement call, and no human in the loop to catch it. Speed is the whole point of AI agents, but speed pointed directly at production, with no review step, is how you end up rebuilding Drone CI at 35,000 feet over Greenland.
So we tried something different. Give the agent git access instead of kubectl. Make it open a PR. Let a human merge it. Let Flux do the rest.
Last week, at 4am, an alert fired and we put it to the test.
The alert
KubePersistentVolumeFillingUp
PersistentVolume claimed by drone in namespace drone
is expected to fill up within four days.
Currently 8.527% is available.
Standard stuff. The Drone CI volume was provisioned at 8Gi years ago and was finally running out of room. The fix is one line in a HelmRelease:
# before
size: 8Gi
# after
size: 20Gi
What happened
Instead of making that change by hand, I had configured Helix Org to hire a DevOps agent into a role where it receives webhooks from Alertmanager and has access to our infra GitOps repo. It found the drone PVC config in a Flux HelmRelease, read the Longhorn docs to confirm online expansion would work (block device resize, then automatic ext4 resize2fs, no downtime), and opened a PR. One line changed.
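To make the webhook step concrete, here is a minimal sketch of how a receiver might pull the relevant fields out of an Alertmanager webhook body. The top-level `alerts` list with per-alert `status`, `labels` and `annotations` follows Alertmanager's webhook format; the `persistentvolumeclaim` label is an assumption about how the alert rule is labelled, and `extract_pvc_alerts` and the sample payload are hypothetical, not Helix Org's actual code.

```python
import json


def extract_pvc_alerts(payload: str) -> list[dict]:
    """Return firing KubePersistentVolumeFillingUp alerts as small dicts."""
    body = json.loads(payload)
    found = []
    for alert in body.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue  # ignore resolved notifications
        if labels.get("alertname") != "KubePersistentVolumeFillingUp":
            continue  # this sketch only handles the volume alert
        found.append({
            "namespace": labels.get("namespace"),
            # Assumed label name; check what your alert rule actually emits.
            "pvc": labels.get("persistentvolumeclaim"),
            "summary": alert.get("annotations", {}).get("summary", ""),
        })
    return found


# Hypothetical payload, trimmed to the fields the sketch reads.
SAMPLE_PAYLOAD = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePersistentVolumeFillingUp",
                "namespace": "drone",
                "persistentvolumeclaim": "drone",
            },
            "annotations": {"summary": "PersistentVolume is filling up."},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "CPUThrottlingHigh", "namespace": "drone"},
            "annotations": {},
        },
    ],
})

if __name__ == "__main__":
    print(extract_pvc_alerts(SAMPLE_PAYLOAD))
```

The output of a step like this is what the agent would use to search the GitOps repo: a namespace and a claim name, nothing that requires cluster credentials.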
It also flagged something I wouldn't have thought about at 4am: some Helm charts template PVCs as immutable resources, so Helm upgrade might not actually patch the size. The agent included a manual kubectl patch fallback in the PR description. Just in case.
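The exact fallback from the PR description isn't reproduced in this post, so the following is only a guess at its shape: a merge patch on the PVC's `spec.resources.requests.storage`. `build_pvc_patch_command` is a hypothetical helper that builds the command for a human to run; it executes nothing.

```python
import json


def build_pvc_patch_command(namespace: str, name: str, size: str) -> list[str]:
    """Build (but do not run) a kubectl merge patch that raises a PVC's requested size."""
    patch = {"spec": {"resources": {"requests": {"storage": size}}}}
    return [
        "kubectl", "patch", "pvc", name,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Prints the argv a reviewer could copy into a shell if Helm did not patch the PVC.
    print(build_pvc_patch_command("drone", "drone", "20Gi"))
```

The point of putting this in the PR description rather than running it is that the human decides whether the fallback is needed at all.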
The PR: helixml/infra#18.
I merged it. Here's the cluster a few minutes later:
$ kubectl get helmrelease -n drone
NAME    AGE   READY   STATUS
drone   73d   True    Helm upgrade succeeded for release drone/drone.v2 with chart drone@0.6.5
$ kubectl get pvc -n drone
NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS
drone   Bound    pvc-7631ee32-a41b-4c1c-9afd-e0f85fd1cfd8   20Gi       RWO            longhorn
$ kubectl exec -n drone deploy/drone -- df -h /data
Filesystem              Size    Used   Available   Use%   Mounted on
/dev/longhorn/pvc-...   19.6G   7.2G   12.3G       37%    /data
Flux reconciled the HelmRelease. Helm patched the PVC. Longhorn expanded the block device and ran the filesystem resize. All online, no pod restart, no downtime. 91.5% full → 37%.
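For anyone checking the arithmetic, the two percentages come from the alert text (8.527% available) and the df output (7.2G used of 19.6G). A tiny sketch, using only those numbers:

```python
def percent_used(used: float, size: float) -> float:
    """Percentage of a volume in use."""
    return 100.0 * used / size


# Before: the alert reported 8.527% available.
before = 100.0 - 8.527
# After: df reported 7.2G used on a 19.6G filesystem.
after = percent_used(7.2, 19.6)

if __name__ == "__main__":
    print(f"{before:.1f}% -> {after:.0f}%")
```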
Not a soul touched the cluster.
Why kubectl was always going to end in tears
The Heathrow incident wasn't a fluke. It was the predictable outcome of a bad permission model.
kubectl is a loaded gun. Unless you have scoped its credential down with tight RBAC, an agent that can kubectl patch can also kubectl delete: "resize a volume" and "delete a namespace" sit behind the same credential, with no granularity between them. And unlike a human engineer who might hesitate before running a destructive command, the agent has no anxiety. It executes confidently and moves on.
When an agent runs kubectl directly, the change happens outside your GitOps flow. Your repo no longer matches reality. That's drift, and drift is how clusters quietly rot. And when something goes wrong — as it will — there's nothing to revert. No commit. No diff. Just whatever someone can reconstruct from memory. I was at 10,000 feet. Nobody reconstructed anything.
The whole point of GitOps is that changes go through a pipeline: code review, merge, reconciliation. Giving an agent kubectl bypasses all of it. You're trusting the model to get it right every time, with no checkpoint. That's a bet I lost over Heathrow.
Git as the interface
So now the agent gets git access, not cluster access. It can read the repo, create a branch, open a PR. It cannot push to main, it cannot run commands on the cluster, it cannot bypass review.
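Stated as code, the rule looks roughly like this. In practice it would be enforced by branch protection on the git host rather than by anything the agent runs; the `refs/heads/agent/*` naming convention and the `agent_may_push` helper are hypothetical, used only to spell the rule out.

```python
from fnmatch import fnmatchcase

# Hypothetical convention: the agent only ever pushes to its own branch prefix.
AGENT_BRANCH_PATTERN = "refs/heads/agent/*"
# Refs the agent must never write to directly.
PROTECTED_REFS = {"refs/heads/main"}


def agent_may_push(ref: str) -> bool:
    """Allow pushes to agent-owned branches only; never to protected refs."""
    if ref in PROTECTED_REFS:
        return False
    return fnmatchcase(ref, AGENT_BRANCH_PATTERN)


if __name__ == "__main__":
    for ref in ("refs/heads/agent/resize-drone-pvc", "refs/heads/main"):
        print(ref, agent_may_push(ref))
```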
It's the same permission model as a new hire on their first week. You'd let them open a PR to bump a volume size. You would not give them production credentials and say "go fix it."
The flow is:
Alert fires
→ Agent searches the GitOps repo
→ Agent researches the fix
→ Agent opens a PR
→ Human reviews and merges
→ Flux/ArgoCD reconciles
→ Alert resolves
The agent never touches the cluster. The human keeps merge authority. Everything is in git — auditable, reviewable, revertable.
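The "agent opens a PR" step boils down to a text edit and a diff. Here is a minimal sketch of that edit, assuming the size lives on a `size:` line under `persistentVolume` in the HelmRelease values; the file path, the sample YAML and the `bump_size` and `one_line_diff` helpers are all hypothetical.

```python
import difflib
import re


def bump_size(text: str, old: str, new: str) -> str:
    """Replace the first `size: <old>` line with `size: <new>`, keeping indentation."""
    pattern = re.compile(
        rf"^([ \t]*size:[ \t]*){re.escape(old)}[ \t]*$", re.MULTILINE
    )
    updated, count = pattern.subn(lambda m: m.group(1) + new, text, count=1)
    if count != 1:
        raise ValueError(f"no 'size: {old}' line found")
    return updated


def one_line_diff(before: str, after: str, path: str) -> str:
    """Unified diff of the change, as a reviewer would see it in the PR."""
    return "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    ))


# Hypothetical fragment of HelmRelease values.
SAMPLE = "values:\n  persistentVolume:\n    size: 8Gi\n"

if __name__ == "__main__":
    updated = bump_size(SAMPLE, "8Gi", "20Gi")
    print(one_line_diff(SAMPLE, updated, "drone/helmrelease.yaml"))
```

Editing the text rather than round-tripping through a YAML parser keeps comments and formatting intact, so the diff stays one line.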
Where this goes
The PVC resize was a gentle start — one line, clear fix, low risk. But the same approach works for harder problems.
Resource tuning. CPUThrottlingHigh fires. The agent reads current limits from the repo, checks Prometheus metrics through read-only Grafana, and proposes new values with reasoning in the PR: "p99 CPU peaked at 450m over the last 7 days, current limit is 500m, recommending 800m." I can sanity-check the numbers before they ship.
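The post doesn't say how the agent would pick a value, so this is only one plausible rule: multiply the observed p99 by a headroom factor and round up to a tidy step. The 1.75 factor and 100m step are assumptions chosen so the sketch reproduces the example numbers above, and `recommend_cpu_limit` is a hypothetical name.

```python
import math


def recommend_cpu_limit(p99_millicores: int, headroom: float = 1.75, step: int = 100) -> int:
    """Suggest a CPU limit: p99 times a headroom factor, rounded up to `step` millicores."""
    return math.ceil(p99_millicores * headroom / step) * step


def pr_reasoning(p99: int, current: int) -> str:
    """One-line justification of the kind a reviewer can sanity-check."""
    proposed = recommend_cpu_limit(p99)
    return (f"p99 CPU peaked at {p99}m over the last 7 days, "
            f"current limit is {current}m, recommending {proposed}m.")


if __name__ == "__main__":
    print(pr_reasoning(450, 500))
```

Whatever rule is used, the value of putting the inputs in the PR text is that the reviewer can redo the calculation in their head.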
Rollbacks. A deployment starts crash-looping after an image update. The agent finds the previous tag from git log, opens a PR reverting to it, includes the error context. Compare this with kubectl rollout undo — which works, but leaves your repo pointing at the broken version. Drift again.
Coordinated changes. HPAMaxReplicasReached — the autoscaler is capped and traffic is still growing. At 4am you'd probably fix the HPA and forget the node pool. The agent can propose both in one PR:
PR: Scale API tier for increased traffic
- deployments/api/hpa.yaml: maxReplicas 10 → 25
- cluster/node-pools.tf: api_pool_max 5 → 10
Security remediations. A scanner flags an overly permissive NetworkPolicy. The agent proposes a tighter one. This is where the review step matters most — network policy changes break things in subtle ways. The agent proposes, I validate against what I know about the traffic patterns, and only then does it merge.
Terraform. Same idea, different files. Right-sizing instances when CloudWatch flags underutilization. Adding required tags for compliance. Adjusting ASG sizes. The review step matters even more here — cloud changes can be expensive. I want to confirm that bumping an RDS instance from db.r5.large to db.r5.xlarge is actually the right call, not just a reflex.
Building trust
Right now, I review every PR. That's fine. The review takes ten seconds for a one-line diff.
Eventually, once I've seen the agent correctly handle 50 PVC resizes, I'll set up an auto-merge rule for PRs that only touch persistentVolume.size. The agent is still constrained to the GitOps flow — it still can't kubectl delete anything — but I'm not the bottleneck on the boring stuff anymore.
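A rough sketch of what such an auto-merge check could look like: accept a diff only if it removes exactly one `size:` line and adds exactly one larger one. It is deliberately crude. It matches on the line text rather than the full `persistentVolume.size` key path, only understands `Gi`, and would misread a changed YAML `---` separator as a file header, so a real rule should parse the YAML on both sides. `only_grows_size` and the sample diff are hypothetical.

```python
import re

SIZE_LINE = re.compile(r"^\s*size:\s*(\d+)Gi\s*$")


def only_grows_size(diff: str) -> bool:
    """True if a unified diff changes exactly one `size: NGi` line, and only upwards."""
    removed, added = [], []
    for line in diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not content changes
        if line.startswith("-"):
            removed.append(line[1:])
        elif line.startswith("+"):
            added.append(line[1:])
    if len(removed) != 1 or len(added) != 1:
        return False
    old, new = SIZE_LINE.match(removed[0]), SIZE_LINE.match(added[0])
    if not (old and new):
        return False
    return int(new.group(1)) > int(old.group(1))


# Hypothetical diff in the shape of the one-line PR described above.
GOOD_DIFF = (
    "--- a/drone/helmrelease.yaml\n"
    "+++ b/drone/helmrelease.yaml\n"
    "@@ -10,1 +10,1 @@\n"
    "-      size: 8Gi\n"
    "+      size: 20Gi\n"
)

if __name__ == "__main__":
    print(only_grows_size(GOOD_DIFF))
```

Anything that fails the check simply falls back to human review, which is the default anyway.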
And then the agent starts monitoring its own changes. After merge, it checks that the volume actually expanded, confirms the alert resolved, closes the ticket. If something goes wrong, it opens a follow-up PR. I'm in the loop via notifications, but I'm not doing any manual work.
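The "did the volume actually expand" check could be as simple as comparing the requested and reported sizes on the PVC object. The sketch below assumes the agent is handed the JSON from something like a read-only `kubectl get pvc -o json` and compares `spec.resources.requests.storage` with `status.capacity.storage` as plain strings, which only works if both use the same unit spelling. `pvc_expanded` and the sample object are hypothetical.

```python
import json


def pvc_expanded(pvc_json: str, expected: str) -> bool:
    """True once both the requested size and the reported capacity equal `expected`."""
    pvc = json.loads(pvc_json)
    requested = pvc["spec"]["resources"]["requests"]["storage"]
    actual = pvc.get("status", {}).get("capacity", {}).get("storage")
    # String comparison is a simplification; it does not normalise units.
    return requested == expected and actual == expected


# Hypothetical PVC object, trimmed to the fields the check reads.
SAMPLE_PVC = json.dumps({
    "metadata": {"name": "drone", "namespace": "drone"},
    "spec": {"resources": {"requests": {"storage": "20Gi"}}},
    "status": {"capacity": {"storage": "20Gi"}},
})

if __name__ == "__main__":
    print(pvc_expanded(SAMPLE_PVC, "20Gi"))
```

If the check stays false after a reasonable wait, that is the cue for a follow-up PR or a notification, not for the agent to reach for the cluster.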
We're not there yet. The PVC resize was step one. But the path from "human reviews everything" to "auto-merge low-risk changes" to "agent monitors its own rollouts" is clear enough. And every step of the way, the agent is constrained to git. It can propose whatever it wants. It can't do anything without a merge.
This is what Helix Org gives us. The DevOps agent occupies its role in the org chart, subscribed to Alertmanager webhooks, and activates on every alert. No human has to copy-paste an alert into a chat window at 4am. The agent is just there, the same way an on-call engineer would be — except it doesn't mind being woken up.
Phil is presenting Helix Org at Agent Craft London tomorrow — come say hi if you're around.
When it doesn't work
This needs GitOps. If your infrastructure isn't managed through a repo that something reconciles, there's no PR to open. You also need someone who can review promptly — if the PR sits for three days, your volume fills up anyway.
It's less useful for novel debugging where you need to read logs and inspect pod state. And if the volume fills up in 20 minutes instead of 4 days, you probably don't have time for a review cycle. Though if you've built enough trust for auto-merge on low-risk changes, maybe you do.
Anyway
We spent a decade moving from "SSH in and edit config files" to "commit YAML and let the pipeline deploy it." AI agents shouldn't drag us back to the SSH era just because they're fast at it.
The agent that fixed our volume alert didn't need kubectl. It needed git push and good judgement. And when it wasn't sure — about Helm's PVC patching behaviour, about the StorageClass defaults — it said so, and included fallbacks.
That's the agent I want. Not the one that deleted my CI system at 10,000 feet, but the one that opens a clean PR at 4am so I can merge it over coffee at 9.
helixml/infra#18 — one line, 8Gi to 20Gi. The volume expanded online. The alert resolved. Not a soul touched the cluster.