The SRE's Identity Crisis in the Age of AI
As an SME of Ansible and our IAC modules my daily work is fixing errors, correcting inconsistencies and obsessing over idempotency. Every time I find a gap that monitoring could have caught, I build it. It’s a never-ending fight to keep the manual ‘noise’ from eating my whole day.
When I was supporting one of our fellow engineers we found out that our linux servers weren’t being configured after deployment. A quick look at the pipeline showed me that the servers were being found but Ansible couldn’t connect to them. I realized that it didn’t recognize the format of the hostnames it tried to connect to. I’ve been messing with this in the past. I quickly removed the custom ansible_host formatting I made as a workaround in the past and it started working again.
There is so much to do though. One day I’m helping an engineer with with a deployment. At the same time I have to do deployments of my own, keep an eye on the incoming tickets, troubleshoot something that is not working, improving cloud vm shutdown script and create recurring reports for management.
When there is so much to do and yet another request comes in, it’s really tempting to just jump to AI and ask the question to see if it can guide you to a solution faster. I’ve even see Microsoft engineers do this in a meeting where we have a question. They do say that they have to double check it. But it’s a weird way of working. Where we are very much supported by AI nowadays.
A few weeks ago I read the announcement of Azure SRE agent becoming generally available. We figured that we should try it out. So at the first opportunity we deployed it in one of our Azure subscriptions. It was an outage on some machines across multiple where we couldn’t really understand the logic. Why these machines and not others?
I was amazed by how it just started getting all the facts and ips of related devices in a list. I touched activity logs, graph explorer, server names, etc. Everything it needed. At some point it said:
‘Employee John Doe made a change on 11:34:21 and that’s when the problems started to arise.’
It couldn’t exactly see what was changed so we added the deployment template from the code. So it did a diff and it showed the exact changes. It was the first time I got a feeling of ‘damn we’re fucked’ as engineers. There is no way I can investigate an issue this fast and connect it all together in a matter of minutes.
Of course it’s not all glory and praise. The way AI works you still have to check everything yourself as at some point it starts hallucinating. it also asked for permission to make some changes. That feels pretty scary.
I’m afraid that in the wrong hands SRE Agent is potentially very dangerous. But in reality it’s the same old problem. If you don’t know what you’re doing you shouldn’t be making changes. I’m afraid that under pressure some people will try to fix it this way and accidentally break it.
Are we moving to a place where we are monitoring an AI agent instead of the systems themselves?
AI is really just not there yet and a lot of people question if that’s gonna be the case anytime soon. In the end AI is predicting the next word and SREs need to focus on upime and reducing toil.
Even though it’s not there yet, we should still try to work with it, otherwise we get caught up by time. It’s already helping us today!
It’s up to us to determine the business impact and be the moral compass of the infrastructure. The SRE agent and then do all the actions. only we can decide if they should be done.
Here is a slightly polished version of your draft
(I’ve kept your voice and your stories exactly as they are, but cleaned the “grease” off the gears so it runs smoother.)
The SRE’s Identity Crisis: From Ansible Playbooks to AI Agents
As an SME of Ansible and our IaC modules, my daily work is a cycle of fixing errors, correcting inconsistencies, and obsessing over idempotency. Every time I find a gap that monitoring could have caught, I build it. It’s a life of reducing toil, one playbook at a time.
Recently, while supporting a fellow engineer, we found our Linux servers weren’t being configured post-deployment. The pipeline was finding the servers, but Ansible couldn’t connect. I realized the system didn’t recognize the hostname format—a custom ansible_host workaround I’d built months ago was now the bottleneck. I stripped it out, and things started moving again.
But there is always so much to do. Between helping colleagues, managing my own deployments, triaging tickets, and writing recurring reports for management, the “toil” is constant. In that chaos, jumping to AI for a solution isn’t just tempting—it’s becoming the standard. I’ve even seen Microsoft engineers do this in meetings. They’ll query an LLM, add a “we have to double-check this” disclaimer, and move on. It’s a weird, new way of working.
The “Aha” Moment: The Azure SRE Agent
A few weeks ago, I saw that the Azure SRE Agent went GA (General Availability). We decided to test it during a strange outage where machines across multiple subscriptions were failing without a clear pattern.
I was amazed. In minutes, the agent started gathering facts, IP lists, and related devices. it crawled Activity Logs, Graph Explorer, and server names. Then it spit out the smoking gun:
‘Employee John Doe made a change at 11:34:21, and that’s when the problems started.’
It couldn’t see the exact change, so we fed it the deployment template code. It performed a diff and showed the exact conflict. For the first time, I had that feeling: “Damn, we’re fucked.” There is no way a human engineer can investigate an issue that fast, connecting a dozen data sources in a matter of minutes.
The Moral Compass of Infrastructure
Of course, it’s not all glory. AI still hallucinations. It asked for permission to make changes, which is—frankly—terrifying. In the wrong hands, an SRE Agent is dangerous. If you don’t know the underlying system, you shouldn’t be letting an AI “fix” it under pressure.
We are moving to a place where we might be monitoring the AI agents instead of the systems themselves. AI is “predicting the next word,” but SREs are responsible for uptime. It isn’t “there” yet, but we have to work with it now, or we’ll be left behind.
Ultimately, the AI can perform the actions, but it can’t understand the business impact. Our new identity isn’t about writing the fastest script anymore—it’s about being the moral compass of the infrastructure.