Mar 20, 2025

You Spend Millions on Reliability. So Why Does Everything Still Break?

Wilson Spearman

Co-founder

Your team spends millions on cloud-native infrastructure and the teams and tools behind it, yet incidents persist—frustrating customers, demoralizing engineers, and perplexing leadership. Why is reliability still elusive?

How did we get here?

Since Google’s release of Site Reliability Engineering in 2016, countless organizations have tried to find the delicate balance of reliability, velocity, and cost-efficiency—only to discover that striking this balance in practice is far harder than it appears in theory. Why?

  1. Superficial adoption of SRE

  2. Excessive complexity in cloud infrastructure

  3. Slow and overwhelmed incident response (the Death Spiral)

  4. Scarcity of qualified talent

Many adopted SRE principles superficially, layering tools and hiring specialists without fundamentally changing how they operate. Others equated “doing SRE” with increased budgets and headcount, believing more resources would automatically translate to fewer outages. Instead, complexity multiplied, incident response became slower, and reliability teams found themselves overwhelmed by noise and manual toil. Even those who genuinely invested in transforming their organization found that between scarce talent, overwhelming tool sprawl, and a million other challenges, Google’s vision of an SRE organization was close to impossible to achieve.

So here you are: millions of dollars in spend on tools that overpromise and underdeliver, and a team that's stretched too thin to think beyond its next on-call rotation.

But Wait, It’s About to Get Worse

DORA’s recent Gen AI report shows that developers are already reaping the productivity benefits of tools like Cursor and GitHub Copilot. Agentic tools like Anthropic’s Claude Code give us a glimpse of an even more transformative future, where code can be written with minimal human guidance. What’s good news for developers is frightening news for reliability teams. Empowered by AI, releases and new features will ship faster than ever, with less human oversight than before. While we may hope that AI will eventually produce perfect code, reliability teams will be left to deal with the consequences in the meantime: increased velocity will inevitably mean increased incidents. And leaders who believed reliability was a resource problem, solvable by throwing more money, more engineers, and more observability tools at it (a narrative vendors happily sold them), will soon learn that the speed enabled by AI demands a different approach.

We’ve Been Here Before: Lessons from the Cloud Transition

This isn’t to say we should slam the brakes on AI adoption. We faced similar challenges during the transition to the cloud: developers rapidly deployed microservices, reliability teams struggled to keep pace, and the industry responded by creating a new paradigm, SRE, along with new tools and practices. We'll see a similar evolution with AI.

One of the core strengths of the original SRE vision was its pragmatism and deep empathy for the business case. Guided by these same principles, SRE teams will again find practical solutions that allow their organizations to realize the upside of AI. Fortunately, these teams have already navigated a similar transition with the cloud, an experience that positions them to proactively leverage AI, not just endure it. Rather than being overwhelmed by AI-driven velocity, reliability teams will harness AI themselves to improve their own capabilities and effectiveness.

The Way Out

The path forward is to embrace a fundamentally different approach: proactive, AI-driven reliability. Instead of relying exclusively on human judgment and reactive incident response, organizations need to leverage AI to anticipate failures before they happen. By harnessing predictive insights, teams can identify and address potential incidents ahead of customer impact.
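To make "predictive insights" concrete, here's a minimal sketch in Python, with no particular vendor assumed: a rolling baseline over a latency metric that flags sustained drift before a hard SLO threshold is breached. The metric, window sizes, and thresholds are all illustrative, not recommendations.

```python
from collections import deque
from statistics import mean, stdev

class DriftDetector:
    """Flag sustained upward drift in a metric before it breaches a hard limit.

    Illustrative only: window sizes and thresholds are made-up defaults,
    not tuned values from any real system.
    """

    def __init__(self, baseline_window=300, z_threshold=3.0, sustain=5):
        self.baseline = deque(maxlen=baseline_window)  # recent "normal" samples
        self.z_threshold = z_threshold                 # how abnormal a sample must be
        self.sustain = sustain                         # consecutive abnormal samples required
        self.abnormal_streak = 0

    def observe(self, value: float) -> bool:
        """Return True when drift looks sustained enough to act on proactively."""
        if len(self.baseline) >= 30:  # need enough history for a stable baseline
            mu, sigma = mean(self.baseline), stdev(self.baseline)
            z = (value - mu) / sigma if sigma > 0 else 0.0
            self.abnormal_streak = self.abnormal_streak + 1 if z > self.z_threshold else 0
        self.baseline.append(value)
        return self.abnormal_streak >= self.sustain

# Usage: feed p99 latency samples as they arrive.
detector = DriftDetector()
for sample in [120, 118, 122, 119, 121] * 10 + [180, 185, 190, 200, 210]:
    if detector.observe(sample):
        print("latency drifting; investigate before the SLO burns")
        break
```

The point isn't the statistics; it's the posture. The detector pages on the trend, not the breach, buying the team time before customers feel it.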

AI-driven observability tools will automatically sift through noise, intelligently triaging alerts so that engineers focus only on the signals that truly matter. This frees reliability teams from the endless cycle of firefighting and manual toil, letting them invest in systemic improvements that prevent incidents at the source. Rather than simply reacting faster, teams will finally have the breathing room, and the capability, to build genuinely resilient systems that scale reliably with their businesses. Any seasoned SRE will rightfully recognize that this is easier said than done. We all watched as AIOps promised the moon and delivered a nothingburger. Let's hold on to a cautious optimism and keep the focus on outcomes. You don't have to look any further than the top of a Twitter feed or your favorite conference's keynotes to see that experimentation is happening. Not everything will work, some things will, and those that do will transform this industry.
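And for the triage half of the story, a toy sketch of the pipeline's shape, again in Python and again with every field name assumed rather than taken from any real product: deduplicate alerts by fingerprint, suppress flappers, and route what's left by severity and novelty. The final rule is exactly where an LLM or learned classifier would slot in.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Alert:
    fingerprint: str   # stable id for "the same problem" (service + rule)
    severity: int      # 1 = page-worthy, 3 = informational
    message: str

@dataclass
class TriageQueue:
    """Toy triage: dedupe, suppress flapping, rank by severity and novelty."""
    seen: dict = field(default_factory=dict)   # fingerprint -> (count, first_seen)
    flap_limit: int = 5                        # firings in window before we suppress
    window_s: float = 600.0

    def triage(self, alert: Alert) -> str:
        now = time.time()
        count, first_seen = self.seen.get(alert.fingerprint, (0, now))
        if now - first_seen > self.window_s:
            count, first_seen = 0, now         # window expired; reset history
        self.seen[alert.fingerprint] = (count + 1, first_seen)

        if count + 1 > self.flap_limit:
            return "suppress"                  # flapping: aggregate, don't page
        if count == 0 and alert.severity == 1:
            return "page"                      # novel and severe: a human looks now
        # Everything else: queue for batch review. An LLM or learned model
        # could replace this rule with context-aware scoring.
        return "queue"

q = TriageQueue()
print(q.triage(Alert("checkout/latency", 1, "p99 over SLO")))   # -> page
print(q.triage(Alert("checkout/latency", 1, "p99 over SLO")))   # -> queue (repeat)
print(q.triage(Alert("batch/retries", 3, "retry spike")))       # -> queue
```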

Revolutionize Your Incident Response

Transform your on-call with Parity's AI SRE. Parity works with your team to resolve incidents in seconds, not hours.
