Aug 17, 2024

How and Why We Made SREBench, SWEBench for Kubernetes

We made SREBench, a k8s task dataset, to evaluate LLM performance on root-causing Kubernetes issues.

Jeffrey Tsaw

Co-founder of Parity

The search for SWEBench but for Kubernetes

When we first started working on Parity (you can read our launch here!), we dove headfirst into building the most capable AI agent we could. We experimented with different frameworks, prompts, and architectures until we finally built something we were happy with. At this point we realized we were a little stuck. We had what we were telling ourselves was an awesome AI agent, but we didn’t really have any proof to back it up.

Up until this point, we’d just been running our agent against our own perfectly healthy cluster with made-up scenarios like “investigate high memory usage in our cluster” or “why is my pod restarting frequently”, and seeing whether it could tell that they were false positives. With some tuning, our agent got great at this, and we felt it was time to give it some real challenges.

With the popularity of SWEBench (a benchmark for LLM coding agents), we thought there must be a similar dataset or benchmark for SRE-related tasks, right?

Nope.

We didn’t find any publicly available benchmark or dataset that contained SRE-related tasks. The closest we found was work by researchers at the Alibaba Group, who benchmarked a similar agent on their own cloud. Sadly, we were (and are) one hugely successful cloud business short of being able to do this too.

So, without the internet’s help, we set out to figure out a way to evaluate our agent.

Attempt 1: Use a real cluster, inject real issues

This was our thought process: Kubernetes clusters are relatively easy to spin up. Why don’t we just make a test cluster, put it into a bad state, and see if our agent can figure out the root cause?

We ended up getting stuck at the second part: putting the cluster in a bad state. We spent some time trying to figure out how to do this using only kubectl commands, with no luck. As we were about to give up, someone referred us to LitmusChaos, a platform for injecting chaos into your Kubernetes cluster. We were saved…or so we thought.

After poking around the platform, we realized it was going to be more complex and take more time to set up than we were willing to commit.

We decided to abandon the LitmusChaos idea. We wanted to focus on building a good product now, and taking a week to build a scalable testing framework around this technology would be a distraction. We wanted something a little easier to set up that could give us some indication of how good our agent was.

Attempt 2: Let’s just throw AI at it and see what happens

Here’s what we thought. At the end of the day, root cause analysis is just a reasoning task. Given an issue, the goal is to collect facts, process them, and reason through them to figure out the root cause. We thought briefly about ways to transpose an (issue, root cause) pair into a problem space that already has a benchmark, like math problems, but we couldn’t think of a good way to do this and gave up.

While we were going down the rabbit hole of transposing Kubernetes issues, we came across another reasoning benchmark: MuSR. MuSR is a murder mystery reasoning benchmark for LLMs that was recently published at ICLR 2024. It consists of a set of murder mystery stories and the murderer in each story. What was interesting about it to us was that it’s a synthetic benchmark, i.e., generated using LLMs.

The tl;dr version of the paper is that they tell an LLM who the murderer is and ask it to generate facts. Examples of facts could be “Jason is the murderer”, “Jason has a motive”, etc. The researchers then take each fact and ask another LLM to generate logical statements based on it. For example, Jason has a motive ⇒ Jason owed money to people ⇒ Jason has a gambling problem. Once a set of facts and statements has been generated, they ask an LLM to write a reasonable story around them, and that becomes the dataset!
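To make the fact-expansion idea concrete, here is a minimal sketch of that recursive step. It is not MuSR's actual code: `build_tree`, `leaves`, and `canned_expand` are hypothetical names, and the "LLM" is replaced by a canned lookup so the example runs on its own.

```python
from typing import Callable

# Hypothetical interface: given a fact, return statements that support it.
# In a real pipeline this would be an LLM call; here it is a stand-in.
Expand = Callable[[str], list[str]]

# Canned expansions mirroring the Jason example from the text (assumption).
CANNED = {
    "Jason has a motive": ["Jason owed money to people"],
    "Jason owed money to people": ["Jason has a gambling problem"],
}


def canned_expand(fact: str) -> list[str]:
    return CANNED.get(fact, [])


def build_tree(fact: str, expand: Expand, depth: int) -> dict:
    """Expand a fact into supporting statements, up to `depth` levels."""
    children = expand(fact) if depth > 0 else []
    return {
        "fact": fact,
        "supports": [build_tree(c, expand, depth - 1) for c in children],
    }


def leaves(tree: dict) -> list[str]:
    """Collect the most specific statements, the raw material for a story."""
    if not tree["supports"]:
        return [tree["fact"]]
    out: list[str] = []
    for child in tree["supports"]:
        out.extend(leaves(child))
    return out


tree = build_tree("Jason has a motive", canned_expand, depth=2)
print(leaves(tree))
```

The leaf statements are what a story-writing prompt would weave into the narrative, while the top-level facts stay hidden as the answer.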

At this point we started to get excited. If the researchers could use an LLM that knows who the murderer is to generate facts consistent with that, we could use an LLM that knows the root cause of a Kubernetes-related issue to generate facts consistent with it.

SREBench is born!

Armed with this, we set out to build SREBench. We started by scraping StackOverflow, ServerFault, and Reddit for common Kubernetes issues and their root causes. We then built a platform that would reroute any kubectl command our agent called to an LLM. This LLM knew the actual root cause and was prompted to respond with output consistent with it. We then fed our agent the issue and compared its proposed root cause against the actual one.
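A rough sketch of what such a setup could look like is below. This is an illustration under assumptions, not Parity's implementation: `Scenario`, `SimulatedKubectl`, `build_simulator_prompt`, and `evaluate` are hypothetical names, the prompt wording is made up, and both the LLM and the comparison step (`judge`) are passed in as plain functions because the post does not say how either was done.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Scenario:
    issue: str       # what the agent is told, e.g. a symptom report
    root_cause: str  # ground truth, hidden from the agent


def build_simulator_prompt(scenario: Scenario, command: str) -> str:
    """Prompt for the simulator LLM, which knows the hidden root cause."""
    return (
        "You are simulating a Kubernetes cluster.\n"
        f"Reported issue: {scenario.issue}\n"
        f"Hidden root cause: {scenario.root_cause}\n"
        "Reply only with realistic output for this command, consistent "
        "with the root cause:\n"
        f"$ {command}"
    )


class SimulatedKubectl:
    """Stands in for kubectl: commands go to an LLM, not a real cluster."""

    def __init__(self, scenario: Scenario, llm: LLM) -> None:
        self.scenario = scenario
        self.llm = llm
        self.history: list[str] = []  # commands the agent has issued

    def run(self, command: str) -> str:
        if not command.startswith("kubectl "):
            raise ValueError("only kubectl commands are simulated")
        self.history.append(command)
        return self.llm(build_simulator_prompt(self.scenario, command))


def evaluate(
    agent: Callable[[str, Callable[[str], str]], str],
    scenario: Scenario,
    simulator_llm: LLM,
    judge: Callable[[str, str], bool],
) -> bool:
    """Give the agent the issue plus a kubectl stand-in, then judge its answer."""
    kubectl = SimulatedKubectl(scenario, simulator_llm)
    guess = agent(scenario.issue, kubectl.run)
    return judge(guess, scenario.root_cause)
```

The agent only ever sees the issue text and the `run` function, so the root cause stays hidden inside the simulator prompt. Swapping `simulator_llm` for a canned function makes the harness testable without any model calls.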

With the agent evaluation part done, we just needed something to compare it to. So, we built an internal version of sreben.ch and tried it ourselves. After that, we thought: let’s get more people to try it. We quickly threw together a website with the benchmark on it, offered $100 to the top scorer to incentivize people to actually try, and called it https://sreben.ch. The competition was born!

If you have any questions or comments about SREBench or what we’re doing at Parity, feel free to reach out at founders@tryparity.com

Revolutionize Your Incident Response

Transform your on-call experience with Parity's AI SRE. Parity works alongside your engineers to resolve incidents.

2025 • Parity • San FRANCISCO