We met working together as engineers at Crusoe, a cloud provider, and we always dreaded being on-call. It meant a week of putting our lives and projects on hold to be ready to firefight an issue at any hour of the day. We lost sleep after being woken by a PagerDuty alert, only to then have to find and follow a runbook. We canceled plans to make time to sift through dashboards and logs in search of the root cause of downtime in our Kubernetes cluster.
After speaking with other devs and SREs, we realized we weren’t alone. While every team wants better monitoring systems or a more resilient design, the reality is that time and resources are often too limited to make these investments.
We’re building Parity to solve this problem. We’re enabling engineers working with Kubernetes to handle their on-call more easily by using AI agents to execute runbooks and conduct root cause analysis. We knew LLMs could help, given their ability to quickly process and interpret large amounts of data, but we’ve found that LLMs alone aren’t capable enough, so we’ve built agents on top of them to take on more complex tasks like root cause analysis. By making these tasks easier for on-call engineers today, and eventually freeing them from such responsibilities entirely, we free up time for the complex, valuable engineering investments that teams rarely have the bandwidth for.
We built an agent that investigates issues in Kubernetes by following the same steps a human would: developing a possible root cause, validating it against logs and metrics, and iterating until a well-supported root cause is found. Given a symptom like “we’re seeing elevated 503 errors”, our agent develops hypotheses as to why, such as nginx being misconfigured or application pods being under-resourced. It then gathers the necessary information from the cluster to either support or rule out each hypothesis. The results are presented to the engineer as a report: a summary plus each hypothesis, along with all the evidence the agent considered in reaching its conclusion, so an engineer can quickly review and validate the findings. With the investigation done, the on-call engineer can focus on implementing a fix.
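To give a sense of the shape of that loop, here’s a minimal Python sketch. Every name in it (Hypothesis, generate_hypotheses, gather_evidence, judge) is an illustrative placeholder for the behavior described above, not Parity’s actual code:

    # Hypothesis-driven investigation loop. All names here are
    # illustrative placeholders, not Parity's real API.
    from dataclasses import dataclass, field

    @dataclass
    class Hypothesis:
        description: str                  # e.g. "nginx is misconfigured"
        evidence: list[str] = field(default_factory=list)
        verdict: str = "open"             # "open", "supported", or "ruled_out"

    def generate_hypotheses(symptom: str) -> list[Hypothesis]:
        # In practice: prompt an LLM with the symptom and cluster context.
        return [Hypothesis("nginx is misconfigured"),
                Hypothesis("application pods are under-resourced")]

    def gather_evidence(h: Hypothesis) -> list[str]:
        # In practice: read-only queries against the cluster
        # (logs, metrics, describe output, and so on).
        return [f"evidence relevant to: {h.description}"]

    def judge(h: Hypothesis) -> str:
        # In practice: an LLM weighs the gathered evidence for and
        # against the hypothesis.
        return "supported" if h.evidence else "open"

    def investigate(symptom: str, max_rounds: int = 5) -> list[Hypothesis]:
        hypotheses = generate_hypotheses(symptom)
        for _ in range(max_rounds):
            pending = [h for h in hypotheses if h.verdict == "open"]
            if not pending:
                break                     # every hypothesis resolved
            for h in pending:
                h.evidence.extend(gather_evidence(h))
                h.verdict = judge(h)
        return hypotheses                 # rendered into the report

    report = investigate("we're seeing elevated 503 errors")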
We’ve built a second agent that automatically executes runbooks when an alert is triggered. It follows the steps of a runbook more rigorously than an LLM alone, and with more flexibility than workflow automation tools like Temporal. Under the hood, it’s a collection of separate LLM agents, each responsible for a single step of the runbook. Each step agent executes an arbitrary instruction like “look for nginx logs that could explain the 503 error”. A separate LLM then evaluates the results, checking that the step agent followed its instructions, and determines which step of the runbook to execute next. This structure lets us execute runbooks with cycles, retries, and complex branching conditions.
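In rough Python, that control flow looks something like the sketch below. Again, the names and structure are assumptions for illustration, not our actual implementation:

    # Runbook execution as a graph of single-step agents. Illustrative
    # sketch only; names and structure are assumptions.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Step:
        name: str
        instruction: str   # natural-language instruction for this step

    def run_step_agent(instruction: str) -> str:
        # In practice: an LLM agent with read-only cluster access
        # carries out one natural-language instruction.
        return f"output of: {instruction}"

    def evaluate(step: Step, output: str) -> tuple[bool, Optional[str]]:
        # In practice: a separate evaluator LLM checks that the step
        # agent followed its instruction and picks the next step.
        # Returning (False, ...) retries the same step; returning a
        # different name branches; an earlier name creates a cycle.
        return True, None                  # placeholder: accept and stop

    def run_runbook(steps: dict[str, Step], start: str,
                    max_steps: int = 20) -> dict[str, str]:
        results: dict[str, str] = {}
        current: Optional[str] = start
        for _ in range(max_steps):         # hard cap so cycles terminate
            if current is None:
                break
            step = steps[current]
            output = run_step_agent(step.instruction)
            followed, next_name = evaluate(step, output)
            results[step.name] = output
            current = step.name if not followed else next_name
        return results

    steps = {"check_logs": Step("check_logs",
                                "look for nginx logs that could explain the 503 error")}
    print(run_runbook(steps, start="check_logs"))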
With these tools, we aim to handle the “what’s going wrong” part of on-call for engineers. We still believe it makes the most sense to trust engineers with actually resolving issues, since fixes can require dangerous or irreversible commands. For that reason, our agents exclusively execute read-only commands.
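As a toy illustration of what that constraint can look like at the command layer (the allowlist here is an assumption for the sake of example, not Parity’s actual policy):

    # Toy read-only guard over kubectl-style commands. The allowlist
    # is an assumption for illustration, not Parity's actual policy.
    READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain"}

    def is_read_only(command: str) -> bool:
        parts = command.split()
        return (len(parts) >= 2
                and parts[0] == "kubectl"
                and parts[1] in READ_ONLY_VERBS)

    assert is_read_only("kubectl logs deploy/nginx")
    assert is_read_only("kubectl describe pod web-0")
    assert not is_read_only("kubectl delete pod web-0")

In practice, a guarantee like this is typically backed by Kubernetes RBAC as well, e.g. a role limited to get/list/watch verbs, so the agent’s credentials can’t mutate the cluster even if a command slipped past a filter.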
If this sounds like it could be useful for you, we’d love for you to give the product a try! Our service can be installed in your cluster via our Helm repo in just a couple of minutes. For our HN launch, we’ve removed the billing requirement for new accounts, so you can test it out on your cluster for free.
We’d love to hear your feedback in the comments!