Parity (YC S24) – AI for on-call engineers working with Kubernetes

Hey HN — we’re Jeffrey, Coleman, and Wilson, and we’re building Parity (https://tryparity.com), an AI SRE copilot for on-call engineers working with Kubernetes. Before you've opened your laptop, Parity has conducted an investigation to triage, determine root cause, and suggest a remediation for an issue. You can check out a quick demo of Parity here: https://tryparity.com/demo

We met working together as engineers at Crusoe, a cloud provider, and we always dreaded being on-call. It meant a week of putting our lives and projects on hold so we could firefight an issue at any hour of the day. We lost sleep to PagerDuty alerts that woke us up only to go find and follow a runbook. We canceled plans to sift through dashboards and logs in search of the root cause of downtime in our k8s cluster.

After speaking with other devs and SREs, we realized we weren’t alone. While every team wants better monitoring systems or a more resilient design, the reality is that time and resources are often too limited to make these investments.

We’re building Parity to solve this problem. We use AI agents to execute runbooks and conduct root cause analysis, so engineers working with Kubernetes can handle on-call more easily. We knew LLMs could help given their ability to quickly process and interpret large amounts of data. But we’ve found that LLMs alone aren’t capable enough, so we’ve built agents to take on more complex tasks like root cause analysis. By making these tasks easier for on-call engineers, and eventually taking them off their plates entirely, we free up time for complex and valuable engineering investments.

We built an agent to investigate issues in Kubernetes by following the same steps a human would: developing a possible root cause, validating it with logs and metrics, and iterating until a well-supported root cause is found. Given a symptom like “we’re seeing elevated 503 errors”, our agent develops hypotheses as to why this may be the case, such as nginx being misconfigured or application pods being under-resourced. Then, it gathers the necessary information from the cluster to either support or rule out those hypotheses. These results are presented to the engineer as a report with a summary and each hypothesis. It includes all the evidence the agent considered when coming to a conclusion so that an engineer can quickly review and validate the results. With the results of the investigation, an on-call engineer can focus on implementing a fix.
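To make the loop above concrete, here is a minimal sketch of the hypothesize, gather evidence, report cycle. It is an illustration only, not Parity's implementation: the `Hypothesis` class, the `investigate` function, the two check functions, and the `ClusterState` keys are all hypothetical names, the hypotheses are hard-coded rather than generated by an LLM, and the sketch makes a single pass instead of iterating.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical snapshot of cluster facts. A real agent would query logs and metrics.
ClusterState = Dict[str, object]


@dataclass
class Hypothesis:
    description: str
    # Returns (supported, evidence note) after inspecting the cluster state.
    check: Callable[[ClusterState], Tuple[bool, str]]
    supported: bool = False
    evidence: List[str] = field(default_factory=list)


def nginx_misconfigured(state: ClusterState) -> Tuple[bool, str]:
    valid = state.get("nginx_config_valid", True)
    return (not valid, f"nginx config valid: {valid}")


def pods_under_resourced(state: ClusterState) -> Tuple[bool, str]:
    kills = state.get("oom_kills", 0)
    return (kills > 0, f"OOMKilled containers in the last hour: {kills}")


def investigate(symptom: str, hypotheses: List[Hypothesis], state: ClusterState) -> dict:
    """Test each hypothesis against the evidence and build a reviewable report."""
    for h in hypotheses:
        supported, note = h.check(state)
        h.supported = supported
        h.evidence.append(note)  # keep evidence for supported and rejected hypotheses alike
    return {
        "symptom": symptom,
        "likely_causes": [h.description for h in hypotheses if h.supported],
        "hypotheses": hypotheses,
    }


report = investigate(
    "elevated 503 errors",
    [
        Hypothesis("nginx is misconfigured", nginx_misconfigured),
        Hypothesis("application pods are under-resourced", pods_under_resourced),
    ],
    {"nginx_config_valid": True, "oom_kills": 4},
)
print(report["likely_causes"])
```

Keeping the evidence attached to every hypothesis, including the ones that were ruled out, is what lets an engineer audit the conclusion instead of taking it on faith.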

We’ve built an additional agent to automatically execute runbooks when an alert is triggered. It follows the steps of a runbook more rigorously than an LLM alone and with more flexibility than workflow automation tools like Temporal. This agent is a combination of separate LLM agents, each responsible for a single step of the runbook. Each runbook step agent executes arbitrary instructions like “look for nginx logs that could explain the 503 error”. A separate LLM evaluates the results, checks that the step agent followed the instructions, and determines which step of the runbook to execute next. This allows us to execute runbooks with cycles, retries, and complex branching conditions.
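The step agent plus evaluator split described above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the actual product code: `run_runbook`, `fake_step_agent`, `fake_evaluator`, and the step names are hypothetical, and both LLM calls are replaced by plain functions so the control flow (including a retry) is easy to follow.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Both roles would be LLM calls in practice; here they are ordinary callables.
StepAgent = Callable[[str, dict], str]                 # (instruction, context) -> result
Evaluator = Callable[[str, str, dict], Optional[str]]  # (step, result, context) -> next step or None


def run_runbook(
    steps: Dict[str, str],
    start: str,
    step_agent: StepAgent,
    evaluator: Evaluator,
    context: dict,
    max_steps: int = 20,
) -> List[Tuple[str, str]]:
    """Run steps until the evaluator returns None or the step budget is spent."""
    trace: List[Tuple[str, str]] = []
    current: Optional[str] = start
    for _ in range(max_steps):  # the budget guards against endless cycles
        if current is None:
            break
        result = step_agent(steps[current], context)
        trace.append((current, result))
        # The evaluator may return the same step (retry), an earlier one (cycle),
        # a different one (branch), or None (finished).
        current = evaluator(current, result, context)
    return trace


RUNBOOK = {
    "check_nginx_logs": "look for nginx logs that could explain the 503 error",
    "check_pod_resources": "inspect resource usage of the application pods",
}


def fake_step_agent(instruction: str, context: dict) -> str:
    context["calls"] = context.get("calls", 0) + 1
    if context["calls"] == 1:
        return "no output"  # simulate a step that fails on the first attempt
    return f"done: {instruction}"


def fake_evaluator(step: str, result: str, context: dict) -> Optional[str]:
    if not result.startswith("done"):
        return step  # instructions were not satisfied, so retry the same step
    return {"check_nginx_logs": "check_pod_resources", "check_pod_resources": None}[step]


trace = run_runbook(RUNBOOK, "check_nginx_logs", fake_step_agent, fake_evaluator, {})
print([name for name, _ in trace])
```

Because the next step is chosen at runtime by the evaluator instead of being fixed in a static graph, retries and branches need no special handling in the loop itself.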

With these tools, we aim to handle the “what’s going wrong” part of on-call for engineers. We still believe it makes the most sense to continue to trust engineers with actually resolving issues as this requires potentially dangerous or irreversible commands. For that reason, our agents exclusively execute read-only commands.
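One way to picture a read-only restriction is an allowlist check applied to every command before it runs. The sketch below is an assumption about how such a guard could look, not a description of how Parity enforces it; `is_read_only` and `READ_ONLY_VERBS` are hypothetical names. The verbs listed are standard kubectl subcommands that only read cluster state.

```python
import shlex

# kubectl subcommands that only read cluster state.
READ_ONLY_VERBS = frozenset({"get", "describe", "logs", "top", "explain", "version"})

# Characters that could chain or redirect commands if the string reached a shell.
SHELL_METACHARACTERS = set(";|&$`<>\n")


def is_read_only(command: str) -> bool:
    """Return True only for commands of the form `kubectl <read-only verb> ...`."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse errors
        return False
    if len(argv) < 2 or argv[0] != "kubectl":
        return False
    # Deliberately conservative: the verb must come right after `kubectl`,
    # so forms like `kubectl -n prod get pods` are rejected, not parsed.
    return argv[1] in READ_ONLY_VERBS


print(is_read_only("kubectl get pods -n prod"))
print(is_read_only("kubectl delete pod web-1"))
```

A client-side check like this is best treated as a second line of defense. In a cluster, the stronger guarantee usually comes from Kubernetes RBAC, for example a service account bound to a role that grants only the `get`, `list`, and `watch` verbs.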

If this sounds like it could be useful for you, we’d love for you to give the product a try! Our service can be installed in your cluster via a Helm repo in just a couple of minutes. For our HN launch, we’ve removed the billing requirement for new accounts, so you can test it out on your cluster for free.

We’d love to hear your feedback in the comments!


