Rootly (YC S21) – Manage Incidents in Slack

Hi HN, Quentin and JJ here! We are co-founders at Rootly (https://rootly.com/), an incident management platform built on Slack. Rootly helps automate manual admin work during incidents like the creation of Slack channels, Jira tickets, Zoom rooms & more. We also help you get data on your incidents and help automate postmortem creation.

We met at Instacart, where I was the first SRE and JJ was on the product side, owning ~20% of GMV across the enterprise and last-mile delivery business. As Instacart grew from processing hundreds of orders to millions, we had to scale our infrastructure, teams, and processes to keep up. Unsurprisingly, this led to our fair share of incidents (e.g. checkout issues, site outages) and a lot of restless nights on call.

This was further compounded by COVID-19 and the first wave of lockdowns. Our traffic surged by 500% overnight as everyone turned to online grocery. That stressed every element of our incident management process and highlighted our need for a better one. Our manual ways of working across Slack, PagerDuty, and Datadog simply weren’t enough. At first, we figured this was an Instacart-specific problem, but luckily we realized it wasn’t.

A few things stood out. Our process lacked consistency: it varied greatly depending on who was responding and how much incident experience they had. After declaring an incident, most companies rely on a runbook buried away somewhere like Confluence or Google Docs and try to follow a lengthy checklist of steps. That runbook is hard to find, difficult to follow accurately, slow, and stress-inducing, especially after you’ve been woken up by a page at 3 am. We started working on how to automate this.

Fast forward to today: companies like Canva, Grammarly, Bolt, Faire, Productboard, OpenSea, and Shell use Rootly for their incident response. We think of ourselves as part of the post-alerting workflow. Tools like PagerDuty and Datadog act like a smoke alarm that alerts you to an incident, then hand off to Rootly so we can orchestrate the actual response.

We’ve learned a lot along the way. We realized the majority of our customers use the same 6 tools (Slack, PagerDuty, Jira, Zoom, Confluence, Google Docs) and follow roughly the same high-level incident response flow (create incident → collaborate → write postmortem), but the details of their processes vary dramatically. Changing those processes is hard.

Our focus in the early days was to build a hyper-opinionated product that helped customers follow what we believe are the best practices. Now our product direction is focused on configuration and flexibility: how can we plug Rootly into your existing way of working and automate it? This has helped our larger enterprise customers succeed by automating the processes they already have.

Our biggest competition is not PagerDuty/Opsgenie (in fact, 98% of our customers use them) or other startups. It’s the internal tooling that companies have built out of necessity, often because tools like Rootly didn’t exist yet. Stripe (https://www.youtube.com/watch?v=fZ8rvMhLyI4) and GitLab (https://about.gitlab.com/handbook/engineering/infrastructure...) are good examples of this.

Our journey is just getting started, and we learn more each day. We would love to hear any feedback on our product, or anything you find frustrating about incident response today.

Leaving you with a quick demo: https://www.loom.com/share/313a8f81f0a046f284629afc3263ebff


