We met at Instacart, where I was the first SRE and JJ was on the product side, owning ~20% of GMV across the enterprise and last-mile delivery business. As Instacart grew from processing hundreds of orders to millions, we had to scale our infrastructure, teams, and processes to keep up. Unsurprisingly, this led to our fair share of incidents (checkout issues, site outages, etc.) and a lot of restless nights on-call.
This was compounded by COVID-19 and the first wave of lockdowns. Traffic surged 500% overnight as everyone turned to online grocery. That stressed every element of our incident management process and highlighted our need for a better one. Our manual ways of working across Slack, PagerDuty, and Datadog simply weren’t enough. At first, we figured this was an Instacart-specific problem, but luckily we realized it wasn’t.
The core issue was that our process lacked consistency: it varied greatly depending on who was responding and how much incident experience they had. After declaring an incident, most companies rely on a runbook buried away in something like Confluence or Google Docs and try to follow a lengthy checklist of steps. That runbook is hard to find, difficult to follow accurately, slow, and stress-inducing, especially after you’ve been woken up by a page at 3 am. We started working on how to automate this.
Fast forward to today, and companies like Canva, Grammarly, Bolt, Faire, Productboard, OpenSea, and Shell use Rootly for their incident response. We think of ourselves as part of the post-alerting workflow. Tools like PagerDuty and Datadog act like a smoke alarm that alerts you to an incident, then hand off to Rootly so we can orchestrate the actual response.
We’ve learned a lot along the way. We realized the majority of our customers use the same 6 tools (Slack, PagerDuty, Jira, Zoom, Confluence, Google Docs) and follow roughly the same high-level incident response process (create incident → collaborate → write postmortem), but the details of their process vary dramatically. Changing these processes is hard.
Our focus in the early days was to build a hyper-opinionated product to help customers follow what we believe are best practices. Now our product direction is focused on configuration and flexibility: how can we plug Rootly into your existing way of working and automate it? This has helped our larger enterprise customers succeed by automating the processes they already have.
Our biggest competition is not PagerDuty/Opsgenie (in fact, 98% of our customers use them) or other startups. It’s the internal tooling companies have built out of necessity, often because tools like Rootly didn’t exist yet. Stripe (https://www.youtube.com/watch?v=fZ8rvMhLyI4) and GitLab (https://about.gitlab.com/handbook/engineering/infrastructure...) are good examples of this.
Our journey is just getting started, and we learn more each day. We would love to hear any feedback on our product, or anything you find frustrating about incident response today.
Leaving you with a quick demo: https://www.loom.com/share/313a8f81f0a046f284629afc3263ebff