upvote
>> how the runbooks can self heal if results from some steps in the middle are not expected.

Yeah this is a very interesting angle. Our primary mechanism here is via agent created auto-memories today. The agent keeps track of the most useful steps, and more importantly, dead end steps as it executes runbooks. We think this offers a great bridge to suggest runbook updates and keep them current.

>> Curious how much savings do you observe from using runbook versus purely let Claude do the planning at first.

Really depends on runbook quality, so I don't have a straightforward answer. Of course, it's faster and cheaper if you have well defined steps in your runbooks. As an example, `check logs for service frontend, faceted by host_name`, vs. `check logs`. Agent does more exploration in the latter case.

We wrote about the LLM costs of investigating production alerts more generally here, in case helpful: https://relvy.ai/blog/llm-cost-of-ai-sre-investigating-produ...

reply
Re: savings - it depends on the use case. For example, one of our users set up a small runbook to run a group-by-IP query for high-throughput alerts, since that was their most common first response to those alerts. That alone cuts out a couple of minutes of exploration per incident and removes the variability of the agent deciding what data to investigate and how to slice it.

In our experience, runbooks provide a consistent, fast, and reliable way of investigating incidents (or ruling out common causes). In their absence, the AI does its usual open-ended exploration.

reply