NavFlow cookbook

The SRE incident-response agent, on NavFlow

A companion to Anthropic's Site Reliability Agent cookbook. We keep that agent and change one thing — its read path.

If you've worked through the Anthropic cookbook, you have an incident-response agent that investigates a production issue, finds the root cause, and remediates it. To investigate, it calls about ten tools — Prometheus, container logs, config files, deploy history, alerts — one at a time, and correlates the results in its own context.

In this recipe we replace those ten reads with a single NavFlow query that returns the same signals already correlated, then add a trigger so NavFlow wakes the agent with the context attached. The agent's reasoning, action tools, and safety hooks are untouched. Every number below comes from a real run of the cookbook's agent against its own demo incident, with only the read path swapped.

What you'll do

Run the cookbook's agent as a baseline and measure its read path.
Point the agent at NavFlow and collapse the read path to one query.
Add a trigger so NavFlow wakes the agent with the incident already attached.

Prerequisites

The Anthropic SRE cookbook running locally — its Docker demo (api-server, Postgres, Prometheus, a traffic generator) is the system we break and fix.
The Claude Agent SDK and an Anthropic API key.
A NavFlow project — its MCP endpoint and an API key.

The starting point: the cookbook's read path

The demo incident is a classic one: a deploy reduces the api-server's database connection pool from 20 to 1. With traffic flowing, the single connection is overwhelmed, 500s spike, and p99 latency blows past a second. In the cookbook, the agent investigates by fanning out across its tools:

cookbook · read path (python)

# Each signal is a separate round-trip; the agent merges them in its head.
health  = await call_tool("get_service_health", {})
alerts  = await call_tool("get_alerts", {})
metrics = await call_tool("query_metrics", {"promql": 'rate(http_requests_total{status="500"}[1m])'})
latency = await call_tool("query_metrics", {"promql": "histogram_quantile(0.99, ...)"})
pool    = await call_tool("query_metrics", {"promql": "db_connections_active"})
config  = await call_tool("read_config_file", {"path": "config/api-server.env"})
logs    = await call_tool("get_container_logs", {"container": "api-server"})
deploys = await call_tool("get_recent_deployments", {})
# ... about ten calls, then a manual time-merge to find the cause

It works — the agent correctly traces the spike back to the deploy. But every call is a live round-trip, and the agent holds hundreds of metric series and thousands of log lines in context while it reasons.

Baseline, measured: 10 read calls, 12 agent turns, ~126k tokens of context, $0.66, 51 seconds (claude-opus-4-8).

What the agent actually saw

These are not illustrative figures. We ran the cookbook agent against its demo incident and captured the run. Here is the investigation it actually performed — ten read calls, in the order it made them:

cookbook agent · investigation trace (10 reads, in order)

 1   get_service_health
 2   get_alerts
 3   get_recent_deployments
 4   query_metrics
 5   query_metrics
 6   query_metrics
 7   read_config_file     config/api-server.env
 8   read_config_file     config/api-server.env.backup
 9   list_metrics
10   get_container_logs   api-server

Each call returned raw signal that the agent then had to correlate itself. The metrics it pulled from Prometheus while the incident was live:

prometheus · during the incident

500 rate  user-svc      = 5.16 /s     (baseline ~0.1)
500 rate  payment-svc   = 0.08 /s
500 rate  auth-svc      = 0.02 /s
p99 latency             = 4901 ms
db_pool_size            = 1
db_connections_active   = 1
db_connections_max      = 1

And the api-server container logs it read — the smoking gun:

docker logs · api-server

2026-06-09 11:44:41  ERROR  Connection pool exhausted: QueuePool limit of
                            size 1 overflow 0 reached, connection timed out,
                            timeout 2.00   (sqlalche.me/e/20/3o7r)
172.19.0.7  "GET /api/users HTTP/1.1"  500 Internal Server Error

Ten calls and a manual merge later, the agent had its answer: the deploy cut the pool to 1. That correlation — across metrics, logs, config, and deploys — is the work NavFlow moves server-side.
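
To make the "manual merge" concrete, here is a minimal sketch of what correlating by time involves: each source yields its own time-ordered events, and the merge interleaves them into one timeline. The Event type, the source labels, and the sample timestamps are illustrative assumptions. This is neither the cookbook's code nor NavFlow's implementation.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    ts: datetime   # when the signal was observed
    source: str    # hypothetical source label, e.g. "deployments"
    text: str      # short human-readable summary


def merge_timelines(*streams):
    """Interleave per-source event lists (each already sorted by ts) into one list."""
    return list(heapq.merge(*streams, key=lambda e: e.ts))


# Illustrative sample data, loosely modelled on the incident above.
deploys = [Event(datetime(2026, 6, 9, 11, 43, 42), "deployments", "pool size 20 -> 1")]
logs = [Event(datetime(2026, 6, 9, 11, 44, 41), "docker-logs", "Connection pool exhausted")]
metrics = [Event(datetime(2026, 6, 9, 11, 44, 42), "prometheus", "p99 = 4901 ms")]

timeline = merge_timelines(metrics, logs, deploys)
for event in timeline:
    print(event.ts.isoformat(), event.source, event.text)
```

Once the events are in one ordered list, the deploy sits visibly ahead of the first error, which is the ordering the agent otherwise has to reconstruct from ten separate results.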

Step 1 · Point the agent at NavFlow

NavFlow ingests the same systems once and exposes one tool over MCP. Replace the ten read tools in the agent's options with the NavFlow server:

connect the agent (python)

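from claude_agent_sdk import ClaudeAgentOptions, query

options = ClaudeAgentOptions(
    system_prompt=SRE_SYSTEM_PROMPT,
    mcp_servers={
        "navflow": {
            # Remote MCP servers need an explicit transport type in the SDK;
            # "http" is assumed here, use "sse" if your endpoint requires it.
            "type": "http",
            "url": "https://mcp.navflow.ai/<your-project>",
            "headers": {"Authorization": f"Bearer {NAVFLOW_KEY}"},
        }
        # The cookbook's action-tool server entry stays here, unchanged.
    },
    allowed_tools=["mcp__navflow__query"],   # the read tools are gone
    model="claude-opus-4-8",
)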

Nothing else about the agent changes — same system prompt, same action tools, same hooks.
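
The snippet above assumes NAVFLOW_KEY is already defined. One way to supply it, sketched here under assumptions, is to build the server entry from environment variables. The variable names NAVFLOW_PROJECT and NAVFLOW_API_KEY and both helper functions are hypothetical, and the "http" transport is an assumption about the NavFlow endpoint.

```python
import os


def navflow_server_config(project: str, api_key: str) -> dict:
    """Build the MCP server entry for a remote NavFlow endpoint."""
    return {
        # Assumption: the endpoint speaks the streamable HTTP transport.
        "type": "http",
        "url": f"https://mcp.navflow.ai/{project}",
        "headers": {"Authorization": f"Bearer {api_key}"},
    }


def navflow_config_from_env(env=os.environ) -> dict:
    """Read the project and key from the environment, failing early if absent."""
    # NAVFLOW_PROJECT and NAVFLOW_API_KEY are hypothetical variable names.
    required = ("NAVFLOW_PROJECT", "NAVFLOW_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return navflow_server_config(env["NAVFLOW_PROJECT"], env["NAVFLOW_API_KEY"])
```

The returned dict can be passed as the "navflow" entry of mcp_servers, which keeps the key out of source control.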

Step 2 · One query instead of ten

Now the agent reads the whole incident in a single call, keyed by service and windowed:

one call (python)

result = await call_tool("query", {
    "view":   "service_timeline",
    "key":    "api-server",
    "window": "15m",
})

NavFlow returns the signals already merged and time-ordered — metrics, logs, the config change, the deploy, and the alert in one payload:

navflow → service_timeline

=== service_timeline · key=api-server · window=15m ===

[ T-60s ] [deployments] deploy #1847 by alice — "Reduce DB connection pool size
                        for staging parity" (commit a7f3d2e), ~60s before symptoms
[ T-60s ] [config     ] config change: DB_POOL_SIZE 20 -> 1 (drifted from known-good)
[  T-1s ] [docker-logs] ERROR Connection pool exhausted: QueuePool limit of size 1
                        overflow 0 reached, connection timed out, timeout 2.00
[  now  ] [prometheus ] 500 rate user-svc = 5.16/s · p99 = 4901ms ·
                        db_pool_size = 1 · db_connections_active = 1
[  now  ] [alerts     ] FIRING (critical): DBConnectionPoolExhausted on api-server

The causal chain is already in order: the deploy cut the pool to 1, the pool exhausted, requests timed out, 500s spiked. The agent reads one organized view and concludes, instead of fanning out and reassembling. It reaches the same root cause.
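
If you want to post-process the timeline in code instead of handing it straight to the model, a small parser is enough. The sketch below assumes the payload arrives as plain text in exactly the layout shown above. The real wire format may differ, and parse_timeline is a hypothetical helper, not part of NavFlow.

```python
import re

# Matches "[ T-60s ] [deployments] message" and "[  now  ] [alerts     ] message".
ENTRY = re.compile(
    r"^\[\s*(?P<when>T-\d+s|now)\s*\]\s*\[(?P<source>[^\]]+)\]\s*(?P<text>.*)$"
)


def parse_timeline(payload: str) -> list[dict]:
    """Turn the text timeline into dicts with offset_s, source, and text."""
    entries = []
    for line in payload.splitlines():
        match = ENTRY.match(line)
        if match:
            when = match["when"]
            # "now" is offset 0; "T-60s" becomes -60.
            offset = 0 if when == "now" else -int(when[2:-1])
            entries.append({
                "offset_s": offset,
                "source": match["source"].strip(),
                "text": match["text"].strip(),
            })
        elif entries and line.strip() and not line.startswith("==="):
            # Indented continuation of the previous entry's message.
            entries[-1]["text"] += " " + line.strip()
    return entries


sample = """\
=== service_timeline · key=api-server · window=15m ===

[ T-60s ] [config     ] config change: DB_POOL_SIZE 20 -> 1 (drifted from known-good)
[  T-1s ] [docker-logs] ERROR Connection pool exhausted: QueuePool limit of size 1
                        overflow 0 reached, connection timed out, timeout 2.00
[  now  ] [alerts     ] FIRING (critical): DBConnectionPoolExhausted on api-server
"""
events = parse_timeline(sample)
```

With the entries structured, a check such as "did a config change precede the first error?" becomes a comparison of offset_s values.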

With NavFlow, measured: 1 read call, 3 turns, ~97k tokens, $0.15, 23 seconds — same diagnosis.

Step 3 · Let NavFlow trigger the agent

The read path was the pull. NavFlow also handles the push. Instead of waiting for PagerDuty and cold-starting the agent, define a condition over the live stream. This lives in NavFlow, not your agent:

trigger (json)

{
  "name": "error_spike_after_deploy",
  "source": { "view": "service_timeline" },
  "when": {
    "predicate": "error_rate > 0.05 within 10m of deploy_event",
    "group_by": ["service"]
  },
  "emit": "incident_opened"
}
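
To reason about when this trigger should fire, it can help to evaluate the same condition locally. The sketch below is one plausible reading of the predicate for a single service: an error-rate sample above the threshold, observed no more than ten minutes after a deploy. It assumes error_rate is a fraction of requests and that "within 10m of" means "after". NavFlow's actual semantics may differ, and the fires function is hypothetical.

```python
from datetime import datetime, timedelta


def fires(deploy_times, error_samples, threshold=0.05, window=timedelta(minutes=10)):
    """Return True if any sample above `threshold` falls within `window` after a deploy.

    deploy_times:  iterable of datetimes
    error_samples: iterable of (datetime, error_rate) pairs for one service
    """
    for deployed_at in deploy_times:
        for sampled_at, error_rate in error_samples:
            elapsed = sampled_at - deployed_at
            if timedelta(0) <= elapsed <= window and error_rate > threshold:
                return True
    return False


# Illustrative data: a deploy, then a spike one minute later.
deploy = datetime(2026, 6, 9, 11, 43, 42)
spike = [(deploy + timedelta(seconds=30), 0.01), (deploy + timedelta(seconds=60), 0.40)]
too_late = [(deploy + timedelta(minutes=11), 0.40)]

print(fires([deploy], spike))     # spike inside the window
print(fires([deploy], too_late))  # spike outside the window
```

The group_by clause corresponds to running this check once per service, so a spike on one service never fires on another service's deploy.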

On the agent side, handle the event. NavFlow attaches the windowed, correlated timeline as the payload, so the agent boots holding the context and fetches nothing:

handle the trigger (python)

async def on_navflow_trigger(event):
    # event.payload is the same service_timeline, attached by NavFlow
    prompt = (
        "NavFlow trigger fired: error_spike_after_deploy on api-server. "
        "The correlated timeline is attached — confirm the root cause "
        "and recommend the fix.\n\n" + event.payload
    )
    async for message in query(prompt=prompt, options=options):
        handle(message)

Triggered, measured: the condition fired the instant the spike registered and the agent confirmed the root cause in 0 read calls, 1 turn, $0.25, 18 seconds — no PagerDuty, no human, no fan-out.

What we measured

All three are the same agent on the same incident. Only the read path differs.

Metric             Cookbook   NavFlow query   NavFlow triggered
Read calls         10         1               0
Agent turns        12         3               1
Context (tokens)   ~126k      ~97k            least
Cost               $0.66      $0.15           $0.25
Wall clock         51s        23s             18s
Root cause         found      found           found

Ten reads collapse to one, and the cost falls with them, from $0.66 to $0.15, roughly a quarter of the baseline. With a trigger, the agent wakes already holding the context: zero reads and the shortest wall clock (18s), at a cost of $0.25. The same swap, measured across four different faults, is in the extended benchmark.
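
The ratios behind that summary can be checked directly from the table. This sketch only restates the measured figures above and divides each by the cookbook baseline. The dictionary keys are illustrative names.

```python
# Figures copied from the table above.
runs = {
    "cookbook":          {"reads": 10, "turns": 12, "cost_usd": 0.66, "seconds": 51},
    "navflow_query":     {"reads": 1,  "turns": 3,  "cost_usd": 0.15, "seconds": 23},
    "navflow_triggered": {"reads": 0,  "turns": 1,  "cost_usd": 0.25, "seconds": 18},
}

baseline = runs["cookbook"]
for name, run in runs.items():
    cost_ratio = run["cost_usd"] / baseline["cost_usd"]
    time_ratio = run["seconds"] / baseline["seconds"]
    print(f"{name:<18} cost x{cost_ratio:.2f}  wall clock x{time_ratio:.2f}")
```

The single-query run comes out at about 0.23 of the baseline cost and 0.45 of its wall clock. The triggered run comes out at about 0.38 of the cost and 0.35 of the wall clock.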

What stays the same

NavFlow only takes the read path. Everything that makes the agent yours is untouched:

The agentic loop (Claude Agent SDK or any MCP-aware runtime).
Action tools: edit_config_file, run_shell_command, write_postmortem.
Safety hooks (PreToolUse validators, command allowlists).
Your system prompt and reasoning.

The remediation phase — edit, redeploy, verify, post-mortem — runs exactly as it does in the cookbook. You are not rebuilding the agent; you are giving it a saner read path.

Recap

Starting from Anthropic's SRE agent, we swapped its read tools for one NavFlow query and added a stream trigger. The agent reached the same root cause in every run, while the read path went from 10 calls to 1 to 0. The investigation logic stayed the same; only where the data comes from changed.

Next: the extended benchmark stages four distinct faults — pool exhaustion, latency, a bad feature flag, and a dependency outage — each behind a deliberately vague deploy, and runs this same comparison on every one.

Get a NavFlow project at cloud.navflow.ai.