navflow cookbook
The SRE incident-response agent, on NavFlow
A companion to Anthropic's Site Reliability Agent cookbook. We keep that agent and change one thing — its read path.
If you've worked through the Anthropic cookbook, you have an incident-response agent that investigates a production issue, finds the root cause, and remediates it. To investigate, it calls about ten tools — Prometheus, container logs, config files, deploy history, alerts — one at a time, and correlates the results in its own context.
In this recipe we replace those ten reads with a single NavFlow query that returns the same signals already correlated, then add a trigger so NavFlow wakes the agent with the context attached. The agent's reasoning, action tools, and safety hooks are untouched. Every number below is a real run of the cookbook's agent against its own demo incident, with only the read path swapped.
What you'll do
- Run the cookbook's agent as a baseline and measure its read path.
- Point the agent at NavFlow and collapse the read path to one query.
- Add a trigger so NavFlow wakes the agent with the incident already attached.
Prerequisites
- The Anthropic SRE cookbook running locally — its Docker demo (api-server, Postgres, Prometheus, a traffic generator) is the system we break and fix.
- The Claude Agent SDK and an Anthropic API key.
- A NavFlow project — its MCP endpoint and an API key.
The starting point: the cookbook's read path
The demo incident is a classic one: a deploy reduces the api-server's database connection pool from 20 to 1. With traffic flowing, the single connection is overwhelmed, 500s spike, and p99 latency blows past a second. In the cookbook, the agent investigates by fanning out across its tools:
# Each signal is a separate round-trip; the agent merges them in its head.
health = await call_tool("get_service_health", {})
alerts = await call_tool("get_alerts", {})
metrics = await call_tool("query_metrics", {"promql": 'rate(http_requests_total{status="500"}[1m])'})
latency = await call_tool("query_metrics", {"promql": "histogram_quantile(0.99, ...)"})
pool = await call_tool("query_metrics", {"promql": "db_connections_active"})
config = await call_tool("read_config_file", {"path": "config/api-server.env"})
logs = await call_tool("get_container_logs", {"container": "api-server"})
deploys = await call_tool("get_recent_deployments", {})
# ... about ten calls, then a manual time-merge to find the cause
It works: the agent correctly traces the spike back to the deploy. But every call is a live round-trip, and the agent holds hundreds of metric series and thousands of log lines in context while it reasons.
Baseline, measured: 10 read calls, 12 agent turns, ~126k tokens of context, $0.66, 51 seconds (claude-opus-4-8).
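One way to collect figures like these is to tally the agent's message stream as it runs. The sketch below is a minimal illustration, not the harness used for the numbers above. `RunStats`, `tally`, and `READ_TOOLS` are hypothetical names, and the attributes it reads (`content`, `name`, `num_turns`, `total_cost_usd`, `duration_ms`) are assumptions about what the SDK's message objects expose, so check them against your SDK version.

```python
from dataclasses import dataclass

# Read-tool names as they appear in this recipe's call list.
READ_TOOLS = {
    "get_service_health", "get_alerts", "get_recent_deployments",
    "query_metrics", "read_config_file", "list_metrics", "get_container_logs",
}


@dataclass
class RunStats:
    read_calls: int = 0
    turns: int = 0
    cost_usd: float = 0.0
    seconds: float = 0.0


def tally(messages, read_tools=READ_TOOLS):
    """Count read-tool calls and copy run totals from the final result message."""
    stats = RunStats()
    for msg in messages:
        # Assumption: tool-use blocks carry a `name`; text blocks do not.
        for block in getattr(msg, "content", None) or []:
            name = getattr(block, "name", None)
            # MCP tools are namespaced as mcp__<server>__<tool>; compare the last part.
            if name and name.split("__")[-1] in read_tools:
                stats.read_calls += 1
        # Assumption: only the final result message carries run totals.
        if hasattr(msg, "num_turns"):
            stats.turns = msg.num_turns
            stats.cost_usd = msg.total_cost_usd or 0.0
            stats.seconds = msg.duration_ms / 1000
    return stats
```

Feeding it the messages yielded by `query(...)` gives one `RunStats` per run, which makes the three configurations in this recipe directly comparable.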
What the agent actually saw
These are not illustrative figures. We ran the cookbook agent against its demo incident and captured the run. Here is the investigation it actually performed — ten read calls, in the order it made them:
1. get_service_health
2. get_alerts
3. get_recent_deployments
4. query_metrics
5. query_metrics
6. query_metrics
7. read_config_file config/api-server.env
8. read_config_file config/api-server.env.backup
9. list_metrics
10. get_container_logs api-server
Each call returned raw signal that the agent then had to correlate itself. The metrics it pulled from Prometheus while the incident was live:
500 rate user-svc = 5.16 /s (baseline ~0.1)
500 rate payment-svc = 0.08 /s
500 rate auth-svc = 0.02 /s
p99 latency = 4901 ms
db_pool_size = 1
db_connections_active = 1
db_connections_max = 1
And the api-server container logs it read — the smoking gun:
2026-06-09 11:44:41 ERROR Connection pool exhausted: QueuePool limit of
size 1 overflow 0 reached, connection timed out,
timeout 2.00 (sqlalche.me/e/20/3o7r)
172.19.0.7 "GET /api/users HTTP/1.1" 500 Internal Server Error
Ten calls and a manual merge later, the agent had its answer: the deploy cut the pool to 1. That correlation, across metrics, logs, config, and deploys, is the work NavFlow moves server-side.
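To make the "manual merge" concrete: it amounts to interleaving per-source event lists by timestamp. The sketch below shows that idea with the standard library. It is an illustration, not NavFlow's implementation. `Event` and `merge_timeline` are hypothetical names, and the deploy and config timestamps are placeholders derived from the "~60s before symptoms" offset, not captured values.

```python
import heapq
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    ts: datetime
    source: str
    text: str


def merge_timeline(*streams):
    """Merge event lists (each already sorted by time) into one ordered timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.ts))


# Placeholder timestamps for the deploy and config change; the log time is from the run.
deploys = [Event(datetime(2026, 6, 9, 11, 43, 42), "deployments", "deploy #1847")]
config = [Event(datetime(2026, 6, 9, 11, 43, 42), "config", "DB_POOL_SIZE 20 -> 1")]
logs = [Event(datetime(2026, 6, 9, 11, 44, 41), "docker-logs", "Connection pool exhausted")]

timeline = merge_timeline(logs, deploys, config)
for event in timeline:
    print(event.ts.time(), f"[{event.source}]", event.text)
```

Once the events are in one ordered list, the deploy and config change sit directly ahead of the first pool-exhaustion error, which is the relationship the agent otherwise has to reconstruct across ten tool results.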
Step 1 · Point the agent at NavFlow
NavFlow ingests the same systems once and exposes one tool over MCP. Replace the ten read tools in the agent's options with the NavFlow server:
from claude_agent_sdk import ClaudeAgentOptions, query

options = ClaudeAgentOptions(
    system_prompt=SRE_SYSTEM_PROMPT,  # unchanged from the cookbook
    mcp_servers={
        "navflow": {
            # Assumes a streamable-HTTP MCP endpoint; use "sse" if yours is SSE.
            "type": "http",
            "url": "https://mcp.navflow.ai/<your-project>",
            "headers": {"Authorization": f"Bearer {NAVFLOW_KEY}"},
        }
    },
    allowed_tools=["mcp__navflow__query"],  # the read tools are gone
    model="claude-opus-4-8",
)
Nothing else about the agent changes: same system prompt, same action tools, same hooks.
Step 2 · One query instead of ten
Now the agent reads the whole incident in a single call, keyed by service and windowed:
result = await call_tool("query", {
    "view": "service_timeline",
    "key": "api-server",
    "window": "15m",
})
NavFlow returns the signals already merged and time-ordered (metrics, logs, the config change, the deploy, and the alert) in one payload:
=== service_timeline · key=api-server · window=15m ===
[ T-60s ] [deployments] deploy #1847 by alice — "Reduce DB connection pool size
for staging parity" (commit a7f3d2e), ~60s before symptoms
[ T-60s ] [config ] config change: DB_POOL_SIZE 20 -> 1 (drifted from known-good)
[ T-1s ] [docker-logs] ERROR Connection pool exhausted: QueuePool limit of size 1
overflow 0 reached, connection timed out, timeout 2.00
[ now ] [prometheus ] 500 rate user-svc = 5.16/s · p99 = 4901ms ·
db_pool_size = 1 · db_connections_active = 1
[ now ] [alerts ] FIRING (critical): DBConnectionPoolExhausted on api-server
The causal chain is already in order: the deploy cut the pool to 1, the pool exhausted, requests timed out, 500s spiked. The agent reads one organized view and concludes, instead of fanning out and reassembling. It reaches the same root cause.
With NavFlow, measured: 1 read call, 3 turns, ~97k tokens, $0.15, 23 seconds — same diagnosis.
Step 3 · Let NavFlow trigger the agent
The read path was the pull. NavFlow also handles the push. Instead of waiting for PagerDuty and cold-starting the agent, define a condition over the live stream. This lives in NavFlow, not your agent:
{
  "name": "error_spike_after_deploy",
  "source": { "view": "service_timeline" },
  "when": {
    "predicate": "error_rate > 0.05 within 10m of deploy_event",
    "group_by": ["service"]
  },
  "emit": "incident_opened"
}
On the agent side, handle the event. NavFlow attaches the windowed, correlated timeline as the payload, so the agent boots holding the context and fetches nothing:
async def on_navflow_trigger(event):
    # event.payload is the same service_timeline, attached by NavFlow
    prompt = (
        "NavFlow trigger fired: error_spike_after_deploy on api-server. "
        "The correlated timeline is attached — confirm the root cause "
        "and recommend the fix.\n\n" + event.payload
    )
    async for message in query(prompt=prompt, options=options):
        handle(message)
Triggered, measured: the condition fired the instant the spike registered and the agent confirmed the root cause in 0 read calls, 1 turn, $0.25, 18 seconds, with no PagerDuty, no human, and no fan-out.
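To make the trigger's predicate concrete, here is one plausible reading of `error_rate > 0.05 within 10m of deploy_event`, grouped by service. This is a sketch of the semantics under stated assumptions, not NavFlow's evaluator: it treats `error_rate` as a fraction of requests, counts only samples taken at or after a deploy, and `services_to_open` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime, timedelta

THRESHOLD = 0.05                # assumed to be a fraction of requests
WINDOW = timedelta(minutes=10)  # "within 10m of deploy_event"


def services_to_open(deploys, samples):
    """Return the services whose error rate exceeded the threshold soon after a deploy.

    deploys: iterable of (service, deployed_at)
    samples: iterable of (service, sampled_at, error_rate)
    """
    deploys_by_service = defaultdict(list)
    for service, deployed_at in deploys:
        deploys_by_service[service].append(deployed_at)

    fired = set()
    for service, sampled_at, error_rate in samples:
        if error_rate <= THRESHOLD:
            continue
        # group_by service: only deploys of the same service can match.
        if any(timedelta(0) <= sampled_at - d <= WINDOW
               for d in deploys_by_service[service]):
            fired.add(service)
    return sorted(fired)


t0 = datetime(2026, 1, 1, 12, 0, 0)  # illustrative deploy time
print(services_to_open(
    deploys=[("api-server", t0)],
    samples=[
        ("api-server", t0 + timedelta(seconds=60), 0.40),  # spike after deploy
        ("auth-svc", t0 + timedelta(seconds=60), 0.40),    # spike, but no deploy
    ],
))
```

Grouping by service is what keeps an unrelated spike in `auth-svc` from opening an incident against the `api-server` deploy.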
What we measured
All three are the same agent on the same incident. Only the read path differs.
| Metric | Cookbook | NavFlow query | NavFlow triggered |
|---|---|---|---|
| Read calls | 10 | 1 | 0 |
| Agent turns | 12 | 3 | 1 |
| Context (tokens) | ~126k | ~97k | least |
| Cost | $0.66 | $0.15 | $0.25 |
| Wall clock | 51s | 23s | 18s |
| Root cause | found | found | found |
Ten reads collapse to one, and the cost falls with them: $0.15 against $0.66, roughly a quarter of the baseline. With a trigger, the agent wakes already holding the answer. The same swap, measured across four different faults, is in the extended benchmark.
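The ratios behind that summary can be checked directly from the table. The snippet below only recomputes the figures already reported; `runs` and `relative_to_baseline` are hypothetical names, and the context row is left out because the triggered run's token count is not given as a number.

```python
# Figures copied from the table above.
runs = {
    "cookbook":          {"read_calls": 10, "turns": 12, "cost_usd": 0.66, "seconds": 51},
    "navflow_query":     {"read_calls": 1,  "turns": 3,  "cost_usd": 0.15, "seconds": 23},
    "navflow_triggered": {"read_calls": 0,  "turns": 1,  "cost_usd": 0.25, "seconds": 18},
}


def relative_to_baseline(runs, baseline="cookbook"):
    """Express every metric of each run as a fraction of the baseline run."""
    base = runs[baseline]
    return {
        name: {metric: round(value / base[metric], 2) for metric, value in run.items()}
        for name, run in runs.items()
        if name != baseline
    }


ratios = relative_to_baseline(runs)
print(ratios["navflow_query"]["cost_usd"])      # 0.23
print(ratios["navflow_query"]["seconds"])       # 0.45
print(ratios["navflow_triggered"]["cost_usd"])  # 0.38
```

The single-query run comes in at about 23% of the baseline cost and 45% of its wall-clock time. The triggered run is the fastest but costs more than the single query, about 38% of baseline.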
What stays the same
NavFlow only takes the read path. Everything that makes the agent yours is untouched:
- The agentic loop (Claude Agent SDK or any MCP-aware runtime).
- Action tools: edit_config_file, run_shell_command, write_postmortem.
- Safety hooks (PreToolUse validators, command allowlists).
- Your system prompt and reasoning.
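For readers who have not looked inside the safety hooks, a command allowlist boils down to a check like the one below. This is a generic sketch, not the cookbook's validator: `ALLOWED_COMMANDS`, `FORBIDDEN_CHARS`, and `is_command_allowed` are hypothetical, and a real hook would wrap a check like this in the SDK's PreToolUse callback.

```python
import shlex

# Hypothetical allowlist; the cookbook's own list may differ.
ALLOWED_COMMANDS = {"docker", "curl"}

# Conservative: refuse anything that could chain, substitute, or redirect.
FORBIDDEN_CHARS = set(";|&`$<>\n")


def is_command_allowed(command, allowed=ALLOWED_COMMANDS):
    """Allow a command only if it is a single invocation of an allowlisted executable."""
    if FORBIDDEN_CHARS & set(command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in allowed


print(is_command_allowed("docker compose restart api-server"))  # True
print(is_command_allowed("docker ps; rm -rf /"))                # False
```

Because the check runs on the action path, swapping the read tools for a NavFlow query leaves it untouched.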
The remediation phase — edit, redeploy, verify, post-mortem — runs exactly as it does in the cookbook. You are not rebuilding the agent; you are giving it a saner read path.
Recap
Starting from Anthropic's SRE agent, we swapped its read tools for one NavFlow query and added a stream trigger. The agent reached the same root cause in every run, while the read path went from 10 calls to 1 to 0. The investigation logic stayed the same; only where the data comes from changed.
Next: the extended benchmark stages four distinct faults — pool exhaustion, latency, a bad feature flag, and a dependency outage — each behind a deliberately vague deploy, and runs this same comparison on every one.
Get a NavFlow project at cloud.navflow.ai.