navflow · benchmark

One read path, four incidents

An incident-response agent that investigates a production issue spends most of its calls just gathering context — fanning out across metrics, logs, config, and deploys one tool at a time. We measured what happens when you replace that whole fan-out with a single correlated query.

To make the test fair we needed a system that can break in more than one way. So we rebuilt the testbed from Anthropic's Site Reliability Agent cookbook into something harder: a running app (FastAPI, Postgres, Prometheus, Grafana) that stages four independent faults, each hidden behind a deliberately vague deploy. Then we ran the same agent against each fault two ways and measured every run on a real Anthropic API key.

The testbed, and why we rebuilt it

Anthropic's cookbook is a great demo, but as a benchmark it has two properties that make NavFlow look good for the wrong reasons:

Our rebuild keeps the same clean Prometheus/Grafana shape but fixes both. The api-server now exposes four independent fault levers, injectable at runtime with no redeploy, each producing a distinguishable signature in the metrics and logs:

IncidentWhat the agent sees
DB pool exhaustionpool-exhausted logs, db_connections pinned at 1, 500s on /api/users
Latency regressionp99 blows past the request timeout, pool itself healthy
Error spike (bad flag)app-exception logs (KeyError: user_tier), DB healthy
Dependency outagedependency_up = 0, upstream-unreachable on /api/orders

The deploys are vague on purpose

Each fault is preceded by a deploy in the changelog — the correlation signal a real SRE leans on. But unlike the cookbook, the commit messages never name the knob. They read like real commits, so the agent has to connect the deploy to the symptoms instead of reading the answer:

changelog · what the deploy says vs what it did
fault                deploy commit (what the agent sees)                what it hid
──────────────────────────────────────────────────────────────────────────────────────
db_pool exhaustion   "Align resource limits with staging environment"   db_pool_size 20 → 1
latency regression   "Add synchronous audit-log write to user lookup"   inject_latency_ms 0 → 800
error spike          "Ship new pricing-tier feature flag"               error_rate 0 → 0.30
dependency outage    "Roll out v2 checkout integration"                 payments-api forced down

This matters because it tests the thing that's easy to fake: whether a correlated read actually helps the agent reason, or just lowers a call count. The agent gets the same generic prompt every time — “something is wrong, investigate” — with no hint of which fault is live.

The two read paths

The two agents are byte-for-byte identical — same loop, same model, same system and task prompts, same grading. The only difference is the tools they're handed:

What we measured

One run per incident per path — eight runs total — each on a real Anthropic API key (not subscription auth), so the cost column is actual billing. Every run got a fresh nonce in its prompt to force a cold cache, so no run is flattered by a previous one's warm cache. Baseline and NavFlow are shown side by side.

IncidentBaselineNavFlowRoot
readsturnscostreadsturnscost
db_pool_exhaustion79$0.5113$0.13both
latency_regression1113$0.2313$0.15both
error_spike911$0.2313$0.14both
dependency_outage810$0.2113$0.16both
total3543$1.17412$0.588/8

Across all four faults: 8.8× fewer reads, 3.6× fewer turns, ~2× cheaper — and the same root cause found in all eight runs.

The number that holds up isn't any single row — it's the shape. NavFlow is dead flat at 1 read / 3 turns on every incident, no matter which fault is live. The baseline swings from 7 to 11 reads depending on how hard the fault is to disambiguate. The correlated query removes the search variance, not just the call count.

The same answer, from one read

Here is the NavFlow agent on the latency regression — the one whose deploy said "Add synchronous audit-log write to user lookup", never inject_latency_ms = 800. From one query it reasoned the whole chain:

navflow agent · latency_regression (1 read, 3 turns)
Root cause: the deploy by bob introduced a synchronous audit-log
write into the user-lookup path. That extra blocking DB op pushed
p99 past the request timeout, so /api/users and the user-dependent
/api/orders time out and return 500. This is a latency-induced
timeout, NOT pool exhaustion — the pool is healthy, only a few
connections active.

It connected a vague commit to the p99 metric and the timeout config, and explicitly ruled out the pool-exhaustion red herring — the kind of correlation the baseline reassembles by hand across a dozen separate reads.

How to read these numbers

What this shows

Across four independent faults, hidden behind commit messages that never name the bug, the same SRE agent reached the same root cause whether it fanned out across a handful of tools or made one NavFlow query. The read path went from 35 calls to 4 over the set, turns from 43 to 12, and NavFlow held a flat 1 read / 3 turns regardless of how messy the incident was. The investigation logic never changed — only where the data comes from did. You don't rebuild the agent; you give it a saner read path.

Get a NavFlow project at cloud.navflow.ai.