navflow · benchmark
One read path, four incidents
An incident-response agent that investigates a production issue spends most of its calls just gathering context — fanning out across metrics, logs, config, and deploys one tool at a time. We measured what happens when you replace that whole fan-out with a single correlated query.
To make the test fair we needed a system that can break in more than one way. So we rebuilt the testbed from Anthropic's Site Reliability Agent cookbook into something harder: a running app (FastAPI, Postgres, Prometheus, Grafana) that stages four independent faults, each hidden behind a deliberately vague deploy. Then we ran the same agent against each fault two ways and measured every run on a real Anthropic API key.
The testbed, and why we rebuilt it
Anthropic's cookbook is a great demo, but as a benchmark it has two properties that would make a NavFlow win look good for the wrong reasons:
- One lever, one incident. Its api-server exposes a single knob — the DB connection pool size — so it can only ever stage the same pool-exhaustion incident. A win on one fault could just be luck.
- The deploy names the bug. Its changelog says "Reduce DB connection pool size for staging parity": the commit message is the answer. An agent doesn't have to reason; it reads the diagnosis off the deploy log.
Our rebuild keeps the same clean Prometheus/Grafana shape but fixes both. The api-server now exposes four independent fault levers, injectable at runtime with no redeploy, each producing a distinguishable signature in the metrics and logs:
| Incident | What the agent sees |
|---|---|
| DB pool exhaustion | pool-exhausted logs, db_connections pinned at 1, 500s on /api/users |
| Latency regression | p99 blows past the request timeout, pool itself healthy |
| Error spike (bad flag) | app-exception logs (KeyError: user_tier), DB healthy |
| Dependency outage | dependency_up = 0, upstream-unreachable on /api/orders |
The deploys are vague on purpose
Each fault is preceded by a deploy in the changelog — the correlation signal a real SRE leans on. But unlike the cookbook, the commit messages never name the knob. They read like real commits, so the agent has to connect the deploy to the symptoms instead of reading the answer:
| Fault | Deploy commit (what the agent sees) | What it hid |
|---|---|---|
| db_pool exhaustion | "Align resource limits with staging environment" | db_pool_size 20 → 1 |
| latency regression | "Add synchronous audit-log write to user lookup" | inject_latency_ms 0 → 800 |
| error spike | "Ship new pricing-tier feature flag" | error_rate 0 → 0.30 |
| dependency outage | "Roll out v2 checkout integration" | payments-api forced down |
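As a rough illustration of the mapping above, the four faults can be written as a small table of lever changes applied to a healthy default. This is only a sketch under stated assumptions: the `Levers` class, the `stage` helper, and the `payments_api_up` field are hypothetical names, and the real testbed's injection mechanism is not shown in this write-up. The lever names `db_pool_size`, `inject_latency_ms`, and `error_rate` and their values come from the changelog mapping.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Levers:
    """Hypothetical container for the four runtime fault levers."""
    db_pool_size: int = 20
    inject_latency_ms: int = 0
    error_rate: float = 0.0
    payments_api_up: bool = True  # assumed name for the dependency lever


HEALTHY = Levers()

# fault name -> (commit message the agent sees, lever change it hides)
FAULTS = {
    "db_pool_exhaustion": ("Align resource limits with staging environment", {"db_pool_size": 1}),
    "latency_regression": ("Add synchronous audit-log write to user lookup", {"inject_latency_ms": 800}),
    "error_spike": ("Ship new pricing-tier feature flag", {"error_rate": 0.30}),
    "dependency_outage": ("Roll out v2 checkout integration", {"payments_api_up": False}),
}


def stage(fault: str):
    """Return the visible commit message and the lever state after the fault."""
    commit, change = FAULTS[fault]
    return commit, replace(HEALTHY, **change)


commit, levers = stage("latency_regression")
print(commit)
print(levers)
```

Only one lever moves per fault, and none of the commit messages contains the name of the lever it changes, which is the property the benchmark relies on.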
This matters because it tests the thing that's easy to fake: whether a correlated read actually helps the agent reason, or just lowers a call count. The agent gets the same generic prompt every time — “something is wrong, investigate” — with no hint of which fault is live.
The two read paths
Apart from their tools, the two agents are byte-for-byte identical: same loop, same model, same system and task prompts, same grading. The only difference is the tools they're handed:
- Baseline: the provider-style read path. Each signal is its own tool (service health, metrics, logs, config, deploys), and the agent fans out across them every investigation.
- NavFlow: the same agent with the fan-out tools replaced by one query(view, key, window). NavFlow ingests the same five signals into its own store and serves them back as one time-ordered, correlated timeline, in a single read.
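To make the idea of a single correlated read concrete, here is a minimal in-memory sketch of a store that ingests per-source events and serves them back as one time-ordered timeline. Only the `query(view, key, window)` shape comes from the text. Everything else (the `Event` fields, the `TimelineStore` class, the `"incident"` view, and the sample events) is a hypothetical stand-in, not NavFlow's actual API or storage.

```python
import heapq
from dataclasses import dataclass

# Hypothetical view definition: which signal sources a view pulls together.
VIEWS = {"incident": ("health", "metrics", "logs", "config", "deploys")}


@dataclass(frozen=True)
class Event:
    ts: float     # event time in seconds
    source: str   # one of the five signals
    key: str      # what the event is about, e.g. a service name
    detail: str


class TimelineStore:
    """Toy store for illustration only."""

    def __init__(self):
        self._streams = {}  # source -> list of Event

    def ingest(self, event):
        self._streams.setdefault(event.source, []).append(event)

    def query(self, view, key, window):
        """Events for `key` inside `window`, across the view's sources, oldest first."""
        start, end = window
        streams = [
            sorted(
                (e for e in self._streams.get(s, []) if e.key == key and start <= e.ts <= end),
                key=lambda e: e.ts,
            )
            for s in VIEWS[view]
        ]
        # heapq.merge interleaves the already-sorted per-source streams by timestamp.
        return list(heapq.merge(*streams, key=lambda e: e.ts))


# Illustrative events, not captured data.
store = TimelineStore()
store.ingest(Event(130.0, "metrics", "api-server", "p99 above request timeout"))
store.ingest(Event(100.0, "deploys", "api-server", "Add synchronous audit-log write to user lookup"))
store.ingest(Event(120.0, "logs", "api-server", "request timed out on /api/users"))
store.ingest(Event(110.0, "logs", "billing", "unrelated"))

timeline = store.query("incident", "api-server", (90.0, 140.0))
for e in timeline:
    print(e.ts, e.source, e.detail)
```

The point of the design is that the deploy, the log line, and the metric arrive already interleaved in time, so the agent sees the deploy sitting just before the symptoms instead of stitching that order together from separate tool calls.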
What we measured
One run per incident per path — eight runs total — each on a real Anthropic API key (not subscription auth), so the cost column is actual billing. Every run got a fresh nonce in its prompt to force a cold cache, so no run is flattered by a previous one's warm cache. Baseline and NavFlow are shown side by side.
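A per-run nonce can be as simple as a random token added to the prompt text. The sketch below assumes the nonce is prepended to the task prompt; the write-up does not say where it was placed or how it was formatted, so `cold_prompt` and the `[run-nonce: ...]` tag are hypothetical.

```python
import uuid

# Generic task prompt, paraphrased from the description of the benchmark.
TASK_PROMPT = "Something is wrong, investigate."


def cold_prompt(task: str = TASK_PROMPT) -> str:
    """Prefix the task with a fresh random token so no two runs share prompt text."""
    nonce = uuid.uuid4().hex
    return f"[run-nonce: {nonce}]\n{task}"


first = cold_prompt()
second = cold_prompt()
print(first)
print(first != second)
```

The task wording is unchanged between runs; only the token differs, so the agent gets no extra hint about which fault is live.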
| Incident | Baseline reads | Baseline turns | Baseline cost | NavFlow reads | NavFlow turns | NavFlow cost | Root cause found |
|---|---|---|---|---|---|---|---|
| db_pool_exhaustion | 7 | 9 | $0.51 | 1 | 3 | $0.13 | both |
| latency_regression | 11 | 13 | $0.23 | 1 | 3 | $0.15 | both |
| error_spike | 9 | 11 | $0.23 | 1 | 3 | $0.14 | both |
| dependency_outage | 8 | 10 | $0.21 | 1 | 3 | $0.16 | both |
| total | 35 | 43 | $1.17 | 4 | 12 | $0.58 | 8/8 |
Across all four faults: 8.8× fewer reads, 3.6× fewer turns, ~2× cheaper — and the same root cause found in all eight runs.
The number that holds up isn't any single row — it's the shape. NavFlow is dead flat at 1 read / 3 turns on every incident, no matter which fault is live. The baseline swings from 7 to 11 reads depending on how hard the fault is to disambiguate. The correlated query removes the search variance, not just the call count.
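The read and turn figures can be re-derived directly from the table. The snippet below is just that arithmetic, with the per-incident values copied from the results above; the helper names are ours.

```python
# (reads, turns) per incident, copied from the results table.
baseline = {
    "db_pool_exhaustion": (7, 9),
    "latency_regression": (11, 13),
    "error_spike": (9, 11),
    "dependency_outage": (8, 10),
}
navflow = {incident: (1, 3) for incident in baseline}


def totals(runs):
    """Sum reads and turns over all incidents."""
    reads, turns = zip(*runs.values())
    return sum(reads), sum(turns)


def read_spread(runs):
    """Smallest and largest read count across incidents."""
    reads = [r for r, _ in runs.values()]
    return min(reads), max(reads)


b_reads, b_turns = totals(baseline)
n_reads, n_turns = totals(navflow)

print(f"reads: {b_reads} -> {n_reads} ({b_reads / n_reads:.2f}x)")
print(f"turns: {b_turns} -> {n_turns} ({b_turns / n_turns:.2f}x)")
print("baseline read spread:", read_spread(baseline))
print("navflow read spread:", read_spread(navflow))
```

The ratios come out at 8.75 and about 3.58, which round to the 8.8× and 3.6× quoted above. The spread is 7 to 11 reads for the baseline and a constant 1 for NavFlow.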
The same answer, from one read
Here is the NavFlow agent on the latency regression — the one whose deploy said "Add synchronous audit-log write to user lookup", never inject_latency_ms = 800. From one query it reasoned the whole chain:
Root cause: the deploy by bob introduced a synchronous audit-log write into the user-lookup path. That extra blocking DB op pushed p99 past the request timeout, so /api/users and the user-dependent /api/orders time out and return 500. This is a latency-induced timeout, NOT pool exhaustion — the pool is healthy, only a few connections active.
It connected a vague commit to the p99 metric and the timeout config, and explicitly ruled out the pool-exhaustion red herring. That is the kind of correlation the baseline reassembled by hand across 11 separate reads on the same incident.
How to read these numbers
- Reads and turns are the real result. They're cache-independent and dead stable: 35 → 4 reads, 43 → 12 turns, with NavFlow fixed at 1 / 3 every time. Lead with these.
- Cost is noisier than it looks. The db_pool baseline ($0.51) is a first-run cache outlier — the other three baselines sit near $0.22. Drop it and the cost gap is ~1.5×, not 2×. We cite cost as a supporting figure, never the headline.
- Tokens are not a NavFlow win. One query returns a fat correlated payload; on the db_pool incident NavFlow actually read more cached context than the baseline. Context size is not a reliable axis here, so we don't claim it.
- Accuracy is the floor. If the cheaper path got the wrong answer, none of the above would matter. It didn't — 8/8, both paths, including the three faults the original cookbook can't even stage.
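The cost caveat in the list above is easy to check from the per-incident costs. The sketch below works in integer cents to avoid float noise, with values copied from the results table; `cost_ratio` is our own helper. Note that the rounded per-row baseline costs sum to $1.18, one cent more than the table's $1.17 total, presumably because the total was computed from unrounded figures. The ratios are not affected at this precision.

```python
# Per-incident cost in cents, copied from the results table.
baseline_cents = {
    "db_pool_exhaustion": 51,
    "latency_regression": 23,
    "error_spike": 23,
    "dependency_outage": 21,
}
navflow_cents = {
    "db_pool_exhaustion": 13,
    "latency_regression": 15,
    "error_spike": 14,
    "dependency_outage": 16,
}


def cost_ratio(exclude=()):
    """Baseline cost divided by NavFlow cost, optionally skipping some incidents."""
    b = sum(c for k, c in baseline_cents.items() if k not in exclude)
    n = sum(c for k, c in navflow_cents.items() if k not in exclude)
    return b / n


all_four = cost_ratio()
without_outlier = cost_ratio(exclude=("db_pool_exhaustion",))

print(f"all four incidents: {all_four:.2f}x")
print(f"without db_pool:    {without_outlier:.2f}x")
```

With all four incidents the ratio is about 2.0; with the db_pool run excluded it drops to about 1.5, which is why cost is treated as a supporting figure rather than the headline.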
What this shows
Across four independent faults, hidden behind commit messages that never name the bug, the same SRE agent reached the same root cause whether it fanned out across a handful of tools or made one NavFlow query. The read path went from 35 calls to 4 over the set, turns from 43 to 12, and NavFlow held a flat 1 read / 3 turns regardless of how messy the incident was. The investigation logic never changed — only where the data comes from did. You don't rebuild the agent; you give it a saner read path.
Get a NavFlow project at cloud.navflow.ai.