navflow for ai sre

Your SRE agent makes eight calls to know what just happened. NavFlow answers in one.

If you built on the Anthropic SRE cookbook, your agent wraps Prometheus, PagerDuty, and GitHub in a dozen MCP tools, then calls them one at a time and stitches the results together in its head. NavFlow pulls those sources in and keeps them ready, so your agent makes one call instead of eight — faster resolution, smaller token bill, same agent.


Free tier · no credit card required

without navflow

    # 8 tool calls / loop
    query_metrics("api")
    get_logs("api")
    get_alerts("api")
    get_deploys("api")
    # ...4 more, then
    # join by hand

with navflow

    # one MCP read
    query(
        view="service_timeline",
        key="api",
        window="30m"
    )
    # metrics+logs+alerts+deploys, time-ordered

§ 01/the pattern you built today

You followed the cookbook. This is the shape you ended up with.

A PagerDuty webhook wakes your agent. It reasons, calls a tool, reads a vendor API, reasons again, calls the next one, and eventually edits a config or opens a PR. It works. We built one too. Here is the read path it leaves you with.

PagerDuty webhook ─► agent loop
12+ MCP tools · one vendor round-trip each

    query_metrics()         Prometheus HTTP API
    get_logs()              Loki / Datadog
    get_alerts()            Alerts API
    get_recent_deploys()    GitHub API
    get_pod_status()        Docker / k8s
    read_runbook()          file system

Plus the action tools, the safety hooks, and the join logic the agent runs in its head on every loop.

§ 02/where it breaks under load

It works in the demo. Then production load arrives.

01 · Vendor API rate limits compound.
Every tool call hits a live API. Run the agent at 100 alerts a day across four services and the vendors start throttling you.

02 · The agent does the joins in its head.
“Everything on service X in the last 30 minutes” is four tool calls and a manual time-merge inside the reasoning loop. Slow, and lossy.

03 · Every invocation starts cold.
No memory across incidents. The agent rediscovers the same patterns every time it is paged.

04 · You react, you do not predict.
The agent runs when PagerDuty fires, which means it runs on whatever PagerDuty thinks is alertable. Real anomalies show up earlier than that.

05 · More stacks mean more tools.
Customer A on Prometheus, customer B on Honeycomb. Each new source is another MCP tool to write and keep alive.

06 · Replay is forensics.
When the agent acts wrong at 3am, the vendor APIs have already aggregated the evidence away. You cannot reconstruct what it saw.
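
To make item 02 concrete, here is a rough sketch of the time-merge an agent ends up doing after four separate tool calls. It is an illustration only, not code from the cookbook or from NavFlow: the `Event` type, the sample data, and `merge_timeline` are all hypothetical, and it assumes each tool returns its events already sorted by timestamp.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    ts: datetime
    source: str   # which tool produced the event, e.g. "metrics" or "deploys"
    detail: str


def merge_timeline(streams, now, window):
    """Merge per-source event lists (each sorted by ts) into one timeline,
    keeping only events inside [now - window, now]."""
    cutoff = now - window
    merged = heapq.merge(*streams, key=lambda e: e.ts)
    return [e for e in merged if cutoff <= e.ts <= now]


now = datetime(2024, 1, 1, 3, 0)

# Hypothetical results of four separate tool calls for service "api".
metrics = [Event(now - timedelta(minutes=12), "metrics", "p99 latency 2.1s")]
logs = [
    Event(now - timedelta(minutes=45), "logs", "cache warmup"),  # outside the window
    Event(now - timedelta(minutes=11), "logs", "connection refused"),
]
alerts = [Event(now - timedelta(minutes=10), "alerts", "HighErrorRate firing")]
deploys = [Event(now - timedelta(minutes=14), "deploys", "deploy of api")]

timeline = merge_timeline([metrics, logs, alerts, deploys], now, timedelta(minutes=30))
for e in timeline:
    print(e.ts.strftime("%H:%M"), e.source, e.detail)
```

Done inside the reasoning loop, every one of these events passes through the model's context before the merge happens, which is where the latency and the loss come from.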

§ 03/the architecture navflow gives you

Keep your agent. Replace its read path.

NavFlow sits between your sources and your agent. Sources stream in and land losslessly, get normalized into keyed, windowed views, and are served to your agent through one MCP endpoint. Triggers fire on conditions you define over the live stream, not only when PagerDuty does.

Raw storage (S3)

Every event lands losslessly. Replay any window, audit anything the agent acted on.

Derived layer

Typed, keyed, windowed views — correlate a deploy with an incident in one query. The agent reads shapes that fit its reasoning loop, not raw payloads.

Catalog

What sources and views exist, their schema, their freshness, what reads from what. You browse it; your agent introspects it.

Trigger engine

Define a condition over the live stream. It fires when the condition is met. PagerDuty becomes one signal, not the only one.
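
As a mental model for the derived layer, a keyed, windowed view can be pictured as an index from key to time-ordered events. The toy class below is a sketch under that assumption only. `KeyedWindowView`, its method names, and the sample payloads are hypothetical and say nothing about how NavFlow stores or serves data.

```python
from collections import defaultdict
from operator import itemgetter


class KeyedWindowView:
    """Toy in-memory view: events are grouped by key and served time-ordered."""

    def __init__(self):
        self._by_key = defaultdict(list)  # key -> list of (ts, source, payload)

    def ingest(self, key, ts, source, payload):
        # Events may arrive out of order; ordering is applied at query time.
        self._by_key[key].append((ts, source, payload))

    def query(self, key, now, window_s):
        """Return events for `key` with now - window_s <= ts <= now, oldest first."""
        lo = now - window_s
        hits = [e for e in self._by_key.get(key, []) if lo <= e[0] <= now]
        return sorted(hits, key=itemgetter(0))


view = KeyedWindowView()
view.ingest("api", 1000.0, "deploys", {"note": "deploy finished"})
view.ingest("api", 1300.0, "alerts", {"name": "HighErrorRate"})
view.ingest("api", 1200.0, "metrics", {"error_rate": 0.07})
view.ingest("web", 1250.0, "logs", {"msg": "timeout"})
view.ingest("api", 100.0, "logs", {"msg": "old event"})

# One read for everything on "api" in the last 600 seconds.
recent = view.query("api", now=1500.0, window_s=600)
for ts, source, payload in recent:
    print(ts, source, payload)
```

The point of the shape is that the join by key and the ordering by time happen once in the data layer, so the agent receives a single ordered result instead of several payloads to reconcile.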

§ 04/triggers on real conditions

Fire on what is happening. Not on what PagerDuty noticed.

Most AI SRE setups wait for PagerDuty. By then it is already noisy. Define a trigger on any condition over the live stream, and NavFlow wakes your agent with the payload attached.

trigger · db_pool_saturation.json

    {
      "name": "db_pool_saturation",
      "source": { "view": "service_health" },
      "when": {
        "field": "used / max",
        "predicate": "> 0.85",
        "group_by": ["service"],
        "window": "5m"
      },
      "emit": "db_pool_saturation"
    }
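
One way to read this trigger is sketched below in Python. The semantics are an assumption: the sketch treats the condition as "mean of used / max per service over the last 5 minutes exceeds 0.85", which may differ from how NavFlow evaluates a window. The row shape (`service`, `ts`, `used`, `max`) and the function name are hypothetical.

```python
from collections import defaultdict


def evaluate_pool_saturation(rows, now, window_s=300, threshold=0.85):
    """Return one emitted event per service whose mean used/max ratio
    over the window exceeds the threshold (assumed semantics)."""
    ratios = defaultdict(list)
    for row in rows:
        in_window = now - window_s <= row["ts"] <= now
        if in_window and row["max"] > 0:
            ratios[row["service"]].append(row["used"] / row["max"])

    fired = []
    for service, values in sorted(ratios.items()):
        mean = sum(values) / len(values)
        if mean > threshold:
            fired.append({
                "emit": "db_pool_saturation",
                "service": service,
                "ratio": round(mean, 3),
            })
    return fired


# Hypothetical rows from a service_health-style view; ts is in seconds.
rows = [
    {"service": "api", "ts": 100, "used": 90, "max": 100},
    {"service": "api", "ts": 200, "used": 88, "max": 100},
    {"service": "billing", "ts": 150, "used": 40, "max": 100},
    {"service": "api", "ts": -500, "used": 10, "max": 100},  # outside the window
]

fired = evaluate_pool_saturation(rows, now=300)
print(fired)
```

Grouping by service means one saturated pool fires on its own, without waiting for a fleet-wide average to cross the line.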

trigger · deploy_regression.json

    {
      "name": "deploy_regression",
      "source": { "view": "cross_source_timeline" },
      "when": {
        "predicate": "errors > 50 within 10m of deploy",
        "group_by": ["service"]
      },
      "emit": "deploy_regression"
    }
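
The predicate above is a cross-source condition, so a small sketch may help. It assumes "errors > 50 within 10m of deploy" means more than 50 error events for a service in the 10 minutes following a deploy of that same service. That reading, the event shape (`service`, `ts`, `kind`), and the function name are hypothetical and not taken from NavFlow.

```python
def deploy_regressions(events, window_s=600, max_errors=50):
    """Flag deploys followed by more than max_errors errors for the same
    service within window_s seconds (assumed semantics)."""
    fired = []
    deploys = sorted(
        (e for e in events if e["kind"] == "deploy"), key=lambda e: e["ts"]
    )
    for d in deploys:
        count = sum(
            1
            for e in events
            if e["kind"] == "error"
            and e["service"] == d["service"]
            and d["ts"] <= e["ts"] <= d["ts"] + window_s
        )
        if count > max_errors:
            fired.append({
                "emit": "deploy_regression",
                "service": d["service"],
                "deploy_ts": d["ts"],
                "errors": count,
            })
    return fired


# Hypothetical timeline: both services deploy at t=1000 seconds.
events = [
    {"service": "api", "ts": 1000, "kind": "deploy"},
    {"service": "web", "ts": 1000, "kind": "deploy"},
]
events += [{"service": "api", "ts": 1000 + i, "kind": "error"} for i in range(1, 61)]
events += [{"service": "web", "ts": 1000 + i, "kind": "error"} for i in range(1, 6)]

fired = deploy_regressions(events)
print(fired)
```

Here only the `api` deploy fires: it is followed by 60 errors, while `web` sees 5.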

PagerDuty can still trigger your agent. It just becomes one signal of many, and the predictive ones fire before PagerDuty would.

§ 05/a catalog you both can query

Every source and view in one catalog. You browse it. Your agent introspects it.

agent calls catalog.list →

    { "entries": [
      { "handle": "source:otel_traces",
        "type": "event_stream",
        "freshness": { "lag_s": 0.4 } },
      { "handle": "view:service_health",
        "schema": ["service", "error_rate", "p99_ms"],
        "freshness": { "lag_s": 1.2 } },
      { "handle": "trigger:db_pool_saturation",
        "type": "trigger" }
    ] }

Lineage

Every derived view points back to its sources. view:service_health knows it reads from otel_metrics and otel_traces. Audit any path.

Freshness

The agent knows whether a view is current before it trusts it. No silently stale data.

Discoverability

A new agent joining the system reads the catalog and learns what exists. No tribal knowledge.
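
A freshness check on the agent side could look roughly like the sketch below. It reuses the response shape from the `catalog.list` example, with the schema written as a list of field names so the literal is valid Python. The `fresh_views` helper and the lag thresholds are hypothetical, and the policy of distrusting entries with no freshness data is an assumption.

```python
# Catalog response modeled on the catalog.list example above.
catalog = {"entries": [
    {"handle": "source:otel_traces",
     "type": "event_stream",
     "freshness": {"lag_s": 0.4}},
    {"handle": "view:service_health",
     "schema": ["service", "error_rate", "p99_ms"],
     "freshness": {"lag_s": 1.2}},
    {"handle": "trigger:db_pool_saturation",
     "type": "trigger"},
]}


def fresh_views(catalog, max_lag_s):
    """Return handles of views whose reported lag is within max_lag_s.
    Entries without freshness info are treated as not trustworthy."""
    handles = []
    for entry in catalog["entries"]:
        if not entry["handle"].startswith("view:"):
            continue
        lag = entry.get("freshness", {}).get("lag_s")
        if lag is not None and lag <= max_lag_s:
            handles.append(entry["handle"])
    return handles


print(fresh_views(catalog, max_lag_s=5.0))
print(fresh_views(catalog, max_lag_s=1.0))
```

With a 5-second budget the agent would read `view:service_health`. With a 1-second budget it would not, and could fall back to waiting or flagging the data as stale.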

§ 06/four problems this solves

Move the read path. This is what changes.

No context confusion, better answers

More context can make agents worse, not better: dump raw payloads in and the model gets lost in the noise. NavFlow serves keyed, windowed views, so your agent reads only what's relevant to the incident and spends its budget reasoning, not parsing.

Faster and cheaper

One MCP read replaces four to eight vendor calls per loop. Sub-second instead of 5 to 20 seconds, and the token spend on fetching drops sharply.

One query, any source

The query looks the same whether you read OTel traces, GitHub deploys, or chat logs. New source, same query shape. No new tool per source.

Deterministic, still exploratory

The data layer is deterministic: the same query returns the same result. Your agent’s reasoning loop stays as exploratory as you want.

the obvious objections

The questions a builder asks before changing anything.

More in the general FAQ, or email hello@navflow.ai.

Isn’t this just the cookbook with extra steps?
No. The cookbook keeps the data plane inside the agent’s tool calls. NavFlow makes the data plane its own layer. The agent stops orchestrating queries and goes back to reasoning. Your action tools and safety hooks do not move.

What happens to my Datadog / Honeycomb / Prometheus stack?
They become ingest sources. Your agent’s query interface stays the same when you swap a vendor or add one. Customer A on Prometheus and customer B on Honeycomb read through the identical query.

Do I have to adopt a specific store or queue?
No. NavFlow gives you the interface: ingest, query, trigger, served over MCP. The storage and streaming underneath are an operational detail, not something your agent code imports.
