
You can't trust what you can't see: How we keep an eye on a fleet of AI agents

Taylor Bloom, Senior Technical Program Manager

A little while ago, my colleague Maxime Gréau wrote a post with one of the most honest titles we've ever shipped: "This Shit is Hard: How AI keeps our code on standard." If you haven't read it, read it; it's the prequel to this one. The short version: we built a fleet of single-purpose AI agents that continuously drag a 500+ module monorepo back onto standard. They open pull requests by the thousands. The safe ones merge themselves. No human in the loop for the low-risk changes.

When I describe that to people outside the company, the reaction is almost always the same. Not "wow," but a slightly nervous "…and you trust that?"

It's the right question. And the honest answer is: we don't trust it because it's AI. We trust it because we can see it. Every token it burns, every model it calls, every eval it passes or fails, every individual decision it makes on a specific commit - all of it is on a screen, in front of a human, in seconds.

That visibility layer is a thing. It has a name. We call it Lens, and this is the post about how we watch the machines.

Autonomy without observability is just hope

Here's the failure mode I was determined to avoid.

You build an agent. It works in the demo. You give it the keys: let it open PRs and merge the safe ones. For a week, it's magic. Then one Tuesday, the eval scores quietly drift, or a model version bumps under you, or one agent starts retrying a doomed fix in a loop and burning tokens like a space heater. Nobody notices, because the only evidence is scattered across five different tools: the PR is in GitHub, the cost is in a GCP billing export, the trace is in BigQuery, the incident is in Linear, and the "huh, that's weird" is in somebody's Slack DMs.

By the time the signal assembles itself into something a human can act on, you've either spent a lot of money or merged something you shouldn't have. Autonomy without observability is hope with a budget.

The thing about a fleet of agents - reviewer, skillfixer, autofixer, loganalyzer, judge - all running continuously is that the questions you need answered aren't hard. They're just questions nobody can answer fast enough:

  • Which agent is spending the most, and on what?

  • Are the evals getting better or worse since we shipped the new version?

  • This PR comment says the agent failed. Why, exactly, on this commit?

  • Is this month's bill a blip or a trend?

None of those requires a data scientist. They require a place to look. We didn't have one. So we built one.

One lens

Lens is Chainguard's home for internal dashboards: fast, purpose-built views across our dev tools and workflows. It's a single hub (one login, one URL) with a couple dozen dashboards mounted under it, each one a focused answer to a specific question somebody actually asks. I built that hub - the login, the caching, the shared data layer every dashboard pulls from - to be the boring, reliable floor the rest of this stands on. None of what follows works without it.

And the vision behind it was simple, and a little stubborn: one place. I wasn't building a place for the agent team, or for leadership, or for whoever happens to have BigQuery access and knows the right query. I was building one place where anyone at Chainguard can log in and, in seconds, see what's actually happening. What the agents are doing right now. What they did overnight. What it cost, whether it worked, and where it got stuck. Visibility into our own systems shouldn't be a privilege you earn by knowing which dashboard to bookmark. It should be the default, and the same view for everyone. That's the whole idea: take the truth that was scattered across a dozen tools and behind a dozen permissions, and put it on one screen the entire company can open.

Some of those dashboards point to humans and processes: engineering project health, active incidents, quarterly planning, post-mortem action items, and PRs waiting on your review. But the ones I want to talk about here are the ones that point at the fleet. These are the dashboards that took our fleet of agents from "trust me, it works" to "here, look."

The trick that makes Lens useful is that the evidence stops being scattered. The trace, the cost, the eval, the failure, the trend - they live in the same place, behind the same login, and they link to each other. When the answer to "is this safe?" lives in one tab instead of five tools, that question stops being scary.

Figure 1: Five disconnected tools (GitHub, GCP billing, BigQuery, Linear, Slack DMs) converging into one Lens hub.

Seeing the fleet think: LensAgentTraces

The heart of all this is a dashboard called LensAgentTraces, which is owned by our developer platform team.

The agents run on DriftlessAF, our open source agentic reconciler framework, and every run emits a trace: which agent, which model, how many input and output tokens, what it cost, how long it took, whether it threw. Those traces land in BigQuery. LensAgentTraces turns them into something you can read at a glance, across five tabs:

  • Overview: The top-line numbers: total traces, tokens in and out, total cost, error rate. The "is anything on fire right now?" view.

  • By Agent: The same numbers, broken out per agent. This is how you catch the one agent quietly costing 10x its neighbors, or the one whose error rate crept up after a change.

  • By Model: Performance and spend sliced by model. When you're running a mix of Claude and Gemini versions across a fleet, "which model is actually pulling its weight per dollar?" is a question you want answered with a chart.

  • Failures & Quality: Exceptions and failure patterns surfaced instead of being buried. The agents fail sometimes; that's fine. Failing silently is not fine. This tab makes failure loud.

  • Explorer: The drill-down. Pick a single trace and see the whole thing: the request, the response, the token counts, the cost, the latency, the timestamp.
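To make the By Agent breakdown concrete, here is a minimal sketch of the kind of rollup such a tab implies: group trace records by agent, then total the tokens, cost, and error rate. The record shape and field names are assumptions for illustration, not the actual DriftlessAF or BigQuery schema.

```typescript
// Hypothetical trace record; field names are illustrative, not the real schema.
interface AgentTrace {
  agent: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  failed: boolean;
}

interface AgentRollup {
  traces: number;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  failures: number;
  errorRate: number; // failures divided by traces
}

// Group traces by agent and compute the top-line numbers for each one.
function rollupByAgent(traces: AgentTrace[]): Map<string, AgentRollup> {
  const out = new Map<string, AgentRollup>();
  for (const t of traces) {
    const r = out.get(t.agent) ?? {
      traces: 0,
      inputTokens: 0,
      outputTokens: 0,
      costUsd: 0,
      failures: 0,
      errorRate: 0,
    };
    r.traces += 1;
    r.inputTokens += t.inputTokens;
    r.outputTokens += t.outputTokens;
    r.costUsd += t.costUsd;
    if (t.failed) r.failures += 1;
    r.errorRate = r.failures / r.traces;
    out.set(t.agent, r);
  }
  return out;
}
```

Sorting the resulting entries by costUsd or errorRate is enough to surface the one agent that costs far more than its neighbors.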

That last tab is where observability stops being a poster and becomes a tool, thanks to one small feature I'm disproportionately proud of: deep links.

When an agent fails a PR, it leaves a comment containing a link to the Explorer tab, pre-loaded with the exact trace. The URL carries three query parameters: env, trace, and day. The env tells Lens which environment to read; the trace ID lands you on the failing run; the day prunes the BigQuery partition so the lookup is fast instead of scanning a month of data. You go from "the agent left a cryptic comment on my PR" to "I'm staring at the exact reasoning that produced it" in one click. No copying trace IDs. No guessing which environment. No BigQuery console.
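A deep link like this is simple to build and to validate. The sketch below is one way it could look; the base URL, the YYYY-MM-DD day format, and the function names are assumptions for illustration, and only the env, trace, and day parameter names come from the description above.

```typescript
type LensEnv = "prod" | "staging" | "presubmit";

interface TraceLink {
  env: LensEnv;
  trace: string;
  day: string; // assumed to be YYYY-MM-DD, matching a daily partition
}

// Hypothetical base URL; the real Explorer route is not given in this post.
const EXPLORER_URL = "https://lens.example.internal/agent-traces/explorer";

// Build the link an agent could paste into its PR comment.
function buildTraceLink(link: TraceLink): string {
  const query = [
    `env=${encodeURIComponent(link.env)}`,
    `trace=${encodeURIComponent(link.trace)}`,
    `day=${encodeURIComponent(link.day)}`,
  ].join("&");
  return `${EXPLORER_URL}?${query}`;
}

// Parse and validate the parameters before they are used in any query.
function parseTraceLink(url: string): TraceLink {
  const query = url.split("?")[1] ?? "";
  const params = new Map<string, string>();
  for (const pair of query.split("&")) {
    const eq = pair.indexOf("=");
    if (eq > 0) {
      params.set(pair.slice(0, eq), decodeURIComponent(pair.slice(eq + 1)));
    }
  }
  const env = params.get("env");
  const trace = params.get("trace");
  const day = params.get("day");
  if (env !== "prod" && env !== "staging" && env !== "presubmit") {
    throw new Error(`unknown env: ${env}`);
  }
  if (!trace) throw new Error("missing trace id");
  if (!day || !/^\d{4}-\d{2}-\d{2}$/.test(day)) {
    throw new Error(`bad day: ${day}`);
  }
  // The validated day would then be passed as a partition filter so the
  // lookup only touches one day's worth of data.
  return { env, trace, day };
}
```

Validating env against a fixed set and day against a strict pattern means a hand-edited link fails with a clear error instead of triggering a wide scan.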

And it reads from three environments behind a single toggle - prod, staging, and presubmit (the checks that run before a change ever lands) - so the same dashboard answers "how's the fleet doing in production?" and "is this change about to make things worse?" before it ever ships. The presubmit view is wired to the same trace data that CI generates, so you can watch a not-yet-merged change's agent behavior with the exact tooling you'd use in prod.

Watching the pipeline in flight: LensOrchestrator

Traces are an autopsy. Sometimes you don't want to know why a run died yesterday. You want to know what's happening right now: what's queued, what's mid-flight, what's quietly stuck.

LensOrchestrator - one I built - is the live view. It's a kanban of agent orchestrations moving through the pipeline, cards grouped by phase (setup, executing, blocked, done) with a freshness indicator that flags any orchestration that's gone quiet, so a run that stalled doesn't just sit there invisibly. Open one and its full history color-codes every transition by who made it: the materializer (the process that turns an agent's decision into an actual pull request), another bot, or a human.

My favorite part is a reconciler called Lookout. Lookout is a heartbeat that does nothing but audit stranded orchestrations. A parent that got stuck with no open PR, a child that hit a terminal state but never got picked back up, a step that quietly missed its cue: Lookout finds them and either re-enqueues them or flags them for a human. The watcher has a watcher.
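As a rough illustration of what an audit pass like Lookout's might decide, here is a sketch that classifies one orchestration at a time. The record shape, the staleness threshold, and the rule for when to re-enqueue versus flag for a human are all assumptions; the post only says that Lookout does one or the other.

```typescript
type Phase = "setup" | "executing" | "blocked" | "done";
type AuditAction = "ok" | "re-enqueue" | "flag-for-human";

// Hypothetical orchestration record; not the real Lens data model.
interface Orchestration {
  id: string;
  phase: Phase;
  lastTransitionMs: number; // epoch millis of the most recent transition
  hasOpenPr: boolean;
  parentId?: string; // set for child orchestrations
  pickedUpByParent: boolean; // whether a finished child was resumed upstream
}

// Decide what to do with one orchestration.
// Assumed rules:
//  - a child in a terminal state that was never picked up is re-enqueued
//  - a parent that is unfinished, has no open PR, and has gone quiet is
//    flagged for a human
function audit(o: Orchestration, nowMs: number, staleAfterMs: number): AuditAction {
  const isChild = o.parentId !== undefined;
  if (isChild && o.phase === "done" && !o.pickedUpByParent) {
    return "re-enqueue";
  }
  const quiet = nowMs - o.lastTransitionMs > staleAfterMs;
  if (!isChild && o.phase !== "done" && !o.hasOpenPr && quiet) {
    return "flag-for-human";
  }
  return "ok";
}

// Run the audit across the fleet and keep only the items needing action.
function findStranded(
  all: Orchestration[],
  nowMs: number,
  staleAfterMs: number,
): Array<{ id: string; action: AuditAction }> {
  return all
    .map((o) => ({ id: o.id, action: audit(o, nowMs, staleAfterMs) }))
    .filter((r) => r.action !== "ok");
}
```

The same quiet check could drive the kanban's freshness indicator, since both ask how long it has been since the last transition.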

Grading the graders: LensAgentEvals

Traces tell you what the agents did. They don't tell you whether what they did was any good. For that, you need evals, and you need to watch the evals the same way you watch everything else.

This is the part people underestimate. The skills the agents apply are code that is versioned, reviewed, and eval-tested, not vibes-based prompts. But an eval suite is only as useful as your ability to notice when it moves. A score that silently slips from 0.9 to 0.7 over three weeks is exactly the kind of slow leak that sinks a system built on trust.

So there's a second dashboard, LensAgentEvals, that does for quality what LensAgentTraces does for cost and behavior. It reads eval results out of BigQuery and lets you slice them every way that matters:

  • By Agent: Which agents are scoring well, and which are sliding?

  • By Scorer: Different scorers grade different properties; this is how you see which dimension of quality moved, not just that the average did.

  • By Model: The same eval, across models. Essential when you're deciding whether a model upgrade is actually an upgrade.

  • Cases: Individual eval cases, with inputs, outputs, and scores. The place you go when an aggregate looks wrong, and you need to see the specific example that broke it.

  • Compare: Side-by-side benchmarking, so you can put a candidate version next to the incumbent before you promote it.

That compare-before-you-promote workflow is the whole point. We don't ship a new agent version because the demo looked good. We ship it because we put its eval scores next to the current version's, on the same screen, and the new one won. The graders get graded, and the grading is visible to everyone.
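The comparison itself can be sketched in a few lines: average each scorer's results for the incumbent and the candidate, then look at the per-scorer delta. The result shape and the promotion rule (no scorer may regress by more than a tolerance, and no scorer may be missing) are assumptions made for this example, not the actual gate.

```typescript
// Hypothetical eval result: one score in [0, 1] from one scorer on one case.
interface EvalResult {
  scorer: string;
  score: number;
}

interface ScorerDelta {
  scorer: string;
  incumbent: number;
  candidate: number | null; // null when the candidate has no results
  delta: number | null;
}

// Mean score per scorer.
function meanByScorer(results: EvalResult[]): Map<string, number> {
  const sums = new Map<string, { total: number; n: number }>();
  for (const r of results) {
    const s = sums.get(r.scorer) ?? { total: 0, n: 0 };
    s.total += r.score;
    s.n += 1;
    sums.set(r.scorer, s);
  }
  const means = new Map<string, number>();
  sums.forEach((s, scorer) => {
    means.set(scorer, s.total / s.n);
  });
  return means;
}

// Put the candidate next to the incumbent, scorer by scorer.
// Assumed rule: promote only if no scorer regresses by more than
// maxRegression and the candidate was graded by every incumbent scorer.
function compareVersions(
  incumbent: EvalResult[],
  candidate: EvalResult[],
  maxRegression = 0,
): { deltas: ScorerDelta[]; promote: boolean } {
  const base = meanByScorer(incumbent);
  const next = meanByScorer(candidate);
  const deltas: ScorerDelta[] = [];
  let promote = true;
  base.forEach((incumbentMean, scorer) => {
    const candidateMean = next.get(scorer);
    if (candidateMean === undefined) {
      deltas.push({ scorer, incumbent: incumbentMean, candidate: null, delta: null });
      promote = false;
      return;
    }
    const delta = candidateMean - incumbentMean;
    if (delta < -maxRegression) promote = false;
    deltas.push({ scorer, incumbent: incumbentMean, candidate: candidateMean, delta });
  });
  return { deltas, promote };
}
```

Keeping the deltas per scorer, rather than one blended average, is what lets you see which dimension of quality moved.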

Following the money: LensCloudCosts

Here's the unglamorous truth: a fleet of autonomous agents has a bill, and that bill has your name on it.

Every token is a fraction of a cent, and a fleet running continuously across hundreds of modules turns fractions of cents into a real line item fast. If you can't see the spend broken down by who, what, and where, you find out about it the way everyone dreads: at the end of the month, in aggregate, with no way to attribute it.

LensCloudCosts is the dashboard that keeps that honest. It pulls our exported GCP billing data and turns AI spend into something you can actually reason about: cost by engineer, by model, by service, with a Sankey diagram that traces the flow from project to service to person, a spend trend over time, and per-engineer breakdowns. You can export it to CSV or PNG when finance asks.
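The breakdowns and the Sankey diagram both come down to grouping the same rows in different ways. Here is a minimal sketch, assuming each billing row has already been attributed to a project, service, engineer, and model; that row shape is hypothetical, and the real billing export will look different.

```typescript
// Hypothetical, already-attributed spend row; not the raw billing export schema.
interface SpendRow {
  project: string;
  service: string;
  engineer: string;
  model: string;
  costUsd: number;
}

type Dimension = "project" | "service" | "engineer" | "model";

// Total spend grouped by one dimension (cost by engineer, by model, ...).
function spendBy(rows: SpendRow[], dim: Dimension): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row[dim], (totals.get(row[dim]) ?? 0) + row.costUsd);
  }
  return totals;
}

interface SankeyLink {
  source: string;
  target: string;
  value: number;
}

// Weighted links for a project -> service -> engineer Sankey diagram.
// Assumes node names are unique across the three layers.
function sankeyLinks(rows: SpendRow[]): SankeyLink[] {
  const byEdge = new Map<string, SankeyLink>();
  const add = (source: string, target: string, value: number): void => {
    const key = JSON.stringify([source, target]);
    const link = byEdge.get(key) ?? { source, target, value: 0 };
    link.value += value;
    byEdge.set(key, link);
  };
  for (const row of rows) {
    add(row.project, row.service, row.costUsd);
    add(row.service, row.engineer, row.costUsd);
  }
  const links: SankeyLink[] = [];
  byEdge.forEach((link) => links.push(link));
  return links;
}
```

Because every row contributes the same cost to both hops, the flow into each service node equals the flow out of it, which is the property a Sankey diagram needs to look right.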

This deliberately lives in the same hub as the traces and evals. When you can see, on adjacent tabs, that an agent's cost just spiked and its eval scores just dropped, you have a story. Cost without quality context is just a number to panic about. You have to bring them together.

Closing the loop: LensAutomatedMerge

We’ve claimed before that the safe PRs merge themselves with no human in the loop. We built a dashboard for that, too: LensAutomatedMerge.

LensAutomatedMerge tracks the auto-merge program against its goals: how much of the fleet's output is actually merging automatically, and, more usefully, where the coverage gaps are. Which repos or change-types still need a human, and why? That gap analysis is how an auto-merge program grows safely: you don't widen the blast radius on a hunch; you do so where the data already says the risk is low. It's the difference between "we auto-merge stuff" and "here's exactly how much, and here's the next safe place to expand."

And there's a companion. The same fleet that opens PRs now has an agent that reviews them - a read-only reviewer that reads a change, reasons about it, and leaves inline findings, never touching the code itself. The dashboard answers the question that sits underneath the whole auto-merge program: Can you trust the reviewer enough to let it gate a merge? It grades the reviewer the way we grade every other agent: findings per change, how often it agrees with a human, whether it holds its severity calibration instead of crying wolf, and - the part I find most honest - its red-team catch-rate: the adversarial changes it's supposed to flag, and whether it actually did. A reviewer you can't audit is just a rubber stamp with extra steps. This one gets watched like everything else.
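Three of those reviewer metrics reduce to simple ratios. The sketch below shows how they could be computed from a list of reviewed changes; the record shape and verdict labels are assumptions for illustration, and severity calibration is left out because it depends on how severities are defined.

```typescript
type Verdict = "approve" | "flag";

// Hypothetical record of one change the reviewer agent looked at.
interface ReviewedChange {
  findings: number; // inline findings the agent left
  agentVerdict: Verdict;
  humanVerdict?: Verdict; // present only when a human also reviewed it
  redTeam: boolean; // true for seeded adversarial changes
}

interface ReviewerScorecard {
  findingsPerChange: number | null;
  humanAgreementRate: number | null;
  redTeamCatchRate: number | null;
}

// Return null rather than NaN when there is nothing to divide by.
function ratio(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

function gradeReviewer(changes: ReviewedChange[]): ReviewerScorecard {
  let findings = 0;
  let withHuman = 0;
  let agreed = 0;
  let redTeam = 0;
  let caught = 0;
  for (const c of changes) {
    findings += c.findings;
    if (c.humanVerdict !== undefined) {
      withHuman += 1;
      if (c.humanVerdict === c.agentVerdict) agreed += 1;
    }
    if (c.redTeam) {
      redTeam += 1;
      if (c.agentVerdict === "flag") caught += 1;
    }
  }
  return {
    findingsPerChange: ratio(findings, changes.length),
    humanAgreementRate: ratio(agreed, withHuman),
    redTeamCatchRate: ratio(caught, redTeam),
  };
}
```

Reporting null instead of zero when there are no red-team changes in the window keeps "not measured" from being mistaken for "caught nothing."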

The same lens points at humans, too

I've focused on the fleet, but it's worth noting: the agents aren't the only things worth watching, and Lens doesn't just watch machines.

The same hub hosts the dashboards our engineering org runs on day to day - a weekly executive view with a health score and burndown, active incident tracking across teams, quarterly planning with cross-team dependency graphs and staffing load, the post-mortem action items nobody wants to lose track of, and the simple, beloved "which PRs are waiting on me" review queue. There's even a project-tracking view (LensLinear) for the roadmap of building the agents, because the work of shipping autonomy is itself a project worth watching: health, not just hardware.

That's the whole vision. The reason the agent dashboards feel natural rather than exotic is that, at Chainguard, watching the work, whether a person or a process does it, is how we operate. And it all lives in one hub, open to the whole company: the same login that shows a leader the weekly health score shows an engineer exactly what the agents shipped overnight. One spot, one source of truth, for everyone. The agents earned a seat on the same dashboard where everything else lives. They don't get a pass on visibility because they're AI, and nobody has to go hunting across tools to find out what they did.

Any engineer can add a view

The last thing that makes Lens work is a feature of its architecture. Observability dies when adding a new view requires a ticket, a meeting, and a sprint. So I built it to be cheap.

I designed Lens as a set of modular dashboard packages: a React + Vite frontend per dashboard, a shared Express hub I built to handle auth, caching, and data APIs, and a common library of design tokens and utilities to ensure everything looks and behaves consistently. Adding a brand-new dashboard mounted at its own URL is genuinely a four-touch change: scaffold the package by copying a small existing one, wire it into the dev workspace, add a few lines to have the build pack it into the container image, and register a route on the hub. If you need data, you add an authenticated endpoint to the hub server.
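The "register a route on the hub" step can be pictured as adding one entry to a table of mount paths. The sketch below models that table in plain TypeScript rather than Express so it runs on its own; the class, field names, and paths are hypothetical and are not the hub's real code.

```typescript
// Hypothetical description of one dashboard package.
interface DashboardPackage {
  name: string; // e.g. "LensAgentTraces"
  mountPath: string; // URL prefix the hub serves the dashboard under
  bundleDir: string; // where the build placed the frontend bundle in the image
}

class HubRegistry {
  private readonly packages: DashboardPackage[] = [];

  // Registering a dashboard is a single call; duplicate mounts are rejected
  // so two dashboards can never fight over the same URL.
  register(pkg: DashboardPackage): void {
    if (!pkg.mountPath.startsWith("/")) {
      throw new Error(`mount path must start with "/": ${pkg.mountPath}`);
    }
    if (this.packages.some((p) => p.mountPath === pkg.mountPath)) {
      throw new Error(`mount path already registered: ${pkg.mountPath}`);
    }
    this.packages.push(pkg);
  }

  // Resolve a request path to the dashboard with the longest matching mount.
  resolve(requestPath: string): DashboardPackage | undefined {
    let best: DashboardPackage | undefined;
    for (const pkg of this.packages) {
      const matches =
        requestPath === pkg.mountPath || requestPath.startsWith(pkg.mountPath + "/");
      if (matches && (best === undefined || pkg.mountPath.length > best.mountPath.length)) {
        best = pkg;
      }
    }
    return best;
  }
}
```

In a real Express hub the equivalent would presumably be mounting a static handler or router at the dashboard's path; the registry here only illustrates why the change stays small.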

Figure 2: Lens architecture: React dashboards are served by a shared Express hub that handles auth, OAuth passthrough, caching, and queries. Using your OAuth token, it securely accesses GCP (BigQuery, Billing) and SaaS (Linear, GitHub API) data.

That's the difference between observability as a project and observability as a habit. The proof is in who actually owns the dashboards. Maxime's agent traces and evals, sure - but also Massimiliano Giovagnoli's view into the manifest-gen bot's PR lifecycle and quality, Nicholas Skaggs's dashboards for platform-scaling and GitHub API consumption (because a fleet of bots hammering the GitHub API has a budget too), and the security team's vulnerability findings and roadmap views from Aidan Manley and Jorge Lucangeli. Nobody filed a ticket asking for these. The engineer who needed the answer built the view - which is exactly why the dashboards that matter most belong to the person who needed them, not to a central team that owns "dashboards."

Under the hood, honestly

A few engineering notes, because the unglamorous parts are where the trust actually comes from.

The agent dashboards query BigQuery on your behalf, using your own OAuth token, so you only ever see data you're allowed to see, and the hub transparently refreshes that token when it expires, so you never get bounced to a re-auth screen mid-investigation. The view names are resolved through a strict allowlist rather than string interpolation, because "user-facing dashboard" plus "raw query parameters" plus "no allowlist" is how you get an SQL injection in a security company's blog post for all the wrong reasons. And when two people hit the same expensive endpoint at the same time, the hub coalesces the requests into a single request. The second viewer joins the in-flight query instead of doubling the load on Linear or BigQuery.
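Two of those ideas are small enough to sketch: the view-name allowlist and request coalescing. The view keys, view names, and function names below are hypothetical; the techniques are the general ones, not a copy of the hub's code.

```typescript
// Hypothetical allowlist: public view keys mapped to fixed view names.
// User input only ever selects a key; it is never spliced into SQL.
const VIEW_ALLOWLIST = new Map<string, string>([
  ["traces", "agent_traces_view"],
  ["evals", "agent_evals_view"],
]);

function resolveView(key: string): string {
  const view = VIEW_ALLOWLIST.get(key);
  if (view === undefined) {
    throw new Error(`unknown view: ${key}`);
  }
  return view;
}

// In-flight requests keyed by a cache key (for example, endpoint + params).
const inFlight = new Map<string, Promise<unknown>>();

// If a request with the same key is already running, join it instead of
// starting a second one. The entry is removed once the request settles,
// so later callers trigger a fresh fetch.
function coalesce<T>(key: string, run: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing !== undefined) {
    return existing as Promise<T>;
  }
  const pending = run().then(
    (value) => {
      inFlight.delete(key);
      return value;
    },
    (error) => {
      inFlight.delete(key);
      throw error;
    },
  );
  inFlight.set(key, pending);
  return pending;
}
```

Since each user queries with their own token, a real cache key would likely need to include something about the caller or their permissions, so one viewer never joins a query that ran under another viewer's access.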

Where it's going

The pattern underneath Lens is bigger than any one dashboard. It's this: the harder a system is to trust, the more it needs to be seen. Autonomous agents are the hardest to trust, which is exactly why they're the most closely watched.

As the fleet grows - more agents, more models, more of the codebase under continuous, automated care - the lens widens with it. New agents get their traces and evals on the dashboard the day they ship, not the quarter after something goes wrong. New questions become new tabs in an afternoon. The goal is that nobody at Chainguard ever has to take an agent's word for it.

We built a fleet of agents to keep our code in line with standards. I built Lens so the whole company could look them dead in the eye while they do it. Because you can't trust what you can't see, and now anyone at Chainguard can see all of it, in one place.

If you're curious about how our agents work, how we keep them honest, or how Chainguard's solutions can help your team, reach out. We love talking about this stuff.
