What failed and what went undetected
If you've ever run an agent in production, you've probably noticed how fragile it is. A small, seemingly insignificant change to its system prompt can break its behavior and tank quality.
We changed three words in one YAML field of the agent's system prompt. Two of our four evaluators flipped from PASS to FAIL. From the outside, nothing looked wrong. The agent answered every question, latency stayed flat, and none of our SRE dashboards would have caught it.
That's the easy case. A config change you made yourself, that you can trace because you wrote the diff. But what if the upstream MCP server bumps its version under you? What then? Agentic AI has a lot of blind spots, and a lot of surfaces still waiting on a standard.
Anthropic's April 2026 postmortem is a useful read for the scale of this. Three separate infrastructure bugs degrading Claude's responses for weeks, and even an org like Anthropic needed a long incident to track them down. The rest of us are not in better shape.
We used kagent, deployed in our lab, with agentevals wired in as the evaluation backend for the experiments in this post. We wrote a handful of deterministic Python evaluators, no LLM judge in the loop. Our first instinct was the obvious one: hash the agent's configuration, alert on any change. That wasn't enough, and figuring out why is most of what this post is about.
So this post isn't about hashing alone. It's about what to hash, and how to know whether the agent is still the agent you deployed.
The argument is for two artifacts. A Profile is a continuous series of deterministic evaluations against a baseline configuration, so you can notice when the agent's behavior has drifted. A Fingerprint is a content-addressed identifier for that configuration, layered so that vendor-side noise doesn't invalidate the Profile on every reconcile.
We think this is the right minimum surface. Small enough to ship, large enough to catch the failures Phase 1 missed.
Phase 1 was the prototype inside kagent-controller. We shipped an AgentProfile CRD, exposed it at /agents/{ns}/{name}/.well-known/agent-profile.json, and put an ingest webhook on the other side. There's an EvalProvider Go interface too, with AgentevalsProvider as the first implementation behind it. Plumbing. Necessary, not interesting.
Phase 2 is what this post is actually arguing for. Once the wiring worked, it was clear the contract between kagent and the evaluator can't stay implicit. That means an AgentFingerprint, an EvalProvider promoted from a Go interface to a CRD that declares its own capabilities, and a real {Provenance, Snapshot} contract sitting between kagent and whoever is evaluating it.
Phase 1 showed that kagent can hand a profile out and take one back. Phase 2 is about whether the profile is worth anything once it's been handed over.
What HTTP 200s don't tell you
The changes described in Anthropic's postmortem are a useful test case, because they show the limits of the signals most platform teams already rely on.
Take them one by one:
- The default reasoning_effort lowered from high to medium: this could be identified easily with standard SRE practices.
- A caching bug that broke long-idle sessions: this would require dedicated session-lineage tracking. A standard platform dashboard would likely miss it.
- A system-prompt instruction that capped how much the agent could say between tool calls: this one is the most interesting. The regression was not in what the agent returned, but in what it didn't. It skipped work, produced a shorter answer, and still returned a successful response. A histogram will not catch that. An eval might.
Figure 1: Three regressions from Anthropic's postmortem, mapped against classical platform metrics. Two of the three are invisible to anything most teams have wired up today. Only an eval against a baseline catches all three.
kagent already emits a set of gen_ai.* operation attributes on the agent span. They identify the operation, not its quality. The OpenTelemetry gen_ai.evaluation.result event (introduced in semconv v1.38.0, still Development status as of v1.41.1) is the signal the category is converging on for quality. Neither kagent nor agentevals emits it today.
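For concreteness, here is a minimal sketch of what one deterministic verdict might look like when shaped as a gen_ai.evaluation.result event. It builds a plain dictionary instead of calling an OpenTelemetry SDK. The attribute names reflect our reading of the Development-status conventions and should be checked against the current spec. The helper name evaluation_event and the "event_name"/"attributes" envelope are hypothetical.

```python
from typing import Any, Dict


def evaluation_event(name: str, passed: bool, score: float,
                     explanation: str = "") -> Dict[str, Any]:
    """Build a dict shaped like a gen_ai.evaluation.result event.

    Assumption: the attribute keys below follow the Development-status
    GenAI semantic conventions and may be renamed before they stabilize.
    """
    attributes: Dict[str, Any] = {
        "gen_ai.evaluation.name": name,
        "gen_ai.evaluation.score.value": score,
        "gen_ai.evaluation.score.label": "pass" if passed else "fail",
    }
    if explanation:
        attributes["gen_ai.evaluation.explanation"] = explanation
    # The envelope keys are illustrative, not an SDK data structure.
    return {"event_name": "gen_ai.evaluation.result", "attributes": attributes}


event = evaluation_event(
    "ends_with_token", passed=False, score=0.0,
    explanation="response did not end with the expected token",
)
```

The point of the shape is that a quality verdict travels next to the operation attributes the span already carries, so a successful response and a failed eval can be read side by side.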
That leaves us with two missing pieces. The first is a behavioral signal that survives a successful HTTP response: a profile made of evals, run repeatedly against a baseline. The second is a content-addressed identifier for the configuration itself: a fingerprint that can tell us something changed before the eval signal even has a chance to fire.
An agent is not a microservice
The fastest way to see why "an agent" is more than a model and a prompt is to dump one from a live cluster.
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: k8s-agent
  namespace: kagent
spec:
  type: Declarative
  declarative:
    modelConfig: default-model-config
    systemMessage: |
      # Kubernetes AI Agent System Prompt
      You are KubeAssist, ...
    tools:
      - type: McpServer
        mcpServer: {name: kagent-tool-server, apiGroup: kagent.dev}
    a2aConfig:
      skills:
        - id: cluster-diagnostics
        - id: security-audit
It's only fair to compare an agent with a classic microservice.
Figure 2: A classic microservice exposes two surfaces a continuous-integration pipeline can pin - the image digest and a configuration object. A modern LLM agent exposes at least six, each with its own change clock and its own failure mode.
This is where we introduce the first artifact: AgentProfile.
apiVersion: kagent.dev/v1alpha2
kind: AgentProfile
metadata: {name: k8s-agent-profile, namespace: kagent}
spec:
  agentRef: {name: k8s-agent}
  evalSourceRef:
    providerName: agentevals
    evalSetRef: {kind: ConfigMap, name: default-eval-set}
  fingerprint:
    sum: "sha256:..."
    behaviorCritical: {...}
    runtime: {...}
status:
  health: Healthy
  snapshot: {capability: {...}, surface: {...}, stats: {...}, safety: {...}}
The AgentProfile CRD should be a deliberately stable contract: stable for downstream tooling (auditors, admission controllers, dashboards) that needs to read it without redeploying on every kagent minor release.
Eval, Profile, Fingerprint, EvalProvider
It's time to pin down clear definitions before the terms start fighting each other.
Eval. One measurement of an agent's behavior on a single task, producing a structured result: PASS/FAIL, score, histogram of calls, latency, refusal. An eval is (evaluator, trace, config) -> RunResult. One eval is a number with provenance. On its own, it can't tell you whether the agent changed.
Profile. A series of Evals at a single, unchanged agent configuration.
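To make the two definitions concrete, here is a minimal sketch of how an Eval result and a Profile could be modeled. The field names (trace_id, config_sum, and so on) are hypothetical and are not kagent's or agentevals' types. The one property it tries to show is that a Profile refuses results produced under a different configuration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class RunResult:
    """One eval: a structured measurement with provenance."""
    evaluator: str    # e.g. "ends_with_token"
    trace_id: str     # the trace that was evaluated
    config_sum: str   # identifier of the config that produced the trace
    passed: bool
    score: float


@dataclass
class Profile:
    """A series of evals, all taken at one unchanged configuration."""
    config_sum: str
    results: List[RunResult] = field(default_factory=list)

    def add(self, result: RunResult) -> None:
        # Mixing configurations would make the series useless as a baseline.
        if result.config_sum != self.config_sum:
            raise ValueError("result was produced by a different configuration")
        self.results.append(result)

    def pass_rate(self, evaluator: str) -> float:
        scoped = [r for r in self.results if r.evaluator == evaluator]
        if not scoped:
            raise ValueError(f"no results for evaluator {evaluator!r}")
        return sum(r.passed for r in scoped) / len(scoped)


profile = Profile(config_sum="sha256:aaa")
profile.add(RunResult("ends_with_token", "t1", "sha256:aaa", True, 1.0))
profile.add(RunResult("ends_with_token", "t2", "sha256:aaa", False, 0.0))
```

A single RunResult is the "number with provenance". Only the series, held at a fixed config_sum, gives a baseline to compare against.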
Fingerprint. A content-addressed identifier for the configuration that produced a Profile. We split it into layers so vendor-side noise doesn't invalidate Profiles by accident. We'll come back to that in the BehaviorCritical, Runtime, Observed section.
spec:
  fingerprint:
    sum: "sha256:..."
    behaviorCritical:
      modelConfig: "sha256:..."
      systemPromptHash: "sha256:..."
      toolset: [{name: ..., schemaHash: ...}]
      mcpServers: [{name: ..., resolvedSchemaHash: ...}]
    runtime:
      runtimeImageDigest: "sha256:..."
      otelExporterConfigHash: "..."
      legacyConfigHash: "kagent.dev/config-hash" # the annotation kagent used in the old days
EvalProvider. A pluggable abstraction over evaluation backends, with Solo.io's agentevals as the first implementation. The abstraction lets other backends plug in.
We identified two failure modes from this split. Drift is distributional and continuous: the Profile moves relative to the baseline while the Fingerprint stays unchanged, or the Fingerprint moves while the Profile holds. Invalidation is binary and happens at admission: when the Fingerprint Sum changes, the Profile stays dead until the next re-eval. Neither a model digest nor a git commit can replace this. An agent is a graph (see Figure 2). A single hash misses every edge.
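The difference between the two failure modes can be sketched as a small classifier. This is a simplification under stated assumptions: it covers only the first drift case (the Profile moves while the Fingerprint stays unchanged), it reduces a Profile to a single pass rate, and the tolerance default of 0.05 is a placeholder, not a recommendation.

```python
from enum import Enum


class Signal(Enum):
    STABLE = "stable"
    DRIFT = "drift"
    INVALIDATED = "invalidated"


def classify(baseline_sum: str, current_sum: str,
             baseline_pass_rate: float, current_pass_rate: float,
             tolerance: float = 0.05) -> Signal:
    """Separate binary invalidation from distributional drift."""
    # Invalidation is binary: any change to the Fingerprint Sum.
    if current_sum != baseline_sum:
        return Signal.INVALIDATED
    # Drift is distributional: same Fingerprint, different behavior.
    if abs(current_pass_rate - baseline_pass_rate) > tolerance:
        return Signal.DRIFT
    return Signal.STABLE


same_config_worse_behavior = classify("sha256:aaa", "sha256:aaa", 1.0, 0.5)
new_config = classify("sha256:aaa", "sha256:bbb", 1.0, 1.0)
```

Here same_config_worse_behavior is Signal.DRIFT and new_config is Signal.INVALIDATED, even though the second agent's pass rate did not move at all.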
Reverse-reference invariant. The proposal doesn't extend Agent. Nothing in Agent.spec knows about Profiles. A Profile can be reconciled, signed, attested, or queried without forcing every existing Agent CR to grow a field.
What this catches, and what it doesn't
Before we get into the detection experiment and the layered Fingerprint, it's worth pausing on the coverage table in Figure 3. It lists the attack classes in agentic systems that the mechanisms in this post look well placed to address, along with the one they deliberately leave out.
Figure 3: Attack-class coverage by signal. The Profile catches behavioral drift (silent model swaps, persistent prompt injection, runtime poisoning). The Fingerprint catches configuration changes the moment they hit admission. Single-shot prompt injection sits outside both signals by design.
What the evaluators caught
We ran a small experiment to see whether a tiny prompt change would show up in the eval layer.
We defined v1 as deliberately minimal: no tools, mistral-small:latest, and only one variable field: systemMessage. It had to answer in exactly two sentences and finish every response with the literal token kthxbye. We sent it four prompts: a simple math question, a multi-part color-and-count prompt, an obscure-date prompt where the model probably wouldn't know the answer, and a two-country instruction-following prompt.
An A2A Python client captured each response and generated a Jaeger-format trace that agentevals' converter could ingest.
For evaluation we wrote four Python evaluators and ran them through agentevals. Two were signature checks specific to v1's shape: ends_with_token and exact_sentence_count. The other two were universal rails that should hold for any reasonable answer: length_in_range and contains_per_prompt. Each evaluator used a threshold of 1.0, so a single miss flipped the metric to fail. No LLM judge in the loop.
v2 changed only three things in systemMessage. terse became verbose. two became three. kthxbye became goodbye now. Everything else stayed the same.
Figure 4: Eval results before and after the three-word systemMessage edit. The two signature evaluators (ends_with_token, exact_sentence_count) flipped from all-PASS to all-FAIL. The two universal rails (length_in_range, contains_per_prompt) stayed green. The HTTP layer noticed nothing.
N/E means the contains_per_prompt evaluator was scoped per-prompt, and one of the four prompts had no expected-substring fixture, so it was not evaluated by design (not a failure).
The two signature evaluators failed, because the response shape had changed. The universal rails stayed green, because the agent was still producing reasonable answers. Three small edits to one field were enough to invalidate the profile-specific signature, without breaking the generic safety checks.
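The four checks described above can be approximated in a few lines. This is a sketch, not the evaluators we ran: the functions are plain Python rather than the agentevals evaluator interface, the sentence splitter is naive, the length bounds are placeholders, and the two responses are invented to mimic the v1 and v2 shapes.

```python
import re
from typing import Iterable, Optional


def ends_with_token(response: str, token: str = "kthxbye") -> bool:
    """Signature check: the response must end with the literal token."""
    return response.rstrip().endswith(token)


def exact_sentence_count(response: str, expected: int = 2,
                         token: str = "kthxbye") -> bool:
    """Signature check: exactly `expected` sentences before the sign-off."""
    body = response.rstrip()
    if body.endswith(token):
        body = body[: -len(token)]
    # Naive split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
    return len(sentences) == expected


def length_in_range(response: str, low: int = 10, high: int = 600) -> bool:
    """Universal rail: character length within placeholder bounds."""
    return low <= len(response) <= high


def contains_per_prompt(response: str, expected: Optional[str]) -> Optional[bool]:
    """Universal rail: None means N/E (no fixture for this prompt)."""
    if expected is None:
        return None
    return expected.lower() in response.lower()


def metric_passes(verdicts: Iterable[Optional[bool]]) -> bool:
    """Threshold 1.0: every evaluated verdict must pass; N/E is skipped."""
    evaluated = [v for v in verdicts if v is not None]
    return bool(evaluated) and all(evaluated)


# Invented responses that mimic the v1 and v2 shapes.
v1 = "Two plus two is four. That is basic arithmetic. kthxbye"
v2 = "Two plus two is four. That is basic arithmetic. It never changes. goodbye now"

signature_v1 = [ends_with_token(v1), exact_sentence_count(v1)]
signature_v2 = [ends_with_token(v2), exact_sentence_count(v2)]
rails_v2 = [length_in_range(v2), contains_per_prompt(v2, "four")]
```

With these inputs both signature checks hold for v1 and fail for v2, while the rails stay green for v2. That is the same split the experiment showed.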
The decision rule writes itself.
Figure 5: The admission gate. kagent compares the incoming Fingerprint Sum to the previous one at apply time. A diff triggers a full re-eval. A match falls through to a sample gate (when only the Runtime layer moved) or a free carry-forward (when both layers are unchanged).
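A minimal sketch of that decision rule follows. It assumes per-layer hashes are available alongside the Sum and compares them directly. The dictionary keys and the Action names are hypothetical.

```python
from enum import Enum
from typing import Mapping


class Action(Enum):
    FULL_REEVAL = "full re-eval"
    SAMPLE_GATE = "sample gate"
    CARRY_FORWARD = "carry forward"


def admission_action(previous: Mapping[str, str],
                     incoming: Mapping[str, str]) -> Action:
    """Decide how much re-evaluation an incoming Fingerprint needs.

    Assumption: each mapping carries one hash per layer under the
    hypothetical keys "behaviorCritical" and "runtime".
    """
    if incoming["behaviorCritical"] != previous["behaviorCritical"]:
        return Action.FULL_REEVAL      # behavior may have changed
    if incoming["runtime"] != previous["runtime"]:
        return Action.SAMPLE_GATE      # probe first, then decide
    return Action.CARRY_FORWARD        # nothing moved, no eval cost


deployed = {"behaviorCritical": "sha256:b1", "runtime": "sha256:r1"}
prompt_edit = {"behaviorCritical": "sha256:b2", "runtime": "sha256:r1"}
image_bump = {"behaviorCritical": "sha256:b1", "runtime": "sha256:r2"}
```

The BehaviorCritical check comes first on purpose: if both layers move in one apply, the more expensive and more conservative path wins.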
BehaviorCritical, Runtime, Observed
If you made it this far, you probably already have your own ideas about what should invalidate a Profile, and when. To make the design space concrete, let's look at two extremes:
- Hash everything, invalidate on every patch: too expensive, and it produces re-eval storms on infrastructure noise.
- Trust the image hash: misses every silent change between image versions.
As usual, the answer is a middle ground we can refine over time.
type ProfileFingerprint struct {
	BehaviorCritical BehaviorCriticalInputs `json:"behaviorCritical"`
	Runtime          RuntimeInputs          `json:"runtime"`
	Observed         ObservedSignals        `json:"observed,omitempty"`
	Sum              string                 `json:"sum"`
}

type BehaviorCriticalInputs struct {
	ModelProvider                   string             `json:"modelProvider"`
	ModelID                         string             `json:"modelId"`
	VendorFingerprint               string             `json:"vendorFingerprint,omitempty"`
	ModelImageDigest                string             `json:"modelImageDigest,omitempty"`
	GenerationKnobs                 GenerationKnobs    `json:"generationKnobs"`
	SystemPromptHash                string             `json:"systemPromptHash"`
	Toolset                         []ToolDescriptor   `json:"toolset"`
	MCPServers                      []MCPServerBinding `json:"mcpServers,omitempty"`
	TransitiveAgentToolFingerprints map[string]string  `json:"transitiveAgentToolFingerprints,omitempty"`
	EvalBundleRef                   string             `json:"evalBundleRef"`
	EvalBundleHash                  string             `json:"evalBundleHash"`
}
A BehaviorCritical change forces full re-eval.
type RuntimeInputs struct {
	RuntimeImageDigest         string             `json:"runtimeImageDigest"`
	SidecarDigests             map[string]string  `json:"sidecarDigests,omitempty"`
	OTelExporterConfigHash     string             `json:"otelExporterConfigHash,omitempty"`
	ResourceLimits             ResourceLimitsHash `json:"resourceLimits,omitempty"`
	ServiceMeshPolicyHash      string             `json:"serviceMeshPolicyHash,omitempty"`
	LegacyConfigHash           string             `json:"legacyConfigHash,omitempty"`
	ServiceAccountTokenSurface SATokenSurface     `json:"serviceAccountTokenSurface,omitempty"`
}

type ObservedSignals struct {
	ResolvedModelDistribution     []ModelObservation `json:"resolvedModelDistribution,omitempty"`
	GatewayConfigSnapshotHash     string             `json:"gatewayConfigSnapshotHash,omitempty"`
	GenAIResponseIDSamples        []string           `json:"genAIResponseIDSamples,omitempty"`
	SystemFingerprintDistribution map[string]int32   `json:"systemFingerprintDistribution,omitempty"`
}
A Runtime change forces a re-eval on a small probe. Accept the carry-forward if within tolerance, full re-eval if not.
Observed holds raw span attributes collected via the OTel community OpenAI instrumentor.
Observed should sit outside of Sum. Vendor-side infrastructure variation (gateway fallback, vendor model revisions, retried responses) must not invalidate Profiles by accident. Otherwise the dedup hash becomes useless within a week. Observed creates a space for forensics and for a separate alerting path.
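One way to keep Observed out of the Sum is to hash each layer separately and then hash only the layers that are allowed to invalidate a Profile. The sketch below assumes canonical JSON (sorted keys, compact separators) as the serialization and SHA-256 as the digest. The post does not fix either choice, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def canonical_hash(value: Any) -> str:
    """SHA-256 over canonical JSON: sorted keys, compact separators."""
    encoded = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def fingerprint_sum(fingerprint: Dict[str, Any]) -> str:
    """Hash only the layers that may invalidate a Profile."""
    layers = {
        "behaviorCritical": canonical_hash(fingerprint["behaviorCritical"]),
        "runtime": canonical_hash(fingerprint["runtime"]),
        # "observed" is deliberately left out of the Sum.
    }
    return canonical_hash(layers)


base = {
    "behaviorCritical": {"systemPromptHash": "sha256:p1", "modelId": "m"},
    "runtime": {"runtimeImageDigest": "sha256:r1"},
    "observed": {"systemFingerprintDistribution": {"fp_a": 10}},
}
# Vendor-side noise: only the observed layer differs.
noisy = dict(base, observed={"systemFingerprintDistribution": {"fp_b": 10}})
# A real edit: the system prompt hash changes.
edited = dict(base, behaviorCritical={"systemPromptHash": "sha256:p2", "modelId": "m"})
```

With these inputs, base and noisy produce the same Sum while edited does not. Keeping the per-layer hashes as intermediate values also makes it cheap to tell which layer moved.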
Figure 6: The same admission gate with the BehaviorCritical / Runtime split made explicit. A BehaviorCritical change forces full re-eval and a fresh signed Profile. A Runtime-only change runs a small probe (n=10-30, ε≈0.05) and accepts the carry-forward if results sit within tolerance. Two unchanged layers means no eval cost.
Layered, with a sample re-eval gate, is a reasonable middle ground. The decision isn't whether to invalidate. It's how much invalidation buys you for how much cost.
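The sample gate itself can be very small. The sketch below assumes ε is an absolute difference in pass rate between the probe and the baseline, which is one possible reading of the tolerance. The function name and the min_samples default are hypothetical.

```python
from typing import Sequence


def probe_gate(baseline_pass_rate: float, probe_verdicts: Sequence[bool],
               epsilon: float = 0.05, min_samples: int = 10) -> bool:
    """Return True to accept the carry-forward, False to force a full re-eval.

    Assumption: tolerance is the absolute pass-rate difference.
    """
    if len(probe_verdicts) < min_samples:
        raise ValueError(f"probe needs at least {min_samples} samples")
    probe_pass_rate = sum(probe_verdicts) / len(probe_verdicts)
    return abs(probe_pass_rate - baseline_pass_rate) <= epsilon


clean_probe = [True] * 20
degraded_probe = [True] * 17 + [False] * 3

accept = probe_gate(1.0, clean_probe)      # identical pass rate
reject = probe_gate(1.0, degraded_probe)   # 0.85 vs 1.0
```

One consequence of small probes: with n=10 and a baseline of 1.0, a single miss moves the pass rate by 0.1 and already exceeds ε=0.05. The probe size and the tolerance have to be chosen together.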
One thing worth noting: kagent's OpenShell sandbox supervisor already carries policy_hash and config_revision. The kagent ecosystem is already comfortable with content-hashed state at the infrastructure layer. Behavior identity is the missing peer.
agentevals evaluates traces. kagent needs a contract.
agentevals is a good trace evaluator. It is not an agent driver. To use it today, you feed it pre-recorded trace files or stream live OTLP from an already-instrumented ADK agent. There's no agentevals chat, no agentevals send, no built-in HTTP client that calls an agent for you. The CLI ships run, serve, mcp, evaluator, list-metrics, migrate. None of them drives an agent.
Upstream hints at where this is going. The Postgres migration already reserves approach IN ('trace_replay', 'agent_invoke') in a CHECK constraint. The Python RunSpec.approach literal still ships only "trace_replay", but the slot is already there.
We ran agentevals locally: the CLI, the MCP server, JSON output via the stdout sink.
Now it's time to 'promote' the previously introduced EvalProvider abstraction to a CRD. We believe that other backends (Phoenix, Langfuse, or anything else) should be able to plug in by declaring what they support instead of waiting for kagent to compile in support for them.
apiVersion: kagent.dev/v1alpha2
kind: EvalProvider
metadata: {name: agentevals-default, namespace: kagent}
spec:
  endpoint: agentevals.kagent.svc.cluster.local:8000
  capabilities:
    mode: agentDriver
    wire: jsonHTTP
    returnsSignedVSA: false
  authSecretRef: {name: agentevals-token, key: token}
  defaultEvalSetRef: {kind: ConfigMap, name: default-eval-set}
Where the wiring lives matters just as much. With this architecture, evaluation lives at the platform layer. Agent developers don't have to bolt it into their own container (if they use BYO agents), and they define eval sets once, in kagent, instead of redefining them per agent. Wire it the other way around (eval logic inside each agent container) and you've committed to the wrong layer. The day you want a Profile-shaped feature operating across agents, you have to undo it everywhere.
One thing worth distinguishing: agentevals upstream already has a ResultSink plugin model, but it solves a different problem. A sink decides where results go. It receives per-row Result payloads and ships them to stdout, a database, a service, whatever. That's about delivery, not format.
{Provenance, Snapshot} lives one layer down. It's a result-format contract. kagent needs to know what was evaluated, against which profile, with which provider, against which resolved configuration. Pass-or-fail metrics alone don't tell you whether two results from yesterday and today are even comparable.
The gap isn't that agentevals needs one more sink. It's that kagent needs a provider-level contract: how a run starts, what the provider claims it can do, what evidence comes back.
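To illustrate why provenance matters for comparability, here is a minimal sketch of a Provenance record and a check that reports which fields differ between two results. The field set is hypothetical and loosely follows the items named above (profile, provider, resolved configuration, eval bundle). It is not a proposed schema.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Provenance:
    """What a result was measured against. All field names are hypothetical."""
    profile: str            # e.g. "kagent/k8s-agent-profile"
    provider: str           # e.g. "agentevals-default"
    fingerprint_sum: str    # resolved configuration that was evaluated
    eval_bundle_hash: str   # the eval set that produced the metrics


def incomparable_fields(a: Provenance, b: Provenance) -> List[str]:
    """Names of the fields that differ; an empty list means comparable."""
    return [f.name for f in fields(Provenance)
            if getattr(a, f.name) != getattr(b, f.name)]


yesterday = Provenance("kagent/k8s-agent-profile", "agentevals-default",
                       "sha256:aaa", "sha256:evals-1")
today = Provenance("kagent/k8s-agent-profile", "agentevals-default",
                   "sha256:aaa", "sha256:evals-2")

differences = incomparable_fields(yesterday, today)
```

Here the agent configuration is identical but the eval bundle changed, so differences is ["eval_bundle_hash"]. A drop in pass rate between the two runs would say nothing about the agent.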
Should a Profile be signed?
During this research we also wondered whether it makes sense to sign a Profile and, if so, how.
We didn't dig in enough to commit, but the first thing that came to mind was a VSA-style in-toto attestation attached to the agent's artifact, carrying the Fingerprint sum, the eval-bundle hash, and the snapshot. The closest existing predicate is https://slsa.dev/verification_summary/v1, but that's VSA-for-build, not VSA-for-agent-behavior. Maybe a new predicate type would be needed here?
Past that point we're guessing. Who signs (kagent, the eval provider, or both), how the predicate is registered, how verifiers handle it at admission. These are all open. The right answer probably depends on where the rest of the design takes us. Committing to an attestation shape before that would be premature.
Flagging it for later.
Where the category is converging
Solo.io launched agentevals at KubeCon + CloudNativeCon Europe on March 25, 2026. It's the clearest signal yet that the category is converging on this problem. The framing Idit Levine (Founder and CEO, Solo.io) gave in the launch press release is worth quoting verbatim:
"Evaluation is the biggest unsolved problem in agentic infrastructure today. Organizations have frameworks for building agents, gateways for connecting them, and registries for governing them, but no consistent way to know whether an agent is actually reliable enough to trust in production."
Other signals point to the same gap without closing it. The Linux Foundation's Agentic AI Foundation (December 2025) has an Observability & Traceability Working Group with no public output yet. The CNCF Cloud Native Agentic Standards blog (March 2026) frames governance as first-class but doesn't propose a shared eval format. OpenSSF model-signing covers weights, not behavior.
Argue with us
That's the proposal. Phase 1 is already running in our fork. Phase 2 is what this post has been arguing for, and it's the part we'd like to hash out with the rest of the kagent and agentevals community. The places we expect to be wrong are the obvious ones: where exactly to slice the Fingerprint, what an EvalProvider should be allowed to declare about itself, and how much structure to bake into the Provenance/Snapshot handshake. If you've thought about any of this, please come argue.
And if we got something wrong, tell us. That's the whole point.
References
- kagent
- agentevals (Solo.io)
- Solo.io press release: introducing agentevals
- Anthropic - April 23, 2026 postmortem
- Anthropic - September 17, 2025 postmortem
- OpenTelemetry semconv - GenAI events
- in-toto attestation predicates
- SLSA Verification Summary v1
- Linux Foundation - Agentic AI Foundation
- CNCF - Cloud Native Agentic Standards