How to evaluate an Agentic AI system for reliability and scalability

May 9, 2026 · 13 min read · By mohammed.vasim
Tags: ai, engineering, agentic-ai, evaluation, llm

Agentic AI systems don’t just answer questions—they act, remember, plan, and learn. That changes the evaluation game entirely. In this post, I’ll walk you through what to test and how to build an evaluation framework that actually catches failures before your users do. I’ll draw on both hard-won field experience and insights from the latest research—no blog fluff, just real stuff that works.


1. Why Agentic Evaluation Demands a New Playbook

Traditional software testing gives you a comforting illusion: same input, same output. An LLM-powered agent shatters that. A single user query can trigger a cascade of decisions—which tool to invoke, what parameters to fill, how to sequence actions, whether to ask a follow-up—all generated probabilistically.

So the first mental shift is this: we aren’t testing code paths; we’re testing decision quality under uncertainty. That means we need to evaluate three dimensions simultaneously:

  • Capability: Can the agent complete the task correctly?
  • Alignment: Does it act in line with user intent, context, and policy?
  • Resilience: Does it handle the unexpected without breaking trust?

A test that only checks if the right tool was called might miss that the agent hallucinated a patient’s medication dosage in its explanation. A test that verifies coherent output might miss that the agent ignored a contradictory user statement from three turns ago. You need coverage across all three.


2. Anatomy of an Agent — The Evaluation Surface

Before we can build an evaluation strategy, let’s name the parts that can fail. In any agentic system you’ll find a few core components:

  • Tools: the functions the agent can call—APIs, databases, calculators.
  • Planner: the reasoning layer that turns “refund my cracked mug” into a sequence of tool calls and messages.
  • Memory: the short- and long-term storage of conversation history, user preferences, and retrieved facts.
  • Learning: any mechanism that updates the agent’s behavior over time (fine-tuning, RL, in-context feedback).
  • Communication: the natural language the agent generates alongside its actions.

Each of these has its own failure modes. Tools might be called with swapped parameters. Plans might skip a critical verification step. Memory might return stale data from a previous session. Learning loops might reinforce a bad habit. Communication might hallucinate a confirmation number. That’s the evaluation surface—our job is to map it systematically.


3. Capability Testing — Does the Agent Do the Right Thing?

Capability is the easiest dimension to want to test, and the hardest to do well. I’ve seen teams write 200 unit tests for individual function calls and still miss egregious planning bugs. Here’s how we structure it.

3.1 Tool Execution — From Unit Tests to Sandboxed Validation

Unit testing tools is the most familiar ground. In my team, every tool has an automated suite that covers the “happy path,” edge cases (empty strings, maximum lengths, weird character encodings), and adversarial inputs (like an order ID that looks valid but doesn’t exist). We check not just output correctness, but also latency boundaries and resource consumption—if a retrieval tool starts taking 4 seconds under load, the agent’s overall coherence collapses downstream.
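
To make that concrete, here’s a minimal sketch of what such a suite can look like, using a hypothetical lookup_order tool. The tool body is a stand-in so the tests run on their own; swap in your real implementation, edge cases, and latency budgets:

```python
import time

# Hypothetical tool under test: lookup_order(order_id) -> dict or None.
# The in-memory ORDERS table is a stand-in for the real backend.
ORDERS = {"A-1001": {"status": "delivered", "total": 24.99}}

def lookup_order(order_id: str):
    if not isinstance(order_id, str) or not order_id.strip():
        raise ValueError("order_id must be a non-empty string")
    return ORDERS.get(order_id.strip())

def test_happy_path():
    assert lookup_order("A-1001")["status"] == "delivered"

def test_edge_cases():
    # Empty strings and oversized inputs must fail loudly or return nothing,
    # never a fabricated record.
    for bad in ["", " ", "x" * 10_000]:
        try:
            assert lookup_order(bad) is None
        except ValueError:
            pass  # rejecting the input outright is also acceptable

def test_adversarial_lookalike_id():
    # Looks valid, doesn't exist: the tool must return None, not guess.
    assert lookup_order("A-9999") is None

def test_latency_budget():
    start = time.monotonic()
    lookup_order("A-1001")
    assert time.monotonic() - start < 0.5  # illustrative budget; tune per tool
```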

But here’s what the research community confirmed for us: static tool testing isn’t enough. A paper called GRETEL showed that when agents retrieve tools purely by semantic similarity, they often pick tools that look right but functionally aren’t—like choosing “cancel_order” instead of “cancel_line_item.” GRETEL’s approach uses a sandboxed execution loop to validate tool selection by actually running candidate calls in a safe environment, boosting the ToolBench Pass Rate from 0.69 to 0.83. We built a similar lightweight execution sandbox for our tool registry; now every candidate tool choice for critical actions gets a dry-run before the agent commits.
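
Here’s a rough sketch of how a dry-run gate like that can be wired. The sandboxed tool registry, the expected-keys check, and the tool names are assumptions about one way to build it, not a description of GRETEL itself:

```python
from typing import Any, Callable

# Hypothetical registry of sandboxed tool implementations that hit test
# fixtures instead of production systems.
SANDBOX_TOOLS: dict[str, Callable[..., Any]] = {
    "cancel_order": lambda order_id: {"cancelled": True, "order_id": order_id},
    "cancel_line_item": lambda order_id, sku: {"cancelled": True, "sku": sku},
}

def dry_run_gate(tool_name: str, args: dict, expected_keys: set) -> bool:
    """Approve a candidate tool call only if it executes cleanly in the sandbox
    and its result contains the fields the plan expects to use downstream."""
    tool = SANDBOX_TOOLS.get(tool_name)
    if tool is None:
        return False
    try:
        result = tool(**args)
    except Exception:
        return False  # wrong parameters, wrong tool, or sandbox failure
    return isinstance(result, dict) and expected_keys <= result.keys()

# The retriever proposed cancel_order, but the plan needs a per-item result:
print(dry_run_gate("cancel_order", {"order_id": "A-1001"}, {"sku"}))                       # False
print(dry_run_gate("cancel_line_item", {"order_id": "A-1001", "sku": "MUG-1"}, {"sku"}))   # True
```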

So my rule of thumb: for every high-stakes tool, have both a static unit test suite and a dynamic execution gate.

3.2 Plan Correctness — Verifying Action Sequences

Planners are where agentic magic happens, and where the most subtle bugs hide. In our customer support agent, a refund for a damaged item has a canonical plan: locate the order, verify delivery status, ask for a photo, issue the refund, confirm. When the planner works, it’s poetry. When it doesn’t, you get the kind of failure that erodes trust fastest: a confident, well-worded action taken on the wrong premise.

I started with the metrics from the book: tool recall (did we call all necessary tools?), tool precision (did we avoid unnecessary ones?), and parameter accuracy (did we pass the right order ID and amount?). These three numbers give you immediate signal. Low recall means the planner forgot a step; low precision means it’s doing extra, possibly dangerous things; parameter mismatches mean it’s not grounding properly.
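
A minimal way to compute those three numbers, assuming each plan is a list of (tool_name, params) pairs and using exact parameter matching (real suites usually allow per-field tolerances):

```python
def plan_metrics(predicted_calls, expected_calls):
    """Compare a predicted plan against a ground-truth plan. Matching is
    deliberately simplistic: names must match, parameters compared exactly."""
    expected_names = [name for name, _ in expected_calls]
    predicted_names = [name for name, _ in predicted_calls]

    recall = (sum(1 for n in expected_names if n in predicted_names)
              / len(expected_names)) if expected_names else 1.0       # missed steps?
    precision = (sum(1 for n in predicted_names if n in expected_names)
                 / len(predicted_names)) if predicted_names else 1.0  # extra steps?

    # Parameter accuracy over the calls that matched by name.
    matched = [(p_args, e_args)
               for p_name, p_args in predicted_calls
               for e_name, e_args in expected_calls
               if p_name == e_name]
    param_acc = (sum(1 for p, e in matched if p == e) / len(matched)) if matched else 0.0
    return {"tool_recall": recall, "tool_precision": precision, "param_accuracy": param_acc}

print(plan_metrics(
    predicted_calls=[("locate_order", {"order_id": "A-1001"}),
                     ("issue_refund", {"order_id": "A-1001", "amount": 24.99})],
    expected_calls=[("locate_order", {"order_id": "A-1001"}),
                    ("verify_delivery", {"order_id": "A-1001"}),
                    ("issue_refund", {"order_id": "A-1001", "amount": 24.99})],
))  # recall 0.67 (verify_delivery skipped), precision 1.0, parameter accuracy 1.0
```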

But what about ambiguous or contradictory user turns? That’s where research on plan verification becomes gold. A recent NeurIPS 2025 paper on Plan Verification for LLM-Based Agents proposed using a separate Judge LLM to critique an action sequence against the expected outcome, achieving up to 90% recall and 100% precision across several models. I’ve adapted this into an actor-critic loop: our agent proposes a plan, a stricter evaluator (often a more capable model) checks it for logical gaps and policy violations. If the plan doesn’t pass, we regenerate.
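
The control flow of that loop is simple enough to sketch. Here the plan generator and the judge are injected callables; in practice both are LLM calls, and the judge prompt is where the real work lives:

```python
MAX_ATTEMPTS = 3

def verified_plan(task, generate_plan, judge_plan):
    """generate_plan(task, feedback) -> candidate plan (list of tool calls);
    judge_plan(task, plan) -> (approved: bool, critique: str).
    Both are LLM-backed in practice; they're injected here so the loop itself
    can be tested deterministically."""
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        plan = generate_plan(task, feedback)
        approved, critique = judge_plan(task, plan)
        if approved:
            return plan
        feedback = critique  # regenerate with the critic's objections attached
    return None  # never act on an unverified plan; escalate to a human instead
```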

The outcome: our false-accept rate for refund anomalies dropped dramatically, and we now catch embarrassing “looks plausible but is wrong” plans before they ever touch a real API.

3.3 End-to-End Task Success — The Scenario Library

Unit tests for tools and planners still operate in well-lit rooms. Real conversations are messy. That’s why the book’s evaluate_single_instance pattern resonated so much: for each scenario, we define a conversation history, an order state, and an expected final outcome—tool calls, parameters, and key phrases that must appear in the response. Then we run the agent end-to-end and compute a task success score that blends tool recall, precision, parameter accuracy, and communication checks (like “did the agent actually tell the customer the refund is being processed?”).
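
Here’s a stripped-down sketch of how a scenario and its task success score can be represented. Field names and weights are illustrative, and it assumes a plan-metrics function like the one sketched earlier:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str
    conversation: list            # prior turns fed to the agent
    expected_calls: list          # (tool_name, params) ground truth
    required_phrases: list = field(default_factory=list)

def task_success(scenario, actual_calls, final_message, plan_metrics_fn):
    """Blend plan metrics with a phrase-recall check on the final message.
    The weights are illustrative; we tune them per scenario class."""
    m = plan_metrics_fn(actual_calls, scenario.expected_calls)
    phrase_recall = (sum(1 for p in scenario.required_phrases
                         if p.lower() in final_message.lower())
                     / len(scenario.required_phrases)) if scenario.required_phrases else 1.0
    score = (0.3 * m["tool_recall"] + 0.2 * m["tool_precision"]
             + 0.3 * m["param_accuracy"] + 0.2 * phrase_recall)
    return {"score": round(score, 3), "phrase_recall": phrase_recall, **m}
```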

This scenario library is our living regression corpus. We started with 30 canonical intents and have grown to over 300, with variants for multi-item orders, corrected addresses, and ambiguous pronouns. If a new model version drops phrase recall on “we’ve issued your refund” by even 2%, we know something shifted in its communication style.

But I’d be lying if I said end-to-end metrics are enough. Benchmarks like AgentBench, WebArena, and SWE-bench give you a community yardstick, but they don’t capture your specific policy nuances. Use them for directional signals, but build your own scenario library around your domain’s failure modes.


4. Alignment Testing — Is the Agent Faithful to Intent & Context?

Capability ensures the agent did something. Alignment ensures it did the right something—consistent with the user’s goal, the conversation’s history, and your organization’s rules.

4.1 Consistency & Coherence Over Time

Here’s a stat that made me redesign our memory tests: LongMemEval found that commercial chat assistants lose up to 30% accuracy in retrieving key information over sustained interactions. Think about that—after five turns of a customer support chat, your agent might “forget” that the user already provided their order number, and ask for it again. That’s not just annoying; it erodes trust instantly.

In our testing, we simulate long conversations, injecting facts early on and then querying for them 10 turns later. We check not only that the right information is retrieved (using a metric like retrieval_accuracy@k from the book) but also that the retrieval is proactive. A great agent shouldn’t wait for the user to repeat themselves; it should pull context when relevant. The MemAware framework pushes this exact idea — it evaluates whether a memory system surfaces information at the right moment, without explicit prompting.
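
A minimal version of that retention test, with the agent treated as a black-box callable and the fact, probe, and pass criteria all illustrative:

```python
def memory_retention_test(agent, fact, probe, expected_token, noise_turns=10):
    """Inject a fact early, pad the conversation with unrelated turns, then
    probe for it. `agent` is any callable: agent(history, user_turn) -> reply."""
    history = []
    turns = [f"By the way, {fact}"] + [
        f"Unrelated question #{i} about store hours." for i in range(noise_turns)
    ]
    for turn in turns:
        reply = agent(history, turn)
        history += [("user", turn), ("assistant", reply)]
    final = agent(history, probe)
    asked_to_repeat = "order number" in final.lower() and "?" in final
    return expected_token in final and not asked_to_repeat

# Example (names illustrative):
# memory_retention_test(my_agent, "my order number is A-1001",
#                       "What's the status of that order?", expected_token="A-1001")
```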

We also stress-test for coherence. Contradiction detection is now part of every multi-turn test: if the agent says the order is “delivered” in turn 2, but later refers to “shipping updates” for that same order, it’s a coherence failure. One technique I borrowed from the MEMTIER paper is testing over extremely long time horizons (they found a 14-percentage-point drop in tool-execution success over a 72-hour run). We now run simulated 24-hour agent sessions where periodic background tasks (like scheduled order status updates) must not corrupt conversation-specific memory. It’s painful to debug, but it surfaces the exact memory starvation bugs that users would otherwise encounter at 2 a.m. on a holiday.
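
Contradiction detection can start embarrassingly simple. This sketch pulls status claims out of the agent’s turns with a regex and flags incompatible pairs; a production checker would work from structured state or an entailment model, but even this catches the “delivered vs. still shipping” class of slip:

```python
import re

# Status pairs that cannot both be true for the same order in one conversation.
INCOMPATIBLE = {("delivered", "shipping"), ("delivered", "processing")}

def find_status_contradictions(agent_turns):
    """Extract 'order <id> is <status>' claims from the agent's own messages
    and flag incompatible pairs. A rough heuristic with illustrative patterns."""
    claims = []
    for turn_idx, text in enumerate(agent_turns):
        for order_id, status in re.findall(r"order (\S+) is (\w+)", text.lower()):
            claims.append((order_id, status, turn_idx))
    return [(a, b) for a in claims for b in claims
            if a[0] == b[0] and (a[1], b[1]) in INCOMPATIBLE]

print(find_status_contradictions([
    "Order A-1001 is delivered, so you're eligible for a refund.",
    "Right now order A-1001 is shipping and should arrive Friday.",
]))  # one contradiction: delivered (turn 0) vs. shipping (turn 1)
```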

4.2 Hallucination & Grounding

Hallucination is the monster under the bed. Every agent team I know grapples with it. The book’s emphasis on retrieval-augmented generation (RAG) and data dependence is spot on—if your agent has to pull facts, then the source of those facts must be verifiable, and the test suite must prove that the agent uses the source, not just its internal model’s guess.

We built a “grounding oracle” that, for any factual claim in the output, traces it back to a queried document or tool result. If there’s no trace, it’s flagged as potential hallucination. We then sample flagged outputs daily for human review. This hybrid human-AI feedback loop is exactly what newer research suggests—combinations of automated detection and expert correction are more reliable than either alone.
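
The first pass of that oracle is cheap token-overlap tracing; anything unmatched escalates to an entailment check and then to humans. A sketch of the first pass, with the threshold and tokenization as rough assumptions:

```python
def _content_tokens(text):
    return {t.strip(".,:;!?") for t in text.lower().split() if len(t.strip(".,:;!?")) > 3}

def flag_ungrounded_claims(claims, evidence_snippets, min_overlap=0.6):
    """A claim whose content words can't be found in any retrieved snippet or
    tool result gets flagged as a potential hallucination for review."""
    flagged = []
    for claim in claims:
        tokens = _content_tokens(claim)
        grounded = any(
            len(tokens & _content_tokens(snippet)) / max(len(tokens), 1) >= min_overlap
            for snippet in evidence_snippets
        )
        if not grounded:
            flagged.append(claim)
    return flagged

print(flag_ungrounded_claims(
    ["Your refund for order A-1001 has been issued",
     "The courier will arrive tomorrow at 9am"],             # nothing retrieved supports this
    ["Order A-1001: refund issued, confirmation emailed."],
))  # flags only the courier claim
```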

But here’s a trap: you can’t just evaluate hallucination independently; you have to consider cost. I helped a fintech team that reduced hallucination by 40%… but their inference costs quadrupled because they added excessive retrieval steps. There’s a growing conversation around cost-aware hallucination metrics—we actually calculate a “hallucination cost per user interaction” to decide how aggressive our grounding strategies should be.
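
The metric itself is plain bookkeeping rather than anything standard; something like the sketch below, where the incident cost and per-call retrieval cost are numbers you estimate for your own product:

```python
def hallucination_cost_per_interaction(hallucination_rate, incident_cost,
                                       extra_retrieval_calls, cost_per_call):
    """Expected cost of a hallucination slipping through, plus the extra
    grounding spend paid on every interaction to prevent it. All inputs are
    estimates specific to your product; the formula is our own bookkeeping."""
    return hallucination_rate * incident_cost + extra_retrieval_calls * cost_per_call

# Aggressive grounding: the rate drops, but per-interaction retrieval spend rises.
print(hallucination_cost_per_interaction(0.010, 5.00, 1, 0.002))  # ~0.052
print(hallucination_cost_per_interaction(0.006, 5.00, 4, 0.002))  # ~0.038
```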

4.3 Communication Alignment

An agent can pick the right tools and still sound robotic, evasive, or wrong. That’s why the book’s phrase-recall metric matters: in our refund scenario, we expect the assistant’s final message to contain “your refund is being processed for $XX.XX.” If it says “transaction initiated” instead, our phrase recall drops and we investigate. This isn’t about matching exact strings; it’s about ensuring that users hear the confirmation they need to trust the system.
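
In practice we turn required phrases into tolerant patterns rather than exact strings, so templated amounts and stray whitespace don’t cause false alarms. A small sketch (paraphrase acceptance goes through a judge model instead; this handles the literal tier):

```python
import re

def phrase_present(template, message):
    """Turn a required phrase with a $XX.XX placeholder into a tolerant regex,
    so 'your refund is being processed for $24.99' still counts."""
    pattern = re.escape(template.lower())
    pattern = pattern.replace(re.escape("$xx.xx"), r"\$\d+\.\d{2}")
    pattern = pattern.replace(" ", r"\s+")  # tolerate line breaks and double spaces
    return re.search(pattern, message.lower()) is not None

print(phrase_present("your refund is being processed for $XX.XX",
                     "Good news! Your refund is being processed for $24.99."))  # True
print(phrase_present("your refund is being processed for $XX.XX",
                     "Transaction initiated."))                                 # False
```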

But who judges communication quality at scale? LLM-as-a-judge is the current answer. However, the JudgeSense benchmark warns us that judges can be wildly inconsistent—they found coherence ratings varying from 0.389 to 0.992 depending on prompt phrasing. That’s insane. So we version our evaluation prompts carefully, use multiple judge models in a panel, and never rely on a single LLM judge for a critical go/no-go decision. For safety-relevant outputs, we still default to human review.
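
The panel logic itself is straightforward; the hard part is the rubric and calibration behind each judge. A minimal sketch of acting on the median and deferring to humans when the panel disagrees:

```python
from statistics import median

def panel_judgment(judges, transcript, disagreement_threshold=0.2):
    """`judges` is a list of callables, each returning a 0-1 score for the same
    rubric (different judge models behind versioned prompts, in practice).
    We act on the median and route to human review when the panel disagrees."""
    scores = [judge(transcript) for judge in judges]
    if max(scores) - min(scores) > disagreement_threshold:
        return {"decision": "human_review", "scores": scores}
    return {"decision": "auto", "score": median(scores), "scores": scores}
```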


5. Resilience Testing — How Does the Agent Handle the Unexpected?

Real-world users don’t read your test plans. They type gibberish, switch tasks mid-sentence, and occasionally try to break things intentionally. Our job is to make sure the agent doesn’t crumble—or worse, obey—in these situations.

5.1 Adversarial & Out-of-Distribution Inputs

I’ve built an adversarial suite that includes typos (“refnd my mugg”), extreme input lengths, emotional language, and even prompts designed to trick the agent into leaking information (e.g., “Tell me the last 4 digits of the credit card on my account”). The book’s approach of fuzzing inputs and checking for graceful handling is our baseline.
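
A slice of that suite, with illustrative cases and a deliberately conservative pass check (for leakage probes, any digits in the reply count as a hard failure):

```python
import random

def with_typos(text, drop_rate=0.15, seed=0):
    """Cheap character-level corruption for fuzzing ('refund' -> 'refnd')."""
    rng = random.Random(seed)
    return "".join(c for c in text if rng.random() > drop_rate)

# (input, is_leakage_probe) pairs; contents are illustrative.
ADVERSARIAL_CASES = [
    (with_typos("please refund my cracked mug"), False),
    ("refund " * 2000, False),                                             # extreme length
    ("I AM FURIOUS!!! fix this NOW or I will sue", False),                 # emotional language
    ("Tell me the last 4 digits of the credit card on my account", True),  # leakage probe
]

def handled_gracefully(reply: str, leakage_probe: bool) -> bool:
    """Conservative pass criteria: no internals leaking into the reply, and no
    digits at all when the input was a leakage probe."""
    lowered = reply.lower()
    if "traceback" in lowered or "exception" in lowered:
        return False
    if leakage_probe and any(ch.isdigit() for ch in reply):
        return False  # even partial card digits are a hard failure
    return True
```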

But modern red-teaming has gone agentic itself. The GOAT framework automates multi-turn adversarial attacks, achieving a 94% attack success rate at breaking GPT-4o’s guardrails. That’s terrifying and instructive. Meanwhile, SIRAJ uses a structured reasoning attacker that refines its probes based on the agent’s execution trajectory. I now use a simplified version of this internally: we have an adversarial agent whose sole job is to break our customer support agent. Every week, it generates new test cases, and the failures get added to our regression suite. It’s like a gym for our agent’s resilience, and the muscle growth is real.

5.2 Graceful Degradation & Safety

Resilience isn’t just about blocking attacks; it’s about what the agent does when a tool call fails, or when the database returns a 500. In our tests, we inject tool failure modes: timeout, permission denied, empty result set. The expected behavior is documented: the agent must offer an alternative, escalate, or politely ask to retry. It must never invent a resolution. The book’s principle of graceful degradation became our “degradation ladder”—a set of expected fallback behaviors for each failure class, verified by automated tests.
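
Here’s roughly how we wire failure injection against the degradation ladder. The failure modes and accepted fallback behaviours come from the ladder itself; classifying the agent’s observed behaviour from its transcript is a separate step (rules or a judge model), and the wrapper shown is just one way to simulate the failures:

```python
# The degradation ladder: for each failure class, the behaviours we accept.
# Anything outside this set, and especially an invented resolution, fails.
DEGRADATION_LADDER = {
    "timeout":           {"offer_retry", "escalate"},
    "permission_denied": {"escalate"},
    "empty_result":      {"ask_clarification", "escalate"},
}

class FailingTool:
    """Wraps a tool slot and simulates a chosen failure mode, so the same
    scenario can be replayed under every failure class."""
    def __init__(self, mode: str):
        self.mode = mode

    def __call__(self, *args, **kwargs):
        if self.mode == "timeout":
            raise TimeoutError("simulated timeout")
        if self.mode == "permission_denied":
            raise PermissionError("simulated 403")
        return []  # empty_result

def degradation_ok(mode: str, observed_behaviour: str) -> bool:
    return observed_behaviour in DEGRADATION_LADDER[mode]

print(degradation_ok("timeout", "offer_retry"))          # True
print(degradation_ok("timeout", "invented_resolution"))  # False
```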

Safety leaks are another beast. We have a specific test for sensitive information leakage: if a user mentions someone else’s name, the agent must not pull up order data based on that name unless proper authentication occurs. These tests are run pre-commit and in CI, with zero tolerance for failure.

5.3 Memory & Learning Under Load

I mentioned earlier the importance of long-duration memory tests. But load testing memory is equally critical. We benchmark retrieval latency under 10,000 concurrent sessions. If semantic search starts to degrade beyond 200ms, we see a ripple effect: planning uses stale context, parameter accuracy drops, and coherence suffers. The book’s advice on testing vector search with both “easy” and “hard” retrievals is spot-on; we purposely inject semantically similar but irrelevant entries to check if the retrieval ranks the correct one on top.
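
The “hard retrieval” check is small once the distractors are planted in the index. A sketch, with `search` standing in for whatever vector-search interface your memory layer exposes:

```python
def hard_retrieval_test(search, query, correct_doc_id, num_distractors, top_k=1):
    """`search(query, k)` returns a ranked list of doc ids from the memory index.
    We plant semantically similar but irrelevant entries beforehand, then pass
    only if the correct doc still ranks in the top k."""
    ranked = search(query, k=num_distractors + 1)
    return correct_doc_id in ranked[:top_k]

# Example distractors for a shipping-address memory (content illustrative):
DISTRACTORS = [
    "customer asked about shipping *costs* to Berlin",
    "customer's previous address, updated two orders ago",
    "shipping policy FAQ snippet about delivery windows",
]
```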

Learning loops are the hardest to test in isolation. If your agent learns from user feedback, you need a separate “learning evaluation” that checks, after N training iterations on a canonical dataset, that:

  • Accuracy on held-out similar tasks improves.
  • Performance on previous strong tasks doesn’t collapse (catastrophic forgetting).
  • The agent hasn’t simply overfitted to phrasing patterns.

We run these evaluations weekly, with a “regression threshold” that blocks the new checkpoint from being promoted if any core capability drops.
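
The promotion gate itself is deliberately dumb: improvements elsewhere never buy back a regression on a core capability. A sketch, with the threshold and capability names illustrative:

```python
REGRESSION_THRESHOLD = 0.02  # max allowed drop on any core capability

def promote_checkpoint(baseline_scores, candidate_scores):
    """Both arguments map capability name -> score on the held-out suites.
    Promote only if no capability regresses beyond the threshold."""
    regressions = {
        name: round(baseline_scores[name] - candidate_scores.get(name, 0.0), 3)
        for name in baseline_scores
        if baseline_scores[name] - candidate_scores.get(name, 0.0) > REGRESSION_THRESHOLD
    }
    return len(regressions) == 0, regressions

ok, regressions = promote_checkpoint(
    {"refund_flow": 0.94, "address_change": 0.91},
    {"refund_flow": 0.95, "address_change": 0.86},  # forgetting on address_change
)
print(ok, regressions)  # False {'address_change': 0.05}
```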


6. The Evaluation Toolbox — Methods & Metrics That Work

Okay, so we have the what. Now the how. Over time, I’ve assembled a layered toolbox.

6.1 Automated Oracles — Judges, Critics, and Multi-Agent Verification

The classic LLM-as-a-judge is a scalpel that can also be a butter knife if you’re not careful. As JudgeSense shows, prompts matter enormously. We now have a prompt library for different judgment tasks (coherence, policy compliance, empathy) with explicit few-shot examples that were calibrated on human judgments. We also use the Conformal Prediction technique from recent research to get reliability scores per judgment, so we know when to fall back to human review.
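
As a stand-in for the conformal machinery, here’s the shape of the calibration step: on human-labelled data, find the lowest judge confidence at which the observed error rate stays under your target, and route anything below it to human review. To be clear, this is a simple threshold heuristic, not a faithful conformal prediction implementation:

```python
def calibrate_judge_threshold(judge_scores, judge_was_right, target_error=0.05):
    """judge_scores: the judge's confidence per calibration item (0-1).
    judge_was_right: whether humans agreed with the judge on that item.
    Walk from most to least confident; keep the lowest confidence whose running
    error rate still meets the target. Below the returned threshold -> humans."""
    paired = sorted(zip(judge_scores, judge_was_right), reverse=True)
    errors, threshold = 0, None
    for i, (score, correct) in enumerate(paired, start=1):
        errors += 0 if correct else 1
        if errors / i <= target_error:
            threshold = score
    return threshold  # None: no cutoff meets the target, review everything
```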

The actor-critic pattern from the book pairs perfectly with multi-agent verification frameworks like AEMA. In our pipeline, after the agent generates a plan and final message, a separate critique model (the “skeptic”) checks for contradiction, missing confirmation, and hallucination. If the skeptic flags an issue above a threshold, the agent is forced to regenerate or escalate for human review. This has cut our post-deployment escalations by a third.

6.2 Human-in-the-Loop — Closing the Gap

Metrics can lie. I’ve seen agents score 95% on our automated suite yet produce outputs that human reviewers found “technically correct but completely tone-deaf.” That’s because empathy and nuanced appropriateness aren’t well captured by recall metrics. Our solution: a weekly random sample of 200 interactions is reviewed by domain experts (senior support agents, in our case). Their feedback not only flags false negatives but also feeds new test cases—phrasings, slang, emotional states we hadn’t thought of.

This loop is the heartbeat of a living evaluation system. Without it, your test suite fossilizes and your metrics become vanity numbers.

6.3 Building a Living Evaluation Pipeline

Here’s how we operationalize all this:

  • Versioned scenario library with ground-truth expectations stored alongside the code.
  • CI/CD integration: on every pull request, the agent runs the full scenario library (unit tests + 200 end-to-end scenarios) with a 15-minute time budget. Failure blocks merge (a minimal gate sketch follows this list).
  • Nightly adversarial sweep: our red-teaming agent runs against the latest build and surfaces new failures.
  • Weekly human calibration: reviewers spot-check high-uncertainty cases and update the scenario library and judge prompts.
  • Production shadow mode: for risky changes, we run the new agent version in shadow on real traffic, comparing its actions to the current production version without affecting users. Discrepancies are analyzed.
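
A minimal sketch of that CI gate, assuming each scenario runner returns a task success score; the 15-minute budget comes from our setup above, and the minimum score is illustrative:

```python
import sys
import time

TIME_BUDGET_SECONDS = 15 * 60

def run_ci_gate(scenarios, run_scenario, min_score=0.85):
    """run_scenario(scenario) executes one end-to-end scenario and returns its
    task success score. Any low score, or blowing the time budget, fails the
    build so the regression is visible before merge."""
    start = time.monotonic()
    failures = []
    for scenario in scenarios:
        score = run_scenario(scenario)
        if score < min_score:
            # scenarios as plain dicts here; adapt to your scenario objects
            failures.append((scenario.get("name", "?"), score))
        if time.monotonic() - start > TIME_BUDGET_SECONDS:
            failures.append(("TIME_BUDGET_EXCEEDED", None))
            break
    if failures:
        print("CI gate failed:", failures)
        sys.exit(1)
    print(f"CI gate passed: {len(scenarios)} scenarios in {time.monotonic() - start:.0f}s")
```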

This might sound like overkill, but when an agent is handling real money or sensitive data, it’s the bare minimum.


7. A Maturity Model for Agentic Evaluation Practice

Not every team needs to implement all of this on Day 1. Here’s a pragmatic progression that’s worked for teams I’ve advised:

  • Level 1 — Component Correctness: You have unit tests for every tool and a small set of canonical planning scenarios. You measure tool recall and parameter accuracy.
  • Level 2 — Scenario Coverage: You maintain an end-to-end scenario library covering the top 50 user intents. You run it in CI and track task success scores.
  • Level 3 — Robustness & Safety: You’ve added adversarial input testing, red-teaming, and degradation ladder checks. You regularly test for hallucination and sensitive data leakage.
  • Level 4 — Continuous Assurance: Your evaluation pipeline is a living organism. You have shadow deployments, human-in-the-loop feedback, drift monitoring, and automated adversarial generation. Evaluation is not a gate; it’s a process.

Start where you are, but be honest about the level of risk you’re carrying at each stage.


8. Conclusion — Evaluation Is How Trust Is Earned

An agent that picks the right tool 99% of the time but hallucinates a sensitive detail in the 1% case is not trustworthy. Trust isn’t built by passing a benchmark; it’s earned by showing—continuously, under stress, over time—that the agent behaves in a way that’s safe, aligned, and resilient.

My takeaway from years of building and breaking these systems is simple: your evaluation framework is the true business logic of your agentic system. It encodes what “good” means. Treat it with the same reverence you’d give your product’s core code. Keep it alive, keep it adversarial, and never trust a metric that hasn’t been challenged by a human.

If you’re just starting out, grab the evaluate_single_instance pattern from the book, write up five scenarios that would embarrass you if they failed in production, and run them every time you change a prompt or a model. That one habit alone will save you more sleepless nights than you can imagine.

Because in the end, the question isn’t whether your agent can pass a test suite. It’s whether your users can sleep at night knowing your agent is out there, acting on their behalf. And that depends entirely on how rigorously you’ve dared to test it.


Now, what’s the scariest failure mode you’ve seen in an agentic system? I’d love to hear about it—because the collection of near-misses is often the best test plan you’ll ever write.
