Part 1: How to evaluate agents with grounded theory
A practical guide to using grounded theory to sample traces, identify failure modes, and turn qualitative judgment into scalable evaluation.
This is part one of a series of articles on evaluations and on building agentic workflows that hold up in production.
Evaluating agentic workflows is hard in precisely the ways that matter: the failure modes are often subtle, context-dependent, and not especially friendly to neat benchmarks. In this post I will walk through the methodology I use to design evaluations around grounded theory. The approach is best suited to workflows where the thing you care about is a subjective or qualitative judgment, not just a tidy numeric outcome.
Foreword
Before I get into the methodology, there are two things worth stating plainly.
- Subject matter experts are not optional: If you think you can design a workflow or integration without consulting the people who understand the domain, you are setting yourself up to fail. Bring them in early, use them aggressively, and do not treat them as a rubber stamp. If your evaluation does not reflect how the real domain works, it is already broken.
- Evaluations are multi-faceted: This method is aimed at the qualitative aspects of agentic output that are hard to quantify cleanly. That does not make programmatic evaluation less useful. It just means the two need to work together. Use code-based checks where they are strong, and use grounded theory where the behaviour is messy, contextual, and resistant to simple rules.
Methodology
The approach I apply to evaluation is rooted in mixed-methods grounded theory — specifically, an Exploratory Sequential design. If you want to go deep on the theory, Corbin & Strauss is the place to start. In practice, it means: sample, explore, and iterate until a stable theory emerges from your traces. That theory then drives the construction of a quantitative scorer — not the other way around. I prefer that order because it keeps the metric honest; you are operationalising what the traces actually say, not forcing the traces to obey a metric you chose in advance.
The qualitative phase — explore the traces
The goal of this phase is not to measure anything. It is to understand what is actually happening when your agent runs, and to give that behaviour a vocabulary that is precise enough to reason about later. This phase consists of an iterative process with four steps.
Sample
Start by collecting interaction traces across varied tasks, users, and contexts. This is a purposive sample, not a random one — chosen for diversity of experience. If you have access to real users at this stage, use them. It is fine if the agent breaks; it should break. Resist the urge to optimize anything. At this stage you want breadth, not depth.
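One way to keep the sample broad is to group traces by the dimensions you care about and draw a few from each group. The following is a minimal sketch, assuming each trace is a dict with hypothetical `task` and `user_segment` fields; the field names and the per-group quota are illustrative and not part of any particular tracing tool.

```python
import random
from collections import defaultdict

def purposive_sample(traces, keys=("task", "user_segment"), per_group=2, seed=0):
    """Pick up to `per_group` traces from every combination of `keys`.

    Assumption: each trace is a dict that carries the fields named in `keys`.
    """
    rng = random.Random(seed)

    # Bucket traces by the combination of attributes we want diversity over.
    groups = defaultdict(list)
    for trace in traces:
        groups[tuple(trace[k] for k in keys)].append(trace)

    # Take a small random draw from each bucket so rare contexts are not crowded out.
    sample = []
    for _, members in sorted(groups.items()):
        rng.shuffle(members)
        sample.extend(members[:per_group])
    return sample
```

Compared with a uniform random draw, this deliberately over-represents rare task and user combinations, which is the point of a purposive sample.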
Explore
Once you have a sample, begin open coding. Read the traces qualitatively, one by one, and assign emergent labels as you go — not from a prior codebook, but from what you observe. Labels like “spurious confidence”, “context drop”, or “sycophantic reversal” come from the data, not from you.
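This process follows the grounded-theory sequence of open coding, axial coding, and selective coding. Open coding attaches labels to individual observations, axial coding relates those labels to one another and groups them into categories, and selective coding organises the categories around the few core themes that explain most of what you see. Each pass deepens and organises the category structure.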
Iterate
With an initial set of categories in hand, go back to sampling — but now with a different purpose. Theoretical sampling means deliberately seeking traces that challenge your emerging theory: cases where the agent performs well when you expected failure, or fails in ways your current categories do not capture. Compare each new trace against all prior ones. Refine, split, merge, or discard categories as needed. The point is not to defend the original taxonomy; the point is to make the taxonomy survive contact with more data.
Stop — theoretical saturation
Keep sampling until new traces stop producing new categories. When your taxonomy holds across new data without needing revision, you have reached theoretical saturation. At that point the qualitative phase is complete, because the model of failure is no longer changing in any meaningful way.
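A rough way to watch for saturation is to count how many previously unseen codes each new batch of traces introduces. The sketch below assumes each annotated trace is represented as a set of code names, and it uses an assumed stopping rule (no new codes for the last `patience` batches). It only tracks new labels; it cannot tell you whether existing categories still needed splitting or merging, so treat it as a signal rather than a verdict.

```python
def new_codes_per_batch(batches):
    """Return how many previously unseen codes each batch introduced.

    `batches` is a list of batches; each batch is a list of sets of codes,
    one set per annotated trace.
    """
    seen = set()
    counts = []
    for batch in batches:
        batch_codes = set().union(*batch)
        counts.append(len(batch_codes - seen))
        seen |= batch_codes
    return counts

def is_saturated(counts, patience=2):
    """Assumed rule: saturated when the last `patience` batches added no new codes."""
    return len(counts) >= patience and all(c == 0 for c in counts[-patience:])

batches = [
    [{"context drop"}, {"spurious confidence"}],
    [{"context drop", "sycophantic reversal"}],
    [{"context drop"}],
    [{"spurious confidence"}],
]
counts = new_codes_per_batch(batches)
print(counts, is_saturated(counts))  # [2, 1, 0, 0] True
```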
The emergent theory
As the qualitative phase moves along, you will end up with what is usually called an emergent theory. In practical terms, that is your codebook, the relationships between the codes, and the categories that keep surviving iteration. This is the failure taxonomy you actually trust, and it becomes the basis for everything that follows.
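If it helps to make this concrete, the codebook can be kept as a small data structure: codes with definitions and example traces, plus named relationships between codes. This is only a sketch. The class names, the `(source, relation, target)` triple format, and the `merge` helper are assumptions chosen for illustration, not a standard grounded-theory format.

```python
from dataclasses import dataclass, field

@dataclass
class Code:
    name: str
    definition: str
    examples: list[str] = field(default_factory=list)  # ids of supporting traces

@dataclass
class Codebook:
    codes: dict[str, Code] = field(default_factory=dict)
    # Relationships between codes as (source, relation, target) triples.
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add(self, code: Code) -> None:
        self.codes[code.name] = code

    def relate(self, source: str, relation: str, target: str) -> None:
        if source not in self.codes or target not in self.codes:
            raise KeyError("both codes must exist before they can be related")
        self.relations.append((source, relation, target))

    def merge(self, keep: str, drop: str) -> None:
        """Fold `drop` into `keep`, carrying over its examples and relations."""
        dropped = self.codes.pop(drop)
        self.codes[keep].examples.extend(dropped.examples)
        renamed = [
            (keep if s == drop else s, r, keep if t == drop else t)
            for s, r, t in self.relations
        ]
        # Drop relations that now point from a code to itself.
        self.relations = [(s, r, t) for s, r, t in renamed if s != t]
```

Keeping example trace ids on each code means every category stays traceable to the observations that produced it.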
The quantitative phase
You have now identified your failure modes, you understand why and how they occur, and you have reached saturation. At this point your annotated traces are already useful on their own: you can study frequency, distribution, and co-occurrence patterns across categories. The qualitative phase gives you structure; the quantitative phase gives you scale. If you want evaluation to become a repeatable practice rather than a one-off research exercise, you need to turn the theory into automated scorers.
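Frequency and co-occurrence can be computed directly from the annotations. A minimal sketch, assuming the annotations are stored as a mapping from a trace id to the set of category labels a human assigned (the trace ids and data here are made up):

```python
from collections import Counter
from itertools import combinations

def category_stats(annotations):
    """Count how often each category appears and how often pairs appear together.

    `annotations` maps trace id -> set of category labels.
    """
    frequency = Counter()
    co_occurrence = Counter()
    for labels in annotations.values():
        frequency.update(labels)
        # Sort so that each unordered pair is always counted under the same key.
        co_occurrence.update(combinations(sorted(labels), 2))
    return frequency, co_occurrence

annotations = {
    "t1": {"context drop", "spurious confidence"},
    "t2": {"context drop"},
    "t3": {"context drop", "spurious confidence", "sycophantic reversal"},
}
frequency, co_occurrence = category_stats(annotations)
print(frequency.most_common(1))  # [('context drop', 3)]
```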
Operationalize
You now have a grounded failure taxonomy and a set of annotated traces. The next step is to turn that taxonomy into judges: LLM-based scorers that can assess new traces without human annotation.
For each qualitative category, write a classification prompt that describes the category precisely: what it looks like in practice, what conditions trigger it, and what distinguishes it from adjacent failure modes. Because the category came from the data, the prompt is grounded in reality — not in your assumptions about how the agent might fail.
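As an illustration, a judge prompt can be assembled from the codebook entry itself. The template wording, the category fields (`name`, `definition`, `triggers`, `adjacent`), and the JSON answer format below are all assumptions for the sake of the sketch. The call to an actual model is deliberately left out, since it depends on your provider.

```python
import json

PROMPT_TEMPLATE = """You are labelling one agent trace for a single failure category.

Category: {name}
Definition: {definition}
Typical triggers: {triggers}
Do not confuse with: {adjacent}

Trace:
{trace}

Answer with JSON only: {{"label": true or false, "evidence": "<short quote from the trace>"}}"""

def build_judge_prompt(category, trace_text):
    """Fill the template from a codebook entry (a dict with the fields used below)."""
    return PROMPT_TEMPLATE.format(
        name=category["name"],
        definition=category["definition"],
        triggers="; ".join(category["triggers"]),
        adjacent="; ".join(category["adjacent"]),
        trace=trace_text,
    )

def parse_verdict(raw_reply):
    """Return the boolean label, or raise ValueError if the reply is malformed."""
    data = json.loads(raw_reply)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict) or not isinstance(data.get("label"), bool):
        raise ValueError("judge reply must be a JSON object with a boolean 'label'")
    return data["label"]

# Illustrative codebook entry; the wording is hypothetical.
category = {
    "name": "sycophantic reversal",
    "definition": "The agent abandons a correct answer after the user pushes back without new evidence.",
    "triggers": ["user expresses disagreement", "user repeats the request more forcefully"],
    "adjacent": ["legitimate correction after the user supplies new facts"],
}
prompt = build_judge_prompt(category, "user: are you sure?\nagent: you are right, I was wrong.")
```

Asking for a quoted piece of evidence alongside the label makes disagreements with human annotators easier to inspect.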
Now begins the inner training loop: use your annotated traces to tune and evaluate the judge prompts themselves. Split the annotated traces into a development set that you use while revising prompts and a holdout set that you use only for measurement. Run each judge against the holdout set and calculate accuracy and F1 score against the human labels. This tells you how well the judge tracks human judgment. If agreement is low, the judge prompt is misaligned with the category; revise and re-test. Repeat until judge performance plateaus.
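The comparison against human labels can be computed by hand for a single binary category. A minimal sketch, assuming human and judge labels are parallel lists of booleans and the human label is treated as ground truth:

```python
def judge_agreement(human, judge):
    """Accuracy, precision, recall, and F1 for one binary category."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")

    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    tn = len(human) - tp - fp - fn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(human),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

human = [True, True, False, False, True]
judge = [True, False, False, True, True]
scores = judge_agreement(human, judge)
print(scores["accuracy"])  # 0.6
```

F1 matters most for rare categories: a judge that always answers "no" can score high accuracy on a failure mode that occurs in a small share of traces, while its F1 is zero.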
Validate
Once your judges are stable, the outer training loop begins: use them to continuously evaluate and improve your agent. As you collect new traces from production, run them through your judges and analyze the results. Look for patterns in failure categories. Those patterns become the basis for agent refinements, whether that is a fix to your prompts, a change to your tool selection, or a new knowledge integration.
The holdout set is not a one-time artifact. As you continue to collect traces, periodically re-annotate a sample and re-evaluate your judges against it. This keeps both loops honest: it guards against Goodhart’s Law (once a metric becomes the target, it tends to drift away from the behaviour it was meant to measure) and catches the moment when new failure modes emerge that your taxonomy does not yet capture.
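A periodic review round can be reduced to two questions: do the judges still agree with fresh human labels, and did the annotators use any label the codebook does not contain? The sketch below is one possible shape for that check. The exact-match agreement measure and the 0.8 threshold are assumptions, not recommendations.

```python
def review_round(reannotated, judge_labels, known_categories, min_agreement=0.8):
    """Compare fresh human labels with judge output for one review round.

    `reannotated` and `judge_labels` map trace id -> set of category labels.
    """
    if not reannotated:
        raise ValueError("need at least one re-annotated trace")

    # A trace counts as a match when the judges reproduce exactly the human
    # labels that belong to the current codebook.
    matches = sum(
        (labels & known_categories) == judge_labels.get(trace_id, set())
        for trace_id, labels in reannotated.items()
    )
    agreement = matches / len(reannotated)

    # Human labels outside the codebook hint at failure modes the taxonomy lacks.
    unseen = set().union(*reannotated.values()) - known_categories
    return {
        "agreement": agreement,
        "judge_drift": agreement < min_agreement,
        "new_categories": sorted(unseen),
    }

known = {"context drop", "spurious confidence"}
reannotated = {"t1": {"context drop"}, "t2": set(), "t3": {"tool loop"}, "t4": {"spurious confidence"}}
judged = {"t1": {"context drop"}, "t2": set(), "t3": set(), "t4": set()}
report = review_round(reannotated, judged, known)
print(report["new_categories"])  # ['tool loop']
```

A non-empty `new_categories` list is a cue to return to the qualitative phase, while `judge_drift` points at the judge prompts.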
The result is a framework that is transparent and alive — every scorer maps back to a named, documented qualitative category, and the whole system stays aligned with your actual traces as behavior evolves.
Conclusion
Grounded theory is a practical way to design evaluations for agentic systems when the most important signals are qualitative and contextual. It can feel heavy at first, but the process becomes much more intuitive once you have coded real traces, refined categories, and seen the taxonomy stabilize.
It is also not a one-size-fits-all method, and it does not need to be. In a healthy evaluation stack, simple deterministic checks usually form the outer layer: schema checks, rule-based assertions, and regex-style validations. Grounded-theory-derived judges sit further in, where behavior depends on context, judgment, and nuance. Used together, these layers give you both breadth and depth.
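For contrast, the deterministic outer layer can be as plain as the sketch below. The expected output shape (a JSON object with `answer` and `sources` keys) and the specific regex are hypothetical examples of schema checks, rule-based assertions, and pattern validations; substitute whatever your agent is actually supposed to emit.

```python
import json
import re

# Hypothetical rule: answers must not contain anything shaped like a US SSN.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def deterministic_checks(output_text, required_keys=("answer", "sources")):
    """Return the names of failed checks; an empty list means all checks passed."""
    try:
        payload = json.loads(output_text)
    except json.JSONDecodeError:
        return ["not_json"]
    if not isinstance(payload, dict):
        return ["not_object"]

    failures = []
    missing = [key for key in required_keys if key not in payload]
    if missing:
        failures.append("missing_keys:" + ",".join(missing))

    answer = payload.get("answer", "")
    if not isinstance(answer, str) or not answer.strip():
        failures.append("empty_answer")
    elif SSN_LIKE.search(answer):
        failures.append("ssn_like_pattern")
    return failures

print(deterministic_checks('{"answer": "Paris", "sources": []}'))  # []
```

Checks like these are cheap enough to run on every trace, which leaves the LLM judges for the outputs that are structurally valid but may still be wrong in context.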
Continued reading and next steps
If you want to read more about evaluations, I can recommend the following sources (in no particular order). In the next post I will discuss tooling and go deeper into the technical side of evaluations.