Mert Cemri, Melissa Pan, Ion Stoica and the MAST Team (UC Berkeley)
Saurabh Jha, Rohan Arora, Daby Sow, Nicholas Fuller (IBM Research)
🗓️ Posted: December 19, 2025
Agentic LLMs are increasingly adopted for real-world IT operations tasks such as triaging incidents, querying logs and metrics, and generating Kubernetes actions. However, evaluating these agentic systems is hard. Existing benchmarks such as ITBench typically report a single number (e.g., success rate), which is not enough to understand where these systems fail or how to fix them. In this post, we address this gap by using MAST (Multi-Agent System Failure Taxonomy) to turn ITBench execution traces from SRE scenarios into structured failure signatures that show not only whether a run failed, but also how and why it failed, thus providing insight into how to fix it.
- Beyond Accuracy: Success rates on ITBench (SRE, Security, and FinOps tasks) only tell you whether an agent failed. MAST reveals how it failed.
- The "Isolated" vs. "Cascading" Divide: Our analysis identifies a Failure Complexity Hierarchy. Frontier models like Gemini-3-Flash exhibit "Isolated Failures" (2.6 failure modes/trace), typically failing at a single, discrete bottleneck. In contrast, GPT-OSS-120B suffers from "Cascading Collapse" (5.3 failure modes/trace), where one minor reasoning mismatch triggers a compounding, systemic breakdown.
- Fatal vs. Non-Fatal (Benign) Failure Modes: We separate fatal failure modes, such as agents not knowing when to stop or reasoning-action misalignment, from benign, non-fatal failure modes, such as messy behavior that can still lead to success.
- Takeaways from the analysis:
- Don’t let the LLM “declare success”: Verification failures are near-universal (FM-3.3). Require concrete, external tool evidence (alert cleared, metric thresholds, k8s state) before marking a ticket as resolved.
- Termination confusion (FM-3.1/1.5) remains a critical bottleneck: its high incidence shows that agents often lose track of their own macro-state. Rather than asking the LLM to "decide" when it is finished, offload the iterative control flow and exit conditions to a deterministic, code-based state machine (see the sketch after this list). The system infrastructure then governs task completion based on objective evidence, rather than relying on the model's inconsistent internal state.
- Break the cascade: implement aggressive context hygiene and consistency checks to ensure that intermediate reasoning mismatches do not poison the long-horizon task history.
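The first two takeaways can be wired directly into the agent loop. Below is a minimal sketch for a Kubernetes SRE scenario, not the ITBench harness or any particular agent framework; `agent_step`, `incident`, and the resolution check are hypothetical placeholders. The point is the control flow: the loop and an external check decide when the task is done, not the model.

```python
# Minimal sketch (assumed names, not the ITBench harness): deterministic
# termination driven by external Kubernetes evidence and a hard step budget.
import subprocess

MAX_STEPS = 20  # hard budget: breaks non-termination loops (FM-1.5 / FM-3.1)

def pods_running(namespace: str) -> bool:
    """External evidence of resolution: every pod in the namespace is Running."""
    phases = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-o", "jsonpath={.items[*].status.phase}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return bool(phases) and all(p == "Running" for p in phases)

def run_incident(agent_step, incident) -> str:
    for _ in range(MAX_STEPS):
        agent_step(incident)                  # the LLM proposes and executes one action
        if pods_running(incident.namespace):  # code declares success, not the model (FM-3.3)
            return "resolved"
    return "escalate_to_human"                # deterministic exit, no "I think I'm done"
```

A production gate would check the scenario's actual success criteria (alert cleared, SLO metrics back in range) rather than pod phase, but the structure stays the same.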
“All successful tasks are alike; every unsuccessful task is failing in its own way.” (Berkeley ‘25 — after “Anna Karenina”, Lev Tolstoy)
The "Black Box" Problem of Agent Benchmarks
Benchmarks like ITBench are becoming the standard for measuring agentic performance in high-stakes IT automation tasks. In ITBench, agents act as Site Reliability Engineers (SREs) or Security Analysts tasked with diagnosing Kubernetes outages, patching vulnerabilities, or managing cloud costs in production environments.
These benchmarks use success rate as the main metric for evaluating agents. However, this metric is insufficient for engineering robust systems. Knowing that an agentic system achieves a 14% success rate on ITBench tells us that it usually fails, but not why: Did it fail because it forgot the context? Because it hallucinated a command? Or because it simply did not terminate?
Without a systematic way to diagnose these failures, developers are left guessing, often resorting to blind prompt tweaks that solve one problem only to create another.
MAST: A Diagnostic Tool for Agents

To provide a standard way to analyze the failure modes of complex agentic systems, we developed MAST (Multi-Agent System Failure Taxonomy). MAST opens up the opaque evaluation of these benchmarks and yields deeper insights. Derived from a rigorous analysis of over 1,600 traces across seven different frameworks, MAST provides a standardized taxonomy of agent failures.
MAST converts unstructured execution logs into structured "failure vectors" based on 14 distinct patterns across three key categories:
- FC1: System Design Issues (The "Skeleton")
- Failures here stem from the agent's architecture and role definition.
- Examples: FM-1.3 Step Repetition (looping), FM-1.4 Loss of Conversation History (memory leaks), FM-1.5 Unaware of Termination (failing to stop).
- FC2: Inter-Agent Misalignment (The "Communication")
- Failures arising during runtime from how agents talk to each other or the environment.
- Examples: FM-2.2 Fail to Ask for Clarification (assuming instead of asking), FM-2.3 Task Derailment (going off-topic).
- FC3: Task Verification (The "Quality Control")
- Failures in quality assurance of the agents' output.
- Examples: FM-3.1 Premature Termination (giving up too soon), FM-3.3 Incorrect Verification (hallucinating success).
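To make the notion of a "failure vector" concrete, here is a minimal sketch of one possible encoding. This is an illustration rather than the official MAST schema: the 5/6/3 split of failure modes over FC1–FC3 follows the taxonomy, but the class and function names are assumptions. Each trace maps to a binary vector over the 14 failure modes, from which per-model statistics such as the 2.6 vs. 5.3 failure-modes-per-trace figures above can be computed.

```python
# Minimal sketch of a failure-vector encoding; illustrative names only.
from dataclasses import dataclass, field

MAST_FAILURE_MODES = [
    "FM-1.1", "FM-1.2", "FM-1.3", "FM-1.4", "FM-1.5",            # FC1: system design
    "FM-2.1", "FM-2.2", "FM-2.3", "FM-2.4", "FM-2.5", "FM-2.6",  # FC2: inter-agent misalignment
    "FM-3.1", "FM-3.2", "FM-3.3",                                 # FC3: task verification
]

@dataclass
class FailureSignature:
    trace_id: str
    flags: dict = field(default_factory=lambda: {fm: 0 for fm in MAST_FAILURE_MODES})

    def mark(self, fm: str) -> None:
        self.flags[fm] = 1            # an annotator (human or LLM judge) observed this mode

    def count(self) -> int:
        return sum(self.flags.values())

def mean_failure_modes(signatures) -> float:
    """Average number of distinct failure modes per trace (e.g., 2.6 vs. 5.3)."""
    return sum(s.count() for s in signatures) / len(signatures)
```

Aggregating these vectors per model is what separates the "isolated" from the "cascading" failure profiles described above.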

The Experiment: Diagnosing ITBench Agents