Mert Cemri, Melissa Pan, Ion Stoica and the MAST Team (UC Berkeley)

Saurabh Jha, Rohan Arora, Daby Sow, Nicholas Fuller (IBM Research)

🗓️ Posted: December 19, 2025

<aside> 🤖

Agentic LLMs are increasingly adopted for real-world IT tasks such as triaging incidents, querying logs and metrics, and generating Kubernetes actions. However, evaluating these agentic systems is hard. Existing benchmarks, such as ITBench, typically report just a single number (e.g., success rate), which is insufficient for understanding where these systems fail and how to fix them. In this post, we address this challenge by using MAST (Multi-Agent System Failure Taxonomy) to turn ITBench execution traces from SRE scenarios into structured failure signatures that show not only whether a run failed, but also how and why it failed, providing insight into how to fix it.


“All successful tasks are alike; every unsuccessful task fails in its own way.” (Berkeley ‘25 — after “Anna Karenina”, Lev Tolstoy)

The "Black Box" Problem of Agent Benchmarks

Benchmarks like ITBench are becoming the standard for measuring agentic performance in high-stakes IT automation tasks. In ITBench, agents act as Site Reliability Engineers (SREs) or Security Analysts tasked with diagnosing Kubernetes outages, patching vulnerabilities, or managing cloud costs in production environments.

These benchmarks use success rate as the main metric for evaluating agents. However, this metric is insufficient for engineering robust systems. Knowing that an agentic system achieves a 14% success rate on ITBench tells us how often it fails, but not why: Did it fail because it forgot the context? Because it hallucinated a command? Or because it simply never terminated?

Without a systematic way to diagnose these failures, developers are left guessing, often resorting to blind prompt tweaks that solve one problem only to create another.

MAST: A Diagnostic Tool for Agents


We developed MAST (Multi-Agent System Failure Taxonomy) as a new standard for analyzing the failure modes of complex agentic systems. Derived from a rigorous analysis of over 1,600 traces across seven different frameworks, MAST opens up the otherwise opaque, single-number evaluations of these benchmarks by providing a standardized taxonomy of agent failures.

MAST converts unstructured execution logs into structured "failure vectors" based on 14 distinct patterns across three key categories:

[Figure: the 14 MAST failure modes grouped into three categories]
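To make this concrete, here is a minimal sketch of what such a failure vector could look like in code. The mode names, the `FailureSignature` schema, and the `aggregate` helper below are illustrative assumptions for this post, not the exact MAST identifiers or any official ITBench/MAST API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative failure-mode labels; the real taxonomy defines 14 modes
# across three categories (this is a hypothetical subset for the sketch).
EXAMPLE_FAILURE_MODES = [
    "loss_of_context",        # agent forgot earlier conversation state
    "hallucinated_command",   # agent invented a non-existent CLI/K8s action
    "premature_termination",  # agent stopped before the task was done
]

@dataclass
class FailureSignature:
    """Structured summary of one benchmark run (hypothetical schema)."""
    scenario_id: str
    success: bool
    # Binary failure vector: mode name -> whether it was observed in the trace.
    modes: Dict[str, bool] = field(default_factory=dict)

def aggregate(signatures: List[FailureSignature]) -> Dict[str, float]:
    """Fraction of failed runs that exhibit each failure mode."""
    failed = [s for s in signatures if not s.success]
    if not failed:
        return {m: 0.0 for m in EXAMPLE_FAILURE_MODES}
    return {
        m: sum(s.modes.get(m, False) for s in failed) / len(failed)
        for m in EXAMPLE_FAILURE_MODES
    }

# Usage: two hypothetical SRE runs annotated with MAST-style labels.
runs = [
    FailureSignature("sre-incident-03", False,
                     {"loss_of_context": True, "premature_termination": True}),
    FailureSignature("sre-incident-07", False,
                     {"hallucinated_command": True}),
]
print(aggregate(runs))
```

Representing each run as a binary vector over failure modes makes it easy to compare systems, spot recurring failure signatures across scenarios, and see which fix is likely to pay off most.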


The Experiment: Diagnosing ITBench Agents