AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
- URL: http://arxiv.org/abs/2602.02475v1
- Date: Mon, 02 Feb 2026 18:54:07 GMT
- Title: AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
- Authors: Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, Chetan Bansal,
- Abstract summary: We release a benchmark of 115 failed trajectories spanning structured API, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory derived, cross-domain failure taxonomy. We present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory.
- Score: 9.61742219198197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and releasing a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.
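The abstract describes a three-stage diagnostic loop: synthesize constraints from the task, check each trajectory step against them to build an auditable violation log, then have an LLM-based judge localize the critical step and failure category from that log. A minimal sketch of such a loop is shown below; the dataclasses, the constraint representation, and the callable signatures are illustrative assumptions made for exposition, not AGENTRX's published interface.

```python
# Illustrative sketch only: class names, the constraint format, and the judge
# interface below are assumptions for exposition, not AGENTRX's actual API.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    task: str
    steps: List[dict]  # e.g. {"thought": ..., "tool_call": ..., "observation": ...}


@dataclass
class Violation:
    step_index: int
    constraint: str
    evidence: str


@dataclass
class ValidationLog:
    violations: List[Violation] = field(default_factory=list)


def diagnose(
    trajectory: Trajectory,
    synthesize_constraints: Callable[[str], List[str]],
    check: Callable[[dict, str], Tuple[bool, str]],
    judge: Callable[[Trajectory, ValidationLog], Tuple[int, str]],
) -> Tuple[int, str]:
    """Hypothetical AGENTRX-style loop: derive constraints from the task,
    evaluate every step against them, record violations with evidence,
    then let an LLM judge localize the critical step and failure category."""
    constraints = synthesize_constraints(trajectory.task)  # e.g. "refund requires prior identity check"
    log = ValidationLog()
    for i, step in enumerate(trajectory.steps):
        for constraint in constraints:
            ok, evidence = check(step, constraint)
            if not ok:
                log.violations.append(Violation(i, constraint, evidence))
    # The judge sees the trajectory plus the explicit violation log and
    # returns (critical_step_index, failure_category).
    return judge(trajectory, log)
```

Keeping the violation log as explicit data rather than free-form judge reasoning is what would make such a diagnosis auditable in the sense the abstract claims.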
Related papers
- What Makes a Good LLM Agent for Real-world Penetration Testing? [37.56537537883771]
We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. We show that Type B failures share a root cause that is largely invariant to the underlying LLM: agents lack real-time task difficulty estimation. We present Excalibur, a penetration testing agent that couples strong tooling with difficulty-aware planning.
arXiv Detail & Related papers (2026-02-19T18:42:40Z)
- Execution-State-Aware LLM Reasoning for Automated Proof-of-Vulnerability Generation [36.950993500170014]
We present DrillAgent, an agentic framework that reformulates PoV generation as an iterative hypothesis-verification-refinement process. We evaluate DrillAgent on SEC-bench, a large-scale benchmark of real-world C/C++ vulnerabilities.
arXiv Detail & Related papers (2026-02-14T03:17:27Z)
- The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check [54.08619694620588]
We present a comprehensive evaluation of dLLMs across two distinct agentic paradigms: Embodied Agents and Tool-Calling Agents. Our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones.
arXiv Detail & Related papers (2026-01-19T11:45:39Z)
- AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering [8.201374511929538]
AgentDevel is a release engineering pipeline that iteratively runs the current agent. It produces implementation-blind, symptom-level quality signals from execution traces. It aggregates dominant symptom patterns and produces auditable engineering specifications.
arXiv Detail & Related papers (2026-01-08T05:49:01Z)
- DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debugging framework for large language model (LLM)-based multi-agent systems. It augments hypothesis generation with active verification through targeted interventions. DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z)
- Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
- Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent errors in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
- Detecting Pipeline Failures through Fine-Grained Analysis of Web Agents [0.48156730450374763]
This work analyzes existing benchmarks and highlights the lack of fine-grained diagnostic tools. We propose a modular evaluation framework that decomposes agent pipelines into interpretable stages for detailed error analysis.
arXiv Detail & Related papers (2025-09-17T19:34:49Z)
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
- SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models. (A minimal sketch of such a function graph appears after this entry.)
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
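The SOPBench entry above describes compiling each service-specific SOP into a directed graph of executable functions that the agent must invoke in an order consistent with the SOP. The sketch below illustrates that idea under stated assumptions: the toy SOP, the function names, and the graph encoding are invented for exposition and are not drawn from the paper.

```python
# Illustrative only: the SOP text, function names, and graph encoding below are
# assumptions for exposition, not SOPBench's actual representation.

# Toy SOP: "Verify the customer's identity before issuing a refund,
# then log the case once the refund has been processed."
def verify_identity(ctx):
    return ctx.get("customer_id") is not None

def issue_refund(ctx):
    ctx["refunded"] = True

def log_case(ctx):
    ctx["logged"] = True

# Directed graph of executable functions: each node maps to the functions
# that must already have been called before it is allowed to run.
SOP_GRAPH = {
    verify_identity: [],
    issue_refund: [verify_identity],
    log_case: [issue_refund],
}

def run_agent_step(fn, completed, ctx):
    """Permit a call only if every prerequisite in the graph has already run."""
    if any(dep not in completed for dep in SOP_GRAPH[fn]):
        raise RuntimeError(f"SOP violation: {fn.__name__} called out of order")
    fn(ctx)
    completed.add(fn)

if __name__ == "__main__":
    ctx, completed = {"customer_id": 42}, set()
    for step in (verify_identity, issue_refund, log_case):  # a compliant trajectory
        run_agent_step(step, completed, ctx)
    print(sorted(f.__name__ for f in completed))  # ['issue_refund', 'log_case', 'verify_identity']
```

Encoding the SOP as a prerequisite graph makes "following the procedure" a checkable property of the agent's call sequence rather than a judgment about its natural-language output.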
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.