AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?
- URL: http://arxiv.org/abs/2509.03312v2
- Date: Thu, 04 Sep 2025 17:49:20 GMT
- Title: AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?
- Authors: Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan,
- Abstract summary: AgenTracer-8B is a lightweight failure tracer trained with multi-granular reinforcement learning.<n>On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%.<n>AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains.
- Score: 71.21547572568655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
Related papers
- AgentStepper: Interactive Debugging of Software Development Agents [14.265317773238529]
We introduce AgentStepper, the first interactive debugger for software engineering agents.<n>AgentStepper represents trajectories as structured conversations among an LLM, the agent program, and tools.<n>It supports breakpoints, stepwise execution, and live editing of prompts and tool invocations, while capturing and displaying intermediate repository-level code changes.
arXiv Detail & Related papers (2026-02-06T10:44:09Z) - The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check [54.08619694620588]
We present a comprehensive evaluation of dLLMs across two distinct agentic paradigms: Embodied Agents and Tool-Calling Agents.<n>Our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones.
arXiv Detail & Related papers (2026-01-19T11:45:39Z) - DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debug framework for large language model (LLM)-based multi-agent systems.<n>It augments hypothesis generation with active verification through targeted interventions.<n>DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z) - AgentAsk: Multi-Agent Systems Need to Ask [26.13279490836716]
Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving capabilities through collaborative division of labor.<n>We propose AgentAsk, a lightweight and plug-and-play clarification module that treats every inter-agent message as a potential failure point and inserts minimally necessary questions to arrest error propagation.<n>AgentAsk consistently improves accuracy and robustness over public multi-agent implementations while keeping overhead minimal, with latency and extra cost all less than 5%.
arXiv Detail & Related papers (2025-10-08T22:36:05Z) - Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks.<n>They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions.<n>Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way.<n>We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z) - AgentSight: System-Level Observability for AI Agents Using eBPF [10.37440633887049]
Existing tools observe either an agent's high-level intent (via LLM prompts) or its low-level actions (e.g., system calls) but cannot correlate these two views.<n>We introduce AgentSight, an AgentOps observability framework that bridges this semantic gap using a hybrid approach.<n>AgentSight intercepts TLS-encrypted LLM traffic to extract semantic intent, monitors kernel events to observe system-wide effects, and causally correlates these two streams across process boundaries.
arXiv Detail & Related papers (2025-08-02T01:43:39Z) - Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive.<n>We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons.<n>The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z) - MALT: Improving Reasoning with Multi-Agent LLM Training [66.9481561915524]
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps.<n>On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40% respectively.
arXiv Detail & Related papers (2024-12-02T19:30:36Z) - AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems [43.333567687032904]
AgentMonitor is a framework that integrates at the agent level to capture inputs and outputs, transforming them into statistics for training a regression model to predict task performance.
It can further apply real-time corrections to address security risks posed by malicious agents, mitigating negative impacts and enhancing MAS security.
arXiv Detail & Related papers (2024-08-27T11:24:38Z) - DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language models (LLMs) agent and case-based reasoning (CBR)
In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle.
In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.