Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
- URL: http://arxiv.org/abs/2505.00212v1
- Date: Wed, 30 Apr 2025 23:09:44 GMT
- Title: Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
- Authors: Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu
- Abstract summary: Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
- Score: 50.29939179830491
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Failure attribution in LLM multi-agent systems, i.e., identifying the agent and step responsible for task failures, provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using the Who&When dataset, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
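The authors' actual code is at the repository linked above. Purely as an illustration of what an automated attribution pass over a failure log could look like, the sketch below prompts an LLM judge with the full multi-agent transcript and parses out a blamed agent and decisive step; every name in it (Step, attribute_failure, the stub judge) is hypothetical and not taken from the Who&When implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Step:
    index: int    # position of the step in the failure log
    agent: str    # name of the agent that produced this step
    content: str  # the agent's message or action at this step

def attribute_failure(log: List[Step], task: str,
                      judge: Callable[[str], str]) -> Tuple[str, int]:
    """Ask an LLM judge to name the failure-responsible agent and the
    decisive error step, given the entire failure log in one prompt."""
    transcript = "\n".join(f"[step {s.index}] {s.agent}: {s.content}" for s in log)
    prompt = (
        f"Task: {task}\n"
        f"The multi-agent run below failed.\n{transcript}\n\n"
        "Reply exactly as: agent=<name>; step=<index> for the agent and "
        "step most responsible for the failure."
    )
    reply = judge(prompt)
    # Light parsing of the judge's structured reply.
    agent = reply.split("agent=")[1].split(";")[0].strip()
    step = int(reply.split("step=")[1].strip())
    return agent, step

if __name__ == "__main__":
    # Toy log; a real setup would replace the stub judge with an LLM API call.
    log = [
        Step(0, "planner", "Plan: search the web, then summarize."),
        Step(1, "searcher", "Returned an unrelated page."),
        Step(2, "writer", "Wrote a summary based on the wrong page."),
    ]
    stub_judge = lambda _prompt: "agent=searcher; step=1"
    print(attribute_failure(log, "Summarize today's top AI news", stub_judge))
```

In practice the stub judge would be a real model call and the parsing would need to tolerate free-form replies; as the 53.5% / 14.2% numbers above suggest, a single-prompt pass like this is far from reliable.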
Related papers
- Why Do Multi-Agent LLM Systems Fail? [91.39266556855513]
We present MAST (Multi-Agent System Failure Taxonomy), the first empirically grounded taxonomy designed to understand MAS failures. We analyze seven popular MAS frameworks across over 200 tasks, involving six expert human annotators. We identify 14 unique failure modes, organized into 3 overarching categories: (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification.
arXiv Detail & Related papers (2025-03-17T19:04:38Z)
- Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents [31.126001253902416]
We present the first study focused on identifying and detecting defects in LLM Agents. We collected and analyzed 6,854 relevant posts from StackOverflow to define 8 types of agent defects. Our results show that Agentable achieved an overall accuracy of 88.79% and a recall rate of 91.03%.
arXiv Detail & Related papers (2024-12-24T11:54:14Z)
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.
We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.
We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)
- On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents [58.79302663733703]
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents.
However, the impact of clumsy or even malicious agents on the overall performance of the system remains underexplored.
This paper investigates the resilience of various system structures under faulty agents.
arXiv Detail & Related papers (2024-08-02T03:25:20Z)
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
- AgentFL: Scaling LLM-based Fault Localization to Project-Level Context [11.147750199280813]
This paper presents AgentFL, a multi-agent system based on ChatGPT for automated fault localization.
By simulating the behavior of a human developer, AgentFL models the FL task as a three-step process, which involves comprehension, navigation, and confirmation.
The evaluation on the widely used Defects4J-V1.2.0 benchmark shows that AgentFL can localize 157 out of 395 bugs within Top-1.
arXiv Detail & Related papers (2024-03-25T01:58:19Z)
- DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language model (LLM) agents and case-based reasoning (CBR).
In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle.
In the deployment stage, DS-Agent uses a simplified, low-resource CBR paradigm, significantly reducing the demand on the foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)
- Understanding the Weakness of Large Language Model Agents within a Complex Android Environment [21.278266207772756]
Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games.
LLMs face three primary challenges when applied to general-purpose software systems like operating systems.
These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system.
arXiv Detail & Related papers (2024-02-09T18:19:25Z)