Related papers: When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

URL: http://arxiv.org/abs/2601.15232v1
Date: Wed, 21 Jan 2026 18:13:10 GMT
Title: When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling
Authors: Niful Islam, Ragib Shahriar Ayon, Deepak George Thomas, Shibbir Ahmed, Mohammad Wardat,
Abstract summary: We present the first comprehensive study of bug types, root causes, and effects in LLM agent-based software.<n>We collected and analyzed 1,187 bug-related posts and code snippets from Stack Overflow, GitHub, and Hugging Face forums.<n>We built a ReAct agent named BugReAct, equipped with adequate external tools to determine whether it can detect and annotate the bugs in our dataset.
Score: 3.0882445204950013
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have revolutionized intelligent application development. While standalone LLMs cannot perform any actions, LLM agents address the limitation by integrating tools. However, debugging LLM agents is difficult and costly as the field is still in it's early stage and the community is underdeveloped. To understand the bugs encountered during agent development, we present the first comprehensive study of bug types, root causes, and effects in LLM agent-based software. We collected and analyzed 1,187 bug-related posts and code snippets from Stack Overflow, GitHub, and Hugging Face forums, focused on LLM agents built with seven widely used LLM frameworks as well as custom implementations. For a deeper analysis, we have also studied the component where the bug occurred, along with the programming language and framework. This study also investigates the feasibility of automating bug identification. For that, we have built a ReAct agent named BugReAct, equipped with adequate external tools to determine whether it can detect and annotate the bugs in our dataset. According to our study, we found that BugReAct equipped with Gemini 2.5 Flash achieved a remarkable performance in annotating bug characteristics with an average cost of 0.01 USD per post/code snippet.

Related papers

InspectCoder: Dynamic Analysis-Enabled Self Repair through interactive LLM-Debugger Collaboration [71.18377595277018]
Large Language Models (LLMs) frequently generate buggy code with complex logic errors that are challenging to diagnose.<n>We present InspectCoder, the first agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control.
arXiv Detail & Related papers (2025-10-21T06:26:29Z)
Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks.<n>They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions.<n>Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way.<n>We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs.<n>This paper investigates how programmers use Large Language Models to complement their own skills.<n>The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z)
PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing [34.768989900184636]
Bug fixing holds significant importance in software development and maintenance.<n>Recent research has made substantial strides in exploring the potential of large language models (LLMs) for automatically resolving software bugs.
arXiv Detail & Related papers (2025-01-27T15:43:04Z)
ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
TOOLSCAN is a new benchmark to identify error patterns in LLM output on tool-use tasks.<n>We show that even the most prominent LLMs exhibit these error patterns in their outputs.<n>Researchers can use these insights from TOOLSCAN to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
A Unified Debugging Approach via LLM-Based Multi-Agent Synergy [39.11825182386288]
FixAgent is an end-to-end framework for unified debug through multi-agent synergy. It significantly outperforms state-of-the-art repair methods, fixing 1.25$times$ to 2.56$times$ bugs on the repo-level benchmark, Defects4J.
arXiv Detail & Related papers (2024-04-26T04:55:35Z)
AgentFL: Scaling LLM-based Fault Localization to Project-Level Context [11.147750199280813]
This paper presents AgentFL, a multi-agent system based on ChatGPT for automated fault localization.<n>By simulating the behavior of a human developer, AgentFL models the FL task as a three-step process, which involves comprehension, navigation, and confirmation.<n>The evaluation on the widely used Defects4J-V1.2.0 benchmark shows that AgentFL can localize 157 out of 395 bugs within Top-1.
arXiv Detail & Related papers (2024-03-25T01:58:19Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs) It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction [12.851941377433285]
Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation. Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
arXiv Detail & Related papers (2023-11-08T08:42:30Z)
AgentBench: Evaluating LLMs as Agents [99.12825098528212]
Large Language Model (LLM) as agents has been widely acknowledged recently.<n>We present AgentBench, a benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.