Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
- URL: http://arxiv.org/abs/2602.09937v1
- Date: Tue, 10 Feb 2026 16:14:05 GMT
- Title: Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
- Authors: Taeyoon Kim, Woohyeok Park, Hoyeong Yun, Kyungyong Lee
- Abstract summary: Failures in large-scale cloud systems incur substantial financial losses. Recent efforts leverage Large Language Model (LLM) agents to automate Root Cause Analysis (RCA). This paper presents a process-level failure analysis of LLM-based RCA agents.
- Score: 1.0966260566122241
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final-answer correctness without revealing why the agent's reasoning failed. This paper presents a process-level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLM models, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.
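The classification step described in the abstract, labeling each of the 1,675 agent runs with pitfall types grouped into three categories, can be sketched as a simple tally. This is a minimal illustration only: the pitfall names below are invented for the example (apart from the two the abstract mentions), and the paper defines 12 concrete types.

```python
from collections import Counter

# Map each pitfall type to one of the three categories named in the abstract.
# Only the first two pitfall names appear in the abstract; the third is a
# hypothetical placeholder for illustration.
CATEGORIES = {
    "hallucinated_data_interpretation": "intra-agent reasoning",
    "incomplete_exploration": "agent-environment interaction",
    "lost_context_handoff": "inter-agent communication",
}

def summarize(runs: list[list[str]]) -> dict[str, Counter]:
    """Count observed pitfalls per category across all agent runs."""
    per_category: dict[str, Counter] = {}
    for pitfalls in runs:
        for p in pitfalls:
            cat = CATEGORIES[p]
            per_category.setdefault(cat, Counter())[p] += 1
    return per_category

# Three toy runs, each labeled with the pitfalls observed in its trace.
runs = [
    ["hallucinated_data_interpretation"],
    ["incomplete_exploration", "lost_context_handoff"],
    ["hallucinated_data_interpretation", "incomplete_exploration"],
]
print(summarize(runs))
```

Aggregating per category rather than per pitfall is what lets the paper compare failure prevalence across model tiers and attribute the dominant pitfalls to the shared agent architecture.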
Related papers
- MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems [38.44649280816596]
We propose MAS-FIRE, a systematic framework for fault injection and reliability evaluation of Multi-Agent Systems. We define a taxonomy of 15 fault types covering intra-agent cognitive errors and inter-agent coordination failures. Applying MAS-FIRE to three representative MAS architectures, we uncover a rich set of fault-tolerant behaviors.
arXiv Detail & Related papers (2026-02-23T13:47:43Z)
- Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis [5.532586951580959]
We present a focused empirical evaluation that isolates an LLM's reasoning behavior. We produce a labeled taxonomy of 16 common RCA reasoning failures and use an LLM-as-a-Judge for annotation.
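The LLM-as-a-Judge annotation mentioned above can be sketched as a labeling loop over reasoning traces. This sketch assumes a `judge` callable standing in for an LLM call; the label names and the stub judge are illustrative, not taken from the paper.

```python
# Fixed label set the judge must choose from; these names are hypothetical.
LABELS = ["stalled", "biased", "confused", "correct"]

def annotate(traces, judge):
    """Label each trace, falling back to 'unparsable' on invalid judge output."""
    out = []
    for trace in traces:
        label = judge(trace)
        out.append(label if label in LABELS else "unparsable")
    return out

# Stub judge for demonstration; a real system would prompt an LLM here.
demo_judge = lambda t: "stalled" if "no progress" in t else "correct"
print(annotate(["no progress after step 3", "root cause found"], demo_judge))
```

Constraining the judge to a fixed label set, with an explicit fallback for out-of-vocabulary responses, is a common way to keep LLM-based annotation auditable.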
arXiv Detail & Related papers (2026-01-29T18:23:26Z)
- The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check [54.08619694620588]
We present a comprehensive evaluation of dLLMs across two distinct agentic paradigms: Embodied Agents and Tool-Calling Agents. Our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones.
arXiv Detail & Related papers (2026-01-19T11:45:39Z)
- Current Agents Fail to Leverage World Model as Tool for Foresight [61.82522354207919]
Generative world models offer a promising remedy: agents could use them to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition.
arXiv Detail & Related papers (2026-01-07T13:15:23Z) - PublicAgent: Multi-Agent Design Principles From an LLM-Based Open Data Analysis Framework [5.863391019411233]
Large language models show promise for individual tasks, but end-to-end analytical workflows expose fundamental limitations. We present PublicAgent, a multi-agent framework that addresses these limitations through decomposition into specialized agents for intent clarification, dataset discovery, analysis, and reporting.
arXiv Detail & Related papers (2025-11-04T21:48:11Z) - Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z) - Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B [1.036334370262262]
This paper conducts a comparative red teaming analysis of GPT-OSS-20B, a 20-billion parameter open-source model. Our evaluation reveals fundamental differences between model-level and agentic-level vulnerability profiles. Agentic-level iterative attacks successfully compromise objectives that completely failed at the model level.
arXiv Detail & Related papers (2025-09-21T22:18:34Z) - Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning [49.31650627835956]
Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose compromise would most severely degrade overall performance. In this paper, we study this Vulnerable Agent Identification (VAI) problem in large-scale multi-agent reinforcement learning (MARL). Experiments show our method effectively identifies more vulnerable agents in large-scale MARL and rule-based systems, fooling the system into worse failures and learning a value function that reveals the vulnerability of each agent.
arXiv Detail & Related papers (2025-09-18T16:03:50Z) - Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks [8.218266805768687]
We present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. We evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. We develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation.
arXiv Detail & Related papers (2025-08-18T17:55:22Z) - Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems [0.0]
This report addresses the early stages of risk identification and analysis for multi-agent AI systems. We examine six critical failure modes: cascading reliability failures, inter-agent communication failures, monoculture collapse, conformity bias, deficient theory of mind, and mixed motive dynamics.
arXiv Detail & Related papers (2025-08-06T06:06:57Z) - Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z) - Collaborative Value Function Estimation Under Model Mismatch: A Federated Temporal Difference Analysis [55.13545823385091]
Federated reinforcement learning (FedRL) enables collaborative learning while preserving data privacy by preventing direct data exchange between agents. In real-world applications, each agent may experience slightly different transition dynamics, leading to inherent model mismatches. We show that even moderate levels of information sharing significantly mitigate environment-specific errors.
arXiv Detail & Related papers (2025-03-21T18:06:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information (including all listed content) and is not responsible for any consequences of its use.