Related papers: Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

URL: http://arxiv.org/abs/2601.22208v1
Date: Thu, 29 Jan 2026 18:23:26 GMT
Title: Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis
Authors: Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, Krzysztof Czarnecki,
Abstract summary: We present a focused empirical evaluation that isolates an LLM's reasoning behavior.<n>We produce a labeled taxonomy of 16 common RCA reasoning failures and use an LLM-as-a-Judge for annotation.
Score: 5.532586951580959
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts, particularly for multi-hop fault propagation, where symptoms appear far from their true causes. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance automated RCA. However, their practical value for RCA depends on the fidelity of reasoning and decision-making. Existing work relies on historical incident corpora, operates directly on high-volume telemetry beyond current LLM capacity, or embeds reasoning inside complex multi-agent pipelines -- conditions that obscure whether failures arise from reasoning itself or from peripheral design choices. We present a focused empirical evaluation that isolates an LLM's reasoning behavior. We design a controlled experimental framework that foregrounds the LLM by using a simplified experimental setting. We evaluate six LLMs under two agentic workflows (ReAct and Plan-and-Execute) and a non-agentic baseline on two real-world case studies (GAIA and OpenRCA). In total, we executed 48,000 simulated failure scenarios, totaling 228 days of execution time. We measure both root-cause accuracy and the quality of intermediate reasoning traces. We produce a labeled taxonomy of 16 common RCA reasoning failures and use an LLM-as-a-Judge for annotation. Our results clarify where current open-source LLMs succeed and fail in multi-hop RCA, quantify sensitivity to input data modalities, and identify reasoning failures that predict final correctness. Together, these contributions provide transparent and reproducible empirical results and a failure taxonomy to guide future work on reasoning-driven system diagnosis.

Related papers

IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery [61.15184885636171]
In the presence of confounding between an endogenous variable and the outcome, instrumental variables (IVs) are used to isolate the causal effect of the endogenous variable.<n>We investigate whether large language models (LLMs) can aid in this task.<n>We introduce IV Co-Scientist, a multi-agent system that proposes, critiques, and refines IVs for a given treatment-outcome pair.
arXiv Detail & Related papers (2026-02-08T12:28:29Z)
Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism [19.31110304702373]
SpecRCA is a speculative root cause analysis framework that adopts a textithypothesize-then-verify paradigm.<n>Preliminary experiments on the AIOps 2022 demonstrate that SpecRCA achieves superior accuracy and efficiency compared to existing approaches.
arXiv Detail & Related papers (2026-01-06T05:58:25Z)
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis? [9.394057684388027]
This paper introduces GALA, a novel framework for root cause analysis in microservice systems.<n> evaluated on an open-source benchmark, GALA achieves substantial improvements over state-of-the-art methods.<n>We show that GALA bridges the gap between automated failure diagnosis and practical incident resolution.
arXiv Detail & Related papers (2025-08-17T19:12:05Z)
The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach [2.4898626838193647]
Large language model (LLM) provides a new path for quickly locating and recovering from incidents.<n>Our method achieves a 49.29% to 128.35% improvement in root cause localization accuracy.
arXiv Detail & Related papers (2025-07-30T16:03:21Z)
Reasoning Language Models for Root Cause Analysis in 5G Wireless Networks [10.074110713679739]
Root Cause Analysis (RCA) in mobile networks remains a challenging task due to the need for interpretability, domain expertise, and causal reasoning.<n>We propose a lightweight framework that leverages Large Language Models (LLMs) for RCA.
arXiv Detail & Related papers (2025-07-29T16:21:42Z)
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities [21.42711537107199]
We study why Large Language Models (LLMs) perform sub-optimally in decision-making scenarios.<n>We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales.
arXiv Detail & Related papers (2025-04-22T17:57:14Z)
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
textbfR1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.<n>Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.<n>Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.<n>We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.<n>LLMs2 is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.