Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis
- URL: http://arxiv.org/abs/2502.08224v1
- Date: Wed, 12 Feb 2025 09:07:25 GMT
- Title: Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis
- Authors: Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, Gaogang Xie, Dan Pei
- Abstract summary: A contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for Root Cause Analysis (RCA). We propose Flow-of-Action, a pioneering Standard Operation Procedure (SOP) enhanced multi-agent system. Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.
- Score: 19.357332854860665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of microservices architecture, the occurrence of frequent incidents necessitates the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common that a serious incident can take several domain experts hours to identify the root cause. Consequently, a contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for RCA. Though the recent ReAct framework aligns well with the Site Reliability Engineers (SREs) for its thought-action-observation paradigm, its hallucinations often lead to irrelevant actions and directly affect subsequent results. Additionally, the complex and variable clues of the incident can overwhelm the model one step further. To confront these challenges, we propose Flow-of-Action, a pioneering Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system. By explicitly summarizing the diagnosis steps of SREs, SOP imposes constraints on LLMs at crucial junctures, guiding the RCA process towards the correct trajectory. To facilitate the rational and effective utilization of SOPs, we design an SOP-centric framework called SOP flow. SOP flow contains a series of tools, including one for finding relevant SOPs for incidents, another for automatically generating SOPs for incidents without relevant ones, and a tool for converting SOPs into code. This significantly alleviates the hallucination issues of ReAct in RCA tasks. We also design multiple auxiliary agents to assist the main agent by removing useless noise, narrowing the search space, and informing the main agent whether the RCA procedure can stop. Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.
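The SOP flow described in the abstract (find a relevant SOP, generate one when none matches, then convert the SOP into code) can be pictured with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `SOP`, `match_sop`, `generate_sop`, `sop_to_code`, and `sop_flow` are hypothetical, word-overlap (Jaccard) matching stands in for whatever retrieval the authors use, and a fixed checklist stands in for LLM-based SOP generation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SOP:
    """Hypothetical SOP record: a name plus ordered diagnosis steps."""
    name: str
    steps: list[str]


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def match_sop(incident: str, library: list[SOP], threshold: float = 0.2) -> Optional[SOP]:
    """Return the SOP whose name best overlaps the incident text, or None.

    Jaccard word overlap is an assumed stand-in for real SOP retrieval.
    """
    best, best_score = None, 0.0
    incident_tokens = _tokens(incident)
    for sop in library:
        name_tokens = _tokens(sop.name)
        union = incident_tokens | name_tokens
        score = len(incident_tokens & name_tokens) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = sop, score
    return best if best_score >= threshold else None


def generate_sop(incident: str) -> SOP:
    """Placeholder for LLM-based SOP generation: returns a generic checklist."""
    return SOP(
        name=f"generated: {incident}",
        steps=["collect metrics", "collect logs", "report findings"],
    )


def sop_to_code(sop: SOP, tools: dict[str, Callable[[], str]]) -> Callable[[], list[str]]:
    """Turn SOP steps into one callable that runs the tool mapped to each step."""
    def run() -> list[str]:
        observations = []
        for step in sop.steps:
            tool = tools.get(step)
            observations.append(tool() if tool else f"no tool for step: {step}")
        return observations
    return run


def sop_flow(incident: str, library: list[SOP], tools: dict[str, Callable[[], str]]):
    """Match an SOP, fall back to generation, then execute the SOP as code."""
    sop = match_sop(incident, library) or generate_sop(incident)
    return sop, sop_to_code(sop, tools)()


# Usage with made-up data.
library = [SOP("high cpu usage on pod", ["collect metrics", "check recent deploys"])]
tools = {"collect metrics": lambda: "cpu at 97%"}
matched, observations = sop_flow("high cpu usage", library, tools)
fallback, fallback_observations = sop_flow("disk full", library, tools)
```

Constraining the agent to the steps of a matched or generated SOP is the mechanism the abstract credits with reducing irrelevant actions; the sketch only shows the control flow, not the auxiliary agents or any LLM calls.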
Related papers
- The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach [2.4898626838193647]
Large language model (LLM) provides a new path for quickly locating and recovering from incidents. Our method achieves a 49.29% to 128.35% improvement in root cause localization accuracy.
arXiv Detail & Related papers (2025-07-30T16:03:21Z)
- DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router [57.28685457991806]
DeepSieve is an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design.
arXiv Detail & Related papers (2025-07-29T17:55:23Z)
- Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
- MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning [43.66966457772646]
MA-RAG orchestrates a collaborative set of specialized AI agents to tackle each stage of the RAG pipeline with task-aware reasoning. Our design allows fine-grained control over information flow without any model fine-tuning. This modular and reasoning-driven architecture enables MA-RAG to deliver robust, interpretable results.
arXiv Detail & Related papers (2025-05-26T15:05:18Z)
- ProRCA: A Causal Python Package for Actionable Root Cause Analysis in Real-world Business Scenarios [2.034531141644187]
We present a pathway-tracing package built on the DoWhy causal inference library.
Our method integrates conditional anomaly scoring, noise-based attribution, and depth-first path exploration to reveal multi-hop causal chains.
arXiv Detail & Related papers (2025-03-03T12:33:17Z)
- Causal Mean Field Multi-Agent Reinforcement Learning [10.767740092703777]
A framework named mean-field reinforcement learning (MFRL) could alleviate the scalability problem by employing the Mean Field Theory.
This framework lacks the ability to identify essential interactions under nonstationary environments.
We propose an algorithm called causal mean-field Q-learning (CMFQ) to address the scalability problem.
arXiv Detail & Related papers (2025-02-20T02:15:58Z)
- Self-Regulation and Requesting Interventions [63.5863047447313]
We propose an offline framework that trains a "helper" policy to request interventions. We score optimal intervention timing with PRMs and train the helper model on these labeled trajectories. This offline approach significantly reduces costly intervention calls during training.
arXiv Detail & Related papers (2025-02-07T00:06:17Z)
- Agent-Oriented Planning in Multi-Agent Systems [54.429028104022066]
We propose AOP, a novel framework for agent-oriented planning in multi-agent systems.
In this study, we identify three critical design principles of agent-oriented planning, including solvability, completeness, and non-redundancy.
Extensive experiments demonstrate the advancement of AOP in solving real-world problems compared to both single-agent systems and existing planning strategies for multi-agent systems.
arXiv Detail & Related papers (2024-10-03T04:07:51Z)
- Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, an LLM-based autonomous agent framework.
At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence.
We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z)
- Exploring LLM-based Agents for Root Cause Analysis [17.053079105858497]
Root cause analysis (RCA) is a critical part of the incident management process.
Large Language Models (LLMs) have been used to perform RCA, but are not able to collect additional diagnostic information.
We present an evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft.
arXiv Detail & Related papers (2024-03-07T00:44:01Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models [46.476439550746136]
Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently.
We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage.
Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools.
arXiv Detail & Related papers (2023-10-25T03:53:31Z)
- Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z)
- Fast Decomposition of Temporal Logic Specifications for Heterogeneous Teams [1.856334276134661]
We focus on decomposing large multi-agent path planning problems into smaller sub-problems that can be solved and executed independently.
The agents' missions are given as Capability Temporal Logic (CaTL) formulas, a fragment of signal temporal logic.
The approach we take is to decompose both the temporal logic specification and the team of agents.
arXiv Detail & Related papers (2020-09-30T18:04:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.