Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis
- URL: http://arxiv.org/abs/2502.08224v1
- Date: Wed, 12 Feb 2025 09:07:25 GMT
- Title: Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis
- Authors: Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, Gaogang Xie, Dan Pei,
- Abstract summary: A contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for Root Cause Analysis (RCA)
We propose Flow-of-Action, a pioneering Standard Operation Procedure ( SOP) enhanced multi-agent system.
Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.
- Score: 19.357332854860665
- License:
- Abstract: In the realm of microservices architecture, the occurrence of frequent incidents necessitates the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common that a serious incident can take several domain experts hours to identify the root cause. Consequently, a contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for RCA. Though the recent ReAct framework aligns well with the Site Reliability Engineers (SREs) for its thought-action-observation paradigm, its hallucinations often lead to irrelevant actions and directly affect subsequent results. Additionally, the complex and variable clues of the incident can overwhelm the model one step further. To confront these challenges, we propose Flow-of-Action, a pioneering Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system. By explicitly summarizing the diagnosis steps of SREs, SOP imposes constraints on LLMs at crucial junctures, guiding the RCA process towards the correct trajectory. To facilitate the rational and effective utilization of SOPs, we design an SOP-centric framework called SOP flow. SOP flow contains a series of tools, including one for finding relevant SOPs for incidents, another for automatically generating SOPs for incidents without relevant ones, and a tool for converting SOPs into code. This significantly alleviates the hallucination issues of ReAct in RCA tasks. We also design multiple auxiliary agents to assist the main agent by removing useless noise, narrowing the search space, and informing the main agent whether the RCA procedure can stop. Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.
Related papers
- Causal Mean Field Multi-Agent Reinforcement Learning [10.767740092703777]
A framework named mean-field reinforcement learning (MFRL) could alleviate the scalability problem by employing the Mean Field Theory.
This framework lacks the ability to identify essential interactions under nonstationary environments.
We propose an algorithm called causal mean-field Q-learning (CMFQ) to address the scalability problem.
arXiv Detail & Related papers (2025-02-20T02:15:58Z) - Self-Regulation and Requesting Interventions [63.5863047447313]
We propose an offline framework that trains a "helper" policy to request interventions.
We score optimal intervention timing with PRMs and train the helper model on these labeled trajectories.
This offline approach significantly reduces costly intervention calls during training.
arXiv Detail & Related papers (2025-02-07T00:06:17Z) - Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, a llm-based autonomous agent framework.
At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence.
We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z) - Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
We show that editing a small subset of parameters can effectively modulate specific behaviors of large language models (LLMs)
Our approach achieves reductions of up to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen.
arXiv Detail & Related papers (2024-07-11T17:52:03Z) - Exploring LLM-based Agents for Root Cause Analysis [17.053079105858497]
Root cause analysis (RCA) is a critical part of the incident management process.
Large Language Models (LLMs) have been used to perform RCA, but are not able to collect additional diagnostic information.
We present an evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft.
arXiv Detail & Related papers (2024-03-07T00:44:01Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models [46.476439550746136]
Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently.
We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage.
Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools.
arXiv Detail & Related papers (2023-10-25T03:53:31Z) - Deep Reinforcement Learning for Uplink Scheduling in NOMA-URLLC Networks [7.182684187774442]
This article addresses the problem of Ultra Reliable Low Communications (URLLC) in wireless networks, a framework with particularly stringent constraints imposed by many Internet of Things (IoT) applications from diverse sectors.
We propose a novel Deep Reinforcement Learning (DRL) scheduling algorithm, to solve the Non-Orthogonal Multiple Access (NOMA) uplink URLLC scheduling problem involving strict deadlines.
arXiv Detail & Related papers (2023-08-28T12:18:02Z) - Disentangled Causal Graph Learning for Online Unsupervised Root Cause
Analysis [49.910053255238566]
Root cause analysis (RCA) can identify the root causes of system faults/failures by analyzing system monitoring data.
Previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process.
We propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model.
arXiv Detail & Related papers (2023-05-18T01:27:48Z) - Fast Decomposition of Temporal Logic Specifications for Heterogeneous
Teams [1.856334276134661]
We focus on decomposing large multi-agent path planning problems into smaller sub-problems that can be solved and executed independently.
The agents' missions are given as Capability Temporal Logic (CaTL) formulas, a fragment of signal temporal logic.
The approach we take is to decompose both the temporal logic specification and the team of agents.
arXiv Detail & Related papers (2020-09-30T18:04:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.