From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems
- URL: http://arxiv.org/abs/2510.24145v2
- Date: Fri, 07 Nov 2025 07:03:20 GMT
- Title: From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems
- Authors: Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei
- Abstract summary: OpsAgent is a lightweight, self-evolving multi-agent system for incident management. It employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions. OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving.
- Score: 9.492890623016335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems.
Related papers
- SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams [53.78257200138774]
We propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules. We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily.
arXiv Detail & Related papers (2026-01-14T14:31:16Z) - Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection [76.91230292971115]
Large language model (LLM)-based multi-agent systems (MAS) have shown strong capabilities in solving complex tasks. XG-Guard is an explainable and fine-grained safeguarding framework for detecting malicious agents in MAS.
arXiv Detail & Related papers (2025-12-21T13:46:36Z) - Towards Efficient Agents: A Co-Design of Inference Architecture and System [66.59916327634639]
This paper presents AgentInfer, a unified framework for end-to-end agent acceleration. We decompose the problem into four synergistic components: AgentCollab, AgentSched, AgentSAM, and AgentCompress. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%.
arXiv Detail & Related papers (2025-12-20T12:06:13Z) - AgentEvolver: Towards Efficient Self-Evolving Agent System [51.54882384204726]
We present AgentEvolver, a self-evolving agent system that drives autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: self-questioning, self-navigating, and self-attributing. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
arXiv Detail & Related papers (2025-11-13T15:14:47Z) - A Survey of Data Agents: Emerging Paradigm or Overstated Hype? [66.1526688475023]
"Data agent" currently suffers from terminological ambiguity and inconsistent adoption. This survey introduces the first systematic hierarchical taxonomy for data agents. We conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
arXiv Detail & Related papers (2025-10-27T17:54:07Z) - A Survey on Agentic Multimodal Large Language Models [84.18778056010629]
We present a comprehensive survey on Agentic Multimodal Large Language Models (Agentic MLLMs). We explore the emerging paradigm of agentic MLLMs, delineating their conceptual foundations and distinguishing characteristics from conventional MLLM-based agents. To further accelerate research in this area for the community, we compile open-source training frameworks, training and evaluation datasets for developing agentic MLLMs.
arXiv Detail & Related papers (2025-10-13T04:07:01Z) - Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges [13.53016942028838]
Large language models (LLMs) are capable of using natural language to integrate information, follow instructions, and perform forms of "reasoning" and planning. With its multimodal data streams and orchestrated workflows spanning multiple systems, radiology is uniquely suited to benefit from agents that can adapt to context and automate repetitive yet complex tasks. This review examines the design of such LLM agentic systems, highlights key applications, discusses evaluation methods for planning and tool use, and outlines challenges such as error cascades, tool-use efficiency, and health IT integration.
arXiv Detail & Related papers (2025-10-10T13:56:27Z) - InfiAgent: Self-Evolving Pyramid Agent Framework for Infinite Scenarios [28.65914611521654]
InfiAgent is a Pyramid-like DAG-based Multi-Agent Framework that can be applied to infinite scenarios. InfiAgent achieves 9.9% higher performance compared to ADAS (a similar auto-generated agent framework).
arXiv Detail & Related papers (2025-09-26T15:44:09Z) - AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production [4.031479494871582]
We present AgentCompass, the first evaluation framework designed specifically for post-deployment monitoring and reasoning of agentic pipelines. AgentCompass achieves state-of-the-art results on key metrics, while uncovering critical issues missed in human annotations.
arXiv Detail & Related papers (2025-09-18T05:59:04Z) - WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning [73.91893534088798]
WebSailor is a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation. WebSailor significantly outperforms all open-source agents in complex information-seeking tasks.
arXiv Detail & Related papers (2025-09-16T17:57:03Z) - Multi-Agent Data Visualization and Narrative Generation [1.935127147843886]
We present a lightweight multi-agent system that automates the data analysis workflow. Our approach combines a hybrid multi-agent architecture with deterministic components, strategically externalizing critical logic. The system delivers granular, modular outputs that enable surgical modifications without full regeneration.
arXiv Detail & Related papers (2025-08-30T12:39:55Z) - Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld [20.01452161733642]
We propose a dynamic Multi-Agent System (MAS) in our AWorld framework. An Execution Agent is supervised by a Guard Agent that provides on-demand dynamic maneuvering. Our system achieves first place among open-source projects on the prestigious GAIA leaderboard.
arXiv Detail & Related papers (2025-08-13T15:46:25Z) - AgentOps: Enabling Observability of LLM Agents [12.49728300301026]
Large language model (LLM) agents raise significant concerns on AI safety due to their autonomous and non-deterministic behavior. We present a comprehensive taxonomy of AgentOps, identifying the artifacts and associated data that should be traced throughout the entire lifecycle of agents to achieve effective observability. Our taxonomy serves as a reference template for developers to design and implement AgentOps infrastructure that supports monitoring, logging, and analytics.
arXiv Detail & Related papers (2024-11-08T02:31:03Z) - AgentScope: A Flexible yet Robust Multi-Agent Platform [66.64116117163755]
AgentScope is a developer-centric multi-agent platform with message exchange as its core communication mechanism.
The abundant syntactic tools, built-in agents and service functions, user-friendly interfaces for application demonstration and utility monitor, zero-code programming workstation, and automatic prompt tuning mechanism significantly lower the barriers to both development and deployment.
arXiv Detail & Related papers (2024-02-21T04:11:28Z) - MMRNet: Improving Reliability for Multimodal Object Detection and Segmentation for Bin Picking via Multimodal Redundancy [68.7563053122698]
We propose a reliable object detection and segmentation system with MultiModal Redundancy (MMRNet).
This is the first system that introduces the concept of multimodal redundancy to address sensor failure issues during deployment.
We present a new label-free multi-modal consistency (MC) score that utilizes the output from all modalities to measure the overall system output reliability and uncertainty.
arXiv Detail & Related papers (2022-10-19T19:15:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.