Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments
- URL: http://arxiv.org/abs/2406.13604v1
- Date: Wed, 19 Jun 2024 14:49:37 GMT
- Title: Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments
- Authors: Yuhan Zhu, Jian Wang, Bing Li, Xuxian Tang, Hao Li, Neng Zhang, Yuqi Zhao
- Abstract summary: Microservice-based software systems face challenges in accurately localizing root causes when failures occur.
We propose MicroCERCL, a novel approach that pinpoints root causes at the kernel and application level in the cloud-edge collaborative environment.
We show that MicroCERCL can accurately localize the root cause of microservice systems in such environments, significantly outperforming state-of-the-art approaches.
- Score: 9.694588952789257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. Additionally, the cloud-edge collaborative environment introduces more difficulties, such as unstable networks and high latency across network segments. Accurately identifying the root cause of microservices in a cloud-edge collaborative environment has thus become an urgent problem. In this paper, we propose MicroCERCL, a novel approach that pinpoints root causes at the kernel and application level in the cloud-edge collaborative environment. Our key insight is that failures propagate through direct invocations and indirect resource-competition dependencies in a cloud-edge collaborative environment characterized by instability and high latency. This will become more complex in the hybrid deployment that simultaneously involves multiple microservice systems. Leveraging this insight, we extract valid contents from kernel-level logs to prioritize localizing the kernel-level root cause. Moreover, we construct a heterogeneous dynamic topology stack and train a graph neural network model to accurately localize the application-level root cause without relying on historical data. Notably, we released the first benchmark hybrid deployment microservice system in a cloud-edge collaborative environment (the largest and most complex within our knowledge). Experiments conducted on the dataset collected from the benchmark show that MicroCERCL can accurately localize the root cause of microservice systems in such environments, significantly outperforming state-of-the-art approaches with an increase of at least 24.1% in top-1 accuracy.
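The abstract's key insight, that failures spread along both direct invocations and indirect resource-competition dependencies, can be illustrated with a small typed-edge graph. The following is a minimal sketch only and is not MicroCERCL's implementation. The function names (`build_topology`, `may_affect`), the service and host names, and the assumption that two services compete for resources whenever they share a host are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_topology(invocations, placement):
    """Build a set of typed edges (hypothetical representation).

    invocations: iterable of (caller, callee) pairs.
    placement:   dict mapping service name -> host name.
    Returns edges of the form (kind, a, b) with kind in {"invokes", "competes"}.
    """
    edges = set()
    for caller, callee in invocations:
        edges.add(("invokes", caller, callee))

    # Assumption: services placed on the same host compete for its resources.
    by_host = defaultdict(list)
    for service, host in placement.items():
        by_host[host].append(service)
    for services in by_host.values():
        for a, b in combinations(sorted(services), 2):
            edges.add(("competes", a, b))
    return edges


def may_affect(edges, service):
    """Services that a failure in `service` could reach in one hop."""
    reached = set()
    for kind, a, b in edges:
        if kind == "invokes" and b == service:
            reached.add(a)  # a failing callee degrades its caller
        elif kind == "competes" and service in (a, b):
            reached.add(b if a == service else a)  # symmetric dependency
    return reached


# Hypothetical hybrid deployment: "recommender" belongs to a second system
# but shares the edge host with "cart".
invocations = [("frontend", "cart"), ("cart", "db")]
placement = {
    "frontend": "cloud-1",
    "db": "cloud-1",
    "cart": "edge-1",
    "recommender": "edge-1",
}
edges = build_topology(invocations, placement)
print(sorted(may_affect(edges, "recommender")))
print(sorted(may_affect(edges, "db")))
```

In this toy deployment, `recommender` never invokes `cart`, yet a fault in it can still reach `cart` through the shared host. A call graph alone would miss this kind of indirect dependency.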
Related papers
- Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems [51.2882705779387]
Cloud-OpsBench is a large-scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud. It features 452 distinct fault cases across 40 root cause types spanning the full stack.
arXiv Detail & Related papers (2026-02-28T05:04:42Z) - A Secure and Private Distributed Bayesian Federated Learning Design [56.92336577799572]
Distributed Federated Learning (DFL) enables decentralized model training across large-scale systems without a central parameter server. DFL faces three critical challenges: privacy leakage from honest-but-curious neighbors, slow convergence due to the lack of central coordination, and vulnerability to Byzantine adversaries aiming to degrade model accuracy. We propose a novel DFL framework that integrates Byzantine robustness, privacy preservation, and convergence acceleration.
arXiv Detail & Related papers (2026-02-23T16:12:02Z) - Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs [50.075587392477935]
We conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack.
arXiv Detail & Related papers (2026-01-20T06:42:56Z) - Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism [19.31110304702373]
SpecRCA is a speculative root cause analysis framework that adopts a hypothesize-then-verify paradigm. Preliminary experiments on the AIOps 2022 demonstrate that SpecRCA achieves superior accuracy and efficiency compared to existing approaches.
arXiv Detail & Related papers (2026-01-06T05:58:25Z) - FC-ADL: Efficient Microservice Anomaly Detection and Localisation Through Functional Connectivity [2.994962964425238]
We propose FC-ADL, an end-to-end scalable approach for detecting and localising anomalous changes from microservice metrics. We demonstrate that our approach can achieve top detection and localisation performance across a wide range of fault scenarios when compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-11-30T11:29:30Z) - Benchmarking LLMs' Swarm intelligence [50.544186914115045]
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) remains largely unexplored. We introduce SwarmBench, a novel benchmark designed to systematically evaluate tasks of LLMs acting as decentralized agents. We propose metrics for coordination effectiveness and analyze emergent group dynamics.
arXiv Detail & Related papers (2025-05-07T12:32:01Z) - Towards Resource-Efficient Federated Learning in Industrial IoT for Multivariate Time Series Analysis [50.18156030818883]
Anomaly and missing data constitute a thorny problem in industrial applications.
Deep learning enabled anomaly detection has emerged as a critical direction.
The data collected in edge devices contain user privacy.
arXiv Detail & Related papers (2024-11-06T15:38:31Z) - Synthetic Time Series for Anomaly Detection in Cloud Microservices [9.44541023672687]
This paper proposes a framework for time series generation built to investigate anomaly detection in cloud computing.
We detail the pipeline implementation that allows deployment and management of as well as the theoretical approach required to generate anomalies.
Two datasets generated using the proposed framework have been made publicly available through GitHub.
arXiv Detail & Related papers (2024-07-21T11:23:54Z) - CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems [22.00860661894853]
We propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data.
CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization.
arXiv Detail & Related papers (2024-06-28T07:46:51Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - Root Cause Analysis In Microservice Using Neural Granger Causal Discovery [12.35924469567586]
We propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning.
RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery.
In addition, RUN incorporates Pagerank with a vector to efficiently recommend the top-k root causes.
arXiv Detail & Related papers (2024-02-02T04:43:06Z) - Robust Collaborative Inference with Vertically Split Data Over Dynamic Device Environments [15.757660512833006]
In safety-critical applications, collaborative inference must be robust to significant network failures caused by environmental disruptions or extreme weather.
We first formalize the problem of robust collaborative inference over a dynamic network of devices that could experience significant network faults.
Then, we develop a minimalistic yet impactful method called Multiple Aggregation with Gossip Rounds and Simulated Faults (MAGS) that synthesizes simulated faults via dropout, replication, and gossiping to significantly improve robustness over baselines.
arXiv Detail & Related papers (2023-12-27T17:00:09Z) - The PetShop Dataset -- Finding Causes of Performance Issues across Microservices [3.87228935312714]
This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications.
The dataset encompasses latency, requests, and availability metrics emitted in 5-minute intervals from a distributed application.
In addition to normal operation metrics, the dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system.
arXiv Detail & Related papers (2023-11-08T16:30:12Z) - Ensembles of Compact, Region-specific & Regularized Spiking Neural Networks for Scalable Place Recognition [25.0834855255728]
Spiking neural networks have significant potential in robotics due to their high energy efficiency on specialized hardware.
This paper introduces a novel modular ensemble network approach, where compact, localized spiking networks each learn and are solely responsible for recognizing places in a local region only.
It comes with a high-performance cost where a lack of global regularization at deployment time leads to hyperactive neurons that erroneously respond to places outside their learned region.
We evaluate this new scalable modular system on benchmark localization datasets Nordland and Oxford RobotCar, with comparisons to standard techniques NetVLAD, DenseVLAD, and SAD, and a previous spiking
arXiv Detail & Related papers (2022-09-19T02:47:48Z) - Optical flow-based branch segmentation for complex orchard environments [73.11023209243326]
We train a neural network system in simulation only using simulated RGB data and optical flow.
This resulting neural network is able to perform foreground segmentation of branches in a busy orchard environment without additional real-world training or using any special setup or equipment beyond a standard camera.
Our results show that our system is highly accurate and, when compared to a network using manually labeled RGBD data, achieves significantly more consistent and robust performance across environments that differ from the training set.
arXiv Detail & Related papers (2022-02-26T03:38:20Z) - Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies [58.88325379746632]
We present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies as edges to improve the identification and localization of anomalies.
Given a series of metrics, our method predicts the most likely system state - either normal or an anomaly class - and performs localization when an anomaly is detected.
The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus which incorporates information about system component dependencies.
arXiv Detail & Related papers (2021-03-09T06:34:05Z) - Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z) - A Compressive Sensing Approach for Federated Learning over Massive MIMO Communication Systems [82.2513703281725]
Federated learning is a privacy-preserving approach to train a global model at a central server by collaborating with wireless devices.
We present a compressive sensing approach for federated learning over massive multiple-input multiple-output communication systems.
arXiv Detail & Related papers (2020-03-18T05:56:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.