CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms
- URL: http://arxiv.org/abs/2111.03753v1
- Date: Fri, 5 Nov 2021 23:03:21 GMT
- Title: CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms
- Authors: Yingying Zhang, Zhengxiong Guan, Huajie Qian, Leili Xu, Hengbo Liu,
Qingsong Wen, Liang Sun, Junwei Jiang, Lunting Fan, Min Ke
- Abstract summary: We propose a root cause analysis framework called CloudRCA.
It makes use of heterogeneous multi-source data including Key Performance Indicators (KPIs), logs, as well as topology, and extracts important features.
It consistently outperforms existing approaches in f1-score across different cloud systems.
- Score: 10.385807432472854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Alibaba's business expands across the world among various industries,
higher standards are imposed on the service quality and reliability of big data
cloud computing platforms which constitute the infrastructure of Alibaba Cloud.
However, root cause analysis in these platforms is non-trivial due to the
complicated system architecture. In this paper, we propose a root cause
analysis framework called CloudRCA which makes use of heterogeneous
multi-source data including Key Performance Indicators (KPIs), logs, as well as
topology, and extracts important features via state-of-the-art anomaly
detection and log analysis techniques. The engineered features are then
utilized in a Knowledge-informed Hierarchical Bayesian Network (KHBN) model to
infer root causes with high accuracy and efficiency. An ablation study and
comprehensive experimental comparisons demonstrate that, compared to existing
frameworks, CloudRCA 1) consistently outperforms existing approaches in
f1-score across different cloud systems; 2) can handle novel types of root
causes thanks to the hierarchical structure of KHBN; 3) performs more robustly
with respect to algorithmic configurations; and 4) scales more favorably in the
data and feature sizes. Experiments also show that a cross-platform transfer
learning mechanism can be adopted to further improve the accuracy by more than
10%. CloudRCA has been integrated into the diagnosis system of Alibaba Cloud
and employed in three typical cloud computing platforms including MaxCompute,
Realtime Compute and Hologres. It saves Site Reliability Engineers (SREs) more
than 20% in the time spent on resolving failures in the past twelve months
and improves service reliability significantly.
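
The abstract describes feeding engineered features (KPI anomalies, log patterns) into a hierarchical Bayesian network to infer root causes. The following is a minimal sketch of that general idea only, not the paper's KHBN model: it assumes a hypothetical two-level hierarchy (root-cause category, then root cause), binary features that are conditionally independent given the cause, and made-up probabilities. Every name and number in it is an illustrative assumption.

```python
from math import prod

# Hypothetical toy model: all names and probabilities are illustrative
# assumptions, not values or structures taken from the CloudRCA paper.

# Level 1: prior over root-cause categories.
CATEGORY_PRIOR = {"resource": 0.6, "software": 0.4}

# Level 2: P(root cause | category).
CAUSE_GIVEN_CATEGORY = {
    "resource": {"disk_full": 0.5, "cpu_overload": 0.5},
    "software": {"bad_config": 1.0},
}

# Leaves: P(feature is anomalous | root cause). Features are assumed
# conditionally independent given the root cause.
FEATURE_GIVEN_CAUSE = {
    "disk_full": {"kpi_disk_usage": 0.9, "kpi_cpu": 0.2, "log_io_error": 0.8},
    "cpu_overload": {"kpi_disk_usage": 0.1, "kpi_cpu": 0.9, "log_io_error": 0.1},
    "bad_config": {"kpi_disk_usage": 0.1, "kpi_cpu": 0.3, "log_io_error": 0.3},
}


def posterior(observed):
    """Return P(root cause | observed binary features) by enumeration."""
    scores = {}
    for category, p_category in CATEGORY_PRIOR.items():
        for cause, p_cause in CAUSE_GIVEN_CATEGORY[category].items():
            # Likelihood of the observed feature pattern under this cause.
            likelihood = prod(
                p if observed[feature] else 1.0 - p
                for feature, p in FEATURE_GIVEN_CAUSE[cause].items()
            )
            scores[cause] = p_category * p_cause * likelihood
    total = sum(scores.values())
    return {cause: score / total for cause, score in scores.items()}


def rank_root_causes(observed):
    """Sort candidate root causes from most to least probable."""
    return sorted(posterior(observed).items(), key=lambda kv: kv[1], reverse=True)


# Example incident: disk-usage KPI and I/O-error log pattern are anomalous.
incident = {"kpi_disk_usage": True, "kpi_cpu": False, "log_io_error": True}
ranking = rank_root_causes(incident)
```

With the assumed numbers above, `ranking[0][0]` is `"disk_full"`. In a setup like this the category level lets a new root cause be added under an existing category without redefining the priors of the others, which is one plausible reading of why a hierarchical structure helps with novel root-cause types.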
Related papers
- Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems [51.2882705779387]
Cloud-OpsBench is a large-scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud.
It features 452 distinct fault cases across 40 root cause types spanning the full stack.
arXiv Detail & Related papers (2026-02-28T05:04:42Z) - MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents [12.160412894251406]
MicroRCA-Agent is an innovative solution for microservice root cause analysis based on large language model agents.
The proposed solution demonstrates superior performance in complex microservice fault scenarios, achieving a final score of 50.71.
arXiv Detail & Related papers (2025-09-19T05:57:03Z) - KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems [28.36823614956519]
We propose an efficient method combining similarity and causality analysis.
It uses symbolic aggregate approximation for compact representation, improving analysis efficiency.
Deployment in Cloud H revealed two drawbacks: anomaly detection misses some performance anomalies, and SAX representation fails to capture intricate variation trends.
arXiv Detail & Related papers (2025-06-05T02:42:07Z) - Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM [13.293736787442414]
We introduce SynergyRCA, an innovative tool for root cause analysis.
SynergyRCA constructs a StateGraph to capture spatial and temporal relationships.
It can identify root causes in an average time of about two minutes and achieves an impressive precision of approximately 0.90.
arXiv Detail & Related papers (2025-06-03T06:09:13Z) - AnomalyGen: An Automated Semantic Log Sequence Generation Framework with LLM for Anomaly Detection [25.83270938475311]
AnomalyGen is the first automated log synthesis framework specifically designed for anomaly detection.
Our framework integrates enhanced program analysis with Chain-of-Thought reasoning (CoT reasoning) to enable iterative log generation and anomaly annotation.
When augmenting benchmark datasets with synthesized logs, we observe maximum F1-score improvements of 3.7%.
arXiv Detail & Related papers (2025-04-16T16:54:38Z) - Causal AI-based Root Cause Identification: Research to Practice at Scale [2.455633941531165]
We have developed a novel causality-based Root Cause Identification (RCI) algorithm that emphasizes causation over correlation.
This paper highlights Instana's advanced failure diagnosis capabilities, discussing both the theoretical underpinnings and practical implementations of the RCI algorithm.
arXiv Detail & Related papers (2025-02-25T14:20:33Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight [12.272468397322738]
We present Atlas, a novel approach to automatically synthesizing causal graphs for cloud systems.
We evaluate Atlas across a range of fault localization scenarios and demonstrate that Atlas is capable of generating causal graphs in a scalable and generalizable manner.
arXiv Detail & Related papers (2024-07-11T17:31:12Z) - CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems [22.00860661894853]
We propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data.
CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization.
arXiv Detail & Related papers (2024-06-28T07:46:51Z) - Scalable Spatiotemporal Prediction with Bayesian Neural Fields [3.3299088915999295]
BayesNF is a novel deep neural network architecture for high-capacity function estimation.
We evaluate BayesNF against statistical machine-learning prediction problems from climate and public health datasets.
arXiv Detail & Related papers (2024-03-12T13:47:50Z) - Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z) - $\beta$-DARTS++: Bi-level Regularization for Proxy-robust Differentiable Architecture Search [96.99525100285084]
A regularization method, Beta-Decay, is proposed to regularize the DARTS-based NAS searching process (i.e., $\beta$-DARTS).
In-depth theoretical analyses on how it works and why it works are provided.
arXiv Detail & Related papers (2023-01-16T12:30:32Z) - Distributed intelligence on the Edge-to-Cloud Continuum: A systematic literature review [62.997667081978825]
This review aims at providing a comprehensive vision of the main state-of-the-art libraries and frameworks for machine learning and data analytics available today.
The main simulation, emulation, deployment systems, and testbeds for experimental research on the Edge-to-Cloud Continuum available today are also surveyed.
arXiv Detail & Related papers (2022-04-29T08:06:05Z) - NetRCA: An Effective Network Fault Cause Localization Algorithm [22.88986905436378]
Localizing the root cause of network faults is crucial to network operation and maintenance.
We propose a novel algorithm named NetRCA to deal with this problem.
Experiments and analysis are conducted on the real-world dataset from ICASSP 2022 AIOps Challenge.
arXiv Detail & Related papers (2022-02-23T02:03:35Z) - FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud [4.160063446731227]
We introduce FIXME to enhance software reliability with hybrid diagnosis approaches for enterprises.
Our evaluation results show that using a hybrid diagnosis approach is about 17% better in precision.
arXiv Detail & Related papers (2021-02-17T02:34:26Z) - A Privacy-Preserving Distributed Architecture for Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service.
It is able to preserve the user sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z) - Searching Central Difference Convolutional Networks for Face Anti-Spoofing [68.77468465774267]
Face anti-spoofing (FAS) plays a vital role in face recognition systems.
Most state-of-the-art FAS methods rely on stacked convolutions and expert-designed network.
Here we propose a novel frame-level FAS method based on Central Difference Convolution (CDC).
arXiv Detail & Related papers (2020-03-09T12:48:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.