Related papers: Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight

Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight

URL: http://arxiv.org/abs/2407.08694v1
Date: Thu, 11 Jul 2024 17:31:12 GMT
Title: Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
Authors: Zhiqiang Xie, Yujia Zheng, Lizi Ottens, Kun Zhang, Christos Kozyrakis, Jonathan Mace,
Abstract summary: We present Atlas, a novel approach to automatically synthesizing causal graphs for cloud systems. We evaluate Atlas across a range of fault localization scenarios and demonstrate that Atlas is capable of generating causal graphs in a scalable and generalizable manner.
Score: 12.272468397322738
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Runtime failure and performance degradation is commonplace in modern cloud systems. For cloud providers, automatically determining the root cause of incidents is paramount to ensuring high reliability and availability as prompt fault localization can enable faster diagnosis and triage for timely resolution. A compelling solution explored in recent work is causal reasoning using causal graphs to capture relationships between varied cloud system performance metrics. To be effective, however, systems developers must correctly define the causal graph of their system, which is a time-consuming, brittle, and challenging task that increases in difficulty for large and dynamic systems and requires domain expertise. Alternatively, automated data-driven approaches have limited efficacy for cloud systems due to the inherent rarity of incidents. In this work, we present Atlas, a novel approach to automatically synthesizing causal graphs for cloud systems. Atlas leverages large language models (LLMs) to generate causal graphs using system documentation, telemetry, and deployment feedback. Atlas is complementary to data-driven causal discovery techniques, and we further enhance Atlas with a data-driven validation step. We evaluate Atlas across a range of fault localization scenarios and demonstrate that Atlas is capable of generating causal graphs in a scalable and generalizable manner, with performance that far surpasses that of data-driven algorithms and is commensurate to the ground-truth baseline.

Related papers

Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing [1.819979627431298]
We propose a novel AI framework based on Massive Language Model (LLM) for intelligent fault detection and self-healing mechanisms in cloud systems.<n>The proposed model is significantly better than the traditional fault detection system in terms of fault detection accuracy, system downtime reduction and recovery speed.
arXiv Detail & Related papers (2025-05-16T23:02:57Z)
Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments [5.853391005435494]
This study proposes a novel adaptive fault tolerance mechanism to ensure the reliability and availability of large-scale language models in cloud computing scenarios. It builds upon known fault-tolerant mechanisms, such as checkpointing, redundancy, and state transposition, introducing dynamic resource allocation and prediction of failure based on real-time performance metrics.
arXiv Detail & Related papers (2025-03-15T18:45:33Z)
LLM Assisted Anomaly Detection Service for Site Reliability Engineers: Enhancing Cloud Infrastructure Resilience [5.644170923282226]
This paper introduces a scalable Anomaly Detection Service with a generalizable API tailored for industrial time-series data. We provide insights into the usage patterns of the service, with over 500 users and 200,000 API calls in a year. We plan to extend the system to include time series foundation models, enabling zero-shot anomaly detection capabilities.
arXiv Detail & Related papers (2025-01-28T06:41:37Z)
Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset [1.293050392312921]
We introduce a new high-dimensional dataset from IBM Cloud, collected over 4.5 months from the IBM Cloud Console. This dataset comprises 39,365 rows and 117,448 columns of telemetry data. We demonstrate the application of machine learning models for anomaly detection and discuss the key challenges faced in this process.
arXiv Detail & Related papers (2024-11-13T22:04:19Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
Distribution-aware Interactive Attention Network and Large-scale Cloud Recognition Benchmark on FY-4A Satellite Image [24.09239785062109]
We develop a novel dataset for accurate cloud recognition. We use domain adaptation methods to align 70,419 image-label pairs in terms of projection, temporal resolution, and spatial resolution. We also introduce a Distribution-aware Interactive-Attention Network (DIAnet), which preserves pixel-level details through a high-resolution branch and a parallel cross-branch.
arXiv Detail & Related papers (2024-01-06T09:58:09Z)
Identifying Performance Issues in Cloud Service Systems Based on Relational-Temporal Features [11.83269525626691]
Cloud systems are susceptible to performance issues, which may cause service-level agreement violations and financial losses. We propose a learning-based approach that leverages both the relational and temporal features of metrics to identify performance issues.
arXiv Detail & Related papers (2023-07-20T13:41:26Z)
Deep Temporal Graph Clustering [77.02070768950145]
We propose a general framework for deep Temporal Graph Clustering (GC) GC introduces deep clustering techniques to suit the interaction sequence-based batch-processing pattern of temporal graphs. Our framework can effectively improve the performance of existing temporal graph learning methods.
arXiv Detail & Related papers (2023-05-18T06:17:50Z)
Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data [44.64190826937705]
We present scalable algorithms for detecting label errors and outlier data based on the relational graph structure of data. We also introduce a visualization tool that provides contextual information of a data point in the feature-embedded space. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in large-scale real-world datasets.
arXiv Detail & Related papers (2023-01-29T02:09:13Z)
Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning [0.11470070927586014]
This paper presents a novel approach for analyzing failure data from cloud systems, in order to relieve human analysts from manually fine-tuning the data for feature engineering. The approach leverages Deep Embedded Clustering (DEC), a family of unsupervised clustering algorithms based on deep learning. The results show that the performance of the proposed approach, in terms of purity of clusters, is comparable to, or in some cases even better than manually fine-tuned clustering.
arXiv Detail & Related papers (2021-06-29T09:00:41Z)
Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies [58.88325379746632]
We present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies as edges to improve the identification and localization of anomalies. Given a series of metric, our method predicts the most likely system state - either normal or an anomaly class - and performs localization when an anomaly is detected. The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus which incorporates information about system component dependencies.
arXiv Detail & Related papers (2021-03-09T06:34:05Z)
Anomaly Detection on Attributed Networks via Contrastive Self-Supervised Learning [50.24174211654775]
We present a novel contrastive self-supervised learning framework for anomaly detection on attributed networks. Our framework fully exploits the local information from network data by sampling a novel type of contrastive instance pair. A graph neural network-based contrastive learning model is proposed to learn informative embedding from high-dimensional attributes and local structure.
arXiv Detail & Related papers (2021-02-27T03:17:20Z)
TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied on IT system operation and maintenance. One direction aims at the recognition of re-occurring anomaly types to enable remediation automation. We propose a method that is invariant to dimensionality changes of given data.
arXiv Detail & Related papers (2021-02-25T14:24:49Z)
Adaptive Graph Auto-Encoder for General Data Clustering [90.8576971748142]
Graph-based clustering plays an important role in the clustering area. Recent studies about graph convolution neural networks have achieved impressive success on graph type data. We propose a graph auto-encoder for general data clustering, which constructs the graph adaptively according to the generative perspective of graphs.
arXiv Detail & Related papers (2020-02-20T10:11:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.