Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing
- URL: http://arxiv.org/abs/2505.11743v1
- Date: Fri, 16 May 2025 23:02:57 GMT
- Title: Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing
- Authors: Cheng Ji, Huaiying Luo,
- Abstract summary: We propose a novel AI framework based on Massive Language Model (LLM) for intelligent fault detection and self-healing mechanisms in cloud systems.<n>The proposed model is significantly better than the traditional fault detection system in terms of fault detection accuracy, system downtime reduction and recovery speed.
- Score: 1.819979627431298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of cloud computing systems and the increasing complexity of their infrastructure, intelligent mechanisms to detect and mitigate failures in real time are becoming increasingly important. Traditional methods of failure detection are often difficult to cope with the scale and dynamics of modern cloud environments. In this study, we propose a novel AI framework based on Massive Language Model (LLM) for intelligent fault detection and self-healing mechanisms in cloud systems. The model combines existing machine learning fault detection algorithms with LLM's natural language understanding capabilities to process and parse system logs, error reports, and real-time data streams through semantic context. The method adopts a multi-level architecture, combined with supervised learning for fault classification and unsupervised learning for anomaly detection, so that the system can predict potential failures before they occur and automatically trigger the self-healing mechanism. Experimental results show that the proposed model is significantly better than the traditional fault detection system in terms of fault detection accuracy, system downtime reduction and recovery speed.
Related papers
- An Intelligent Fault Self-Healing Mechanism for Cloud AI Systems via Integration of Large Language Models and Deep Reinforcement Learning [1.1149781202731994]
We propose an Intelligent Fault Self-Healing Mechanism (IFSHM) that integrates Large Language Model (LLM) and Deep Reinforcement Learning (DRL)<n>IFSHM aims to realize a fault recovery framework with semantic understanding and policy optimization capabilities in cloud AI systems.<n> Experimental results on the cloud fault injection platform show that compared with the existing DRL and rule methods, the IFSHM framework shortens the system recovery time by 37% with unknown fault scenarios.
arXiv Detail & Related papers (2025-06-09T04:14:05Z) - Anomaly Detection and Early Warning Mechanism for Intelligent Monitoring Systems in Multi-Cloud Environments Based on LLM [1.1149781202731994]
We propose an anomaly detection and early warning mechanism for intelligent monitoring system in multi-cloud environment based on Large-Scale Language Model (LLM)<n>On the basis of the existing monitoring framework, the proposed model innovatively introduces a multi-level feature extraction method to enhance the accuracy of anomaly detection and improve the real-time response efficiency.<n> Experimental results show that the proposed model is significantly better than the traditional anomaly detection system in terms of detection accuracy and latency, and significantly improves the resilience and active management ability of cloud infrastructure.
arXiv Detail & Related papers (2025-06-09T04:00:23Z) - Time-Series Learning for Proactive Fault Prediction in Distributed Systems with Deep Neural Structures [5.572536027964037]
This paper addresses the challenges of fault prediction and delayed response in distributed systems.<n>We use a Gated Recurrent Unit to model the evolution of system states over time.<n>An attention mechanism is then applied to enhance key temporal segments, improving the model's ability to identify potential faults.
arXiv Detail & Related papers (2025-05-27T04:31:12Z) - Research on Cloud Platform Network Traffic Monitoring and Anomaly Detection System based on Large Language Models [5.524069089627854]
This paper introduces a large language model (LLM)-based network traffic monitoring and anomaly detection system.<n>A pre-trained large language model analyzes and predicts the probable network traffic, and an anomaly detection layer considers temporality and context.<n>Results show that the designed model outperforms traditional methods in detection accuracy and computational efficiency.
arXiv Detail & Related papers (2025-04-22T07:42:07Z) - Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments [5.853391005435494]
This study proposes a novel adaptive fault tolerance mechanism to ensure the reliability and availability of large-scale language models in cloud computing scenarios.<n>It builds upon known fault-tolerant mechanisms, such as checkpointing, redundancy, and state transposition, introducing dynamic resource allocation and prediction of failure based on real-time performance metrics.
arXiv Detail & Related papers (2025-03-15T18:45:33Z) - Real-Time Anomaly Detection and Reactive Planning with Large Language Models [18.57162998677491]
Foundation models, e.g., large language models (LLMs), trained on internet-scale data possess zero-shot capabilities.
We present a two-stage reasoning framework that incorporates the judgement regarding potential anomalies into a safe control framework.
This enables our monitor to improve the trustworthiness of dynamic robotic systems, such as quadrotors or autonomous vehicles.
arXiv Detail & Related papers (2024-07-11T17:59:22Z) - Representing Timed Automata and Timing Anomalies of Cyber-Physical
Production Systems in Knowledge Graphs [51.98400002538092]
This paper aims to improve model-based anomaly detection in CPPS by combining the learned timed automaton with a formal knowledge graph about the system.
Both the model and the detected anomalies are described in the knowledge graph in order to allow operators an easier interpretation of the model and the detected anomalies.
arXiv Detail & Related papers (2023-08-25T15:25:57Z) - AttNS: Attention-Inspired Numerical Solving For Limited Data Scenarios [51.94807626839365]
We propose the attention-inspired numerical solver (AttNS) to solve differential equations due to limited data.<n>AttNS is inspired by the effectiveness of attention modules in Residual Neural Networks (ResNet) in enhancing model generalization and robustness.
arXiv Detail & Related papers (2023-02-05T01:39:21Z) - Towards an Awareness of Time Series Anomaly Detection Models'
Adversarial Vulnerability [21.98595908296989]
We demonstrate that the performance of state-of-the-art anomaly detection methods is degraded substantially by adding only small adversarial perturbations to the sensor data.
We use different scoring metrics such as prediction errors, anomaly, and classification scores over several public and private datasets.
We demonstrate, for the first time, the vulnerabilities of anomaly detection systems against adversarial attacks.
arXiv Detail & Related papers (2022-08-24T01:55:50Z) - Self-Supervised Training with Autoencoders for Visual Anomaly Detection [61.62861063776813]
We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold.
We adapt a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples.
We achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.
arXiv Detail & Related papers (2022-06-23T14:16:30Z) - Data-driven Residual Generation for Early Fault Detection with Limited
Data [4.129225533930966]
In many complex systems it is not feasible to develop highly accurate models for the systems.
Data-driven solutions have received an immense attention in the industries systems for several practical reasons.
Unlike the model-based methods it is straight forward to combine time series measurements such as pressure and voltage with other sources of information.
arXiv Detail & Related papers (2021-09-28T03:18:03Z) - TELESTO: A Graph Neural Network Model for Anomaly Classification in
Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied on IT system operation and maintenance.
One direction aims at the recognition of re-occurring anomaly types to enable remediation automation.
We propose a method that is invariant to dimensionality changes of given data.
arXiv Detail & Related papers (2021-02-25T14:24:49Z) - Using Data Assimilation to Train a Hybrid Forecast System that Combines
Machine-Learning and Knowledge-Based Components [52.77024349608834]
We consider the problem of data-assisted forecasting of chaotic dynamical systems when the available data is noisy partial measurements.
We show that by using partial measurements of the state of the dynamical system, we can train a machine learning model to improve predictions made by an imperfect knowledge-based model.
arXiv Detail & Related papers (2021-02-15T19:56:48Z) - A Novel Anomaly Detection Algorithm for Hybrid Production Systems based
on Deep Learning and Timed Automata [73.38551379469533]
DAD:DeepAnomalyDetection is a new approach for automatic model learning and anomaly detection in hybrid production systems.
It combines deep learning and timed automata for creating behavioral model from observations.
The algorithm has been applied to few data sets including two from real systems and has shown promising results.
arXiv Detail & Related papers (2020-10-29T08:27:43Z) - Self-organizing Democratized Learning: Towards Large-scale Distributed
Learning Systems [71.14339738190202]
democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems.
Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper.
The proposed algorithms demonstrate better results in the generalization performance of learning models in agents compared to the conventional FL algorithms.
arXiv Detail & Related papers (2020-07-07T08:34:48Z) - Enhanced Adversarial Strategically-Timed Attacks against Deep
Reinforcement Learning [91.13113161754022]
We introduce timing-based adversarial strategies against a DRL-based navigation system by jamming in physical noise patterns on the selected time frames.
Our experimental results show that the adversarial timing attacks can lead to a significant performance drop.
arXiv Detail & Related papers (2020-02-20T21:39:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.