An Intelligent Fault Self-Healing Mechanism for Cloud AI Systems via Integration of Large Language Models and Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2506.07411v1
- Date: Mon, 09 Jun 2025 04:14:05 GMT
- Title: An Intelligent Fault Self-Healing Mechanism for Cloud AI Systems via Integration of Large Language Models and Deep Reinforcement Learning
- Authors: Ze Yang, Yihong Jin, Juntian Liu, Xinhe Xu
- Abstract summary: We propose an Intelligent Fault Self-Healing Mechanism (IFSHM) that integrates Large Language Models (LLMs) and Deep Reinforcement Learning (DRL). IFSHM aims to realize a fault recovery framework with semantic understanding and policy optimization capabilities in cloud AI systems. Experimental results on a cloud fault injection platform show that, compared with existing DRL-based and rule-based methods, the IFSHM framework shortens system recovery time by 37% under unknown fault scenarios.
- Score: 1.1149781202731994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the scale and complexity of cloud-based AI systems continue to increase, the detection and adaptive recovery of system faults have become core challenges for ensuring service reliability and continuity. In this paper, we propose an Intelligent Fault Self-Healing Mechanism (IFSHM) that integrates Large Language Models (LLMs) and Deep Reinforcement Learning (DRL), aiming to realize a fault recovery framework with semantic understanding and policy optimization capabilities in cloud AI systems. On top of a traditional DRL-based control model, the proposed method constructs a two-stage hybrid architecture: (1) an LLM-driven fault semantic interpretation module, which dynamically extracts deep contextual semantics from multi-source logs and system indicators to accurately identify potential fault modes; and (2) a DRL-based recovery strategy optimizer, which learns the dynamic matching between fault types and response behaviors in the cloud environment. The innovation of this method lies in using the LLM for environment modeling and action space abstraction, which greatly improves the exploration efficiency and generalization ability of reinforcement learning. In addition, a memory-guided meta-controller, combined with reinforcement learning experience replay and an LLM prompt fine-tuning strategy, achieves continuous adaptation to new failure modes and avoids catastrophic forgetting. Experimental results on a cloud fault injection platform show that, compared with existing DRL-based and rule-based methods, the IFSHM framework shortens system recovery time by 37% under unknown fault scenarios.
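For concreteness, here is a minimal, self-contained sketch of the two-stage loop the abstract describes. It is an illustration under assumed names, not the authors' system: `llm_classify_fault` is a keyword stub standing in for the LLM semantic interpretation module, and a toy tabular Q-learner stands in for the DRL recovery strategy optimizer; the fault modes, recovery actions, and reward model are all hypothetical.

```python
# Sketch of the two-stage IFSHM idea: (1) map raw logs to a fault mode,
# (2) learn which recovery action to take per fault mode. All names and
# the reward model are illustrative assumptions, not the paper's code.
import random
from collections import defaultdict

FAULT_MODES = ["oom", "network_partition", "disk_full", "unknown"]
RECOVERY_ACTIONS = ["restart_pod", "scale_out", "failover", "clear_cache"]

def llm_classify_fault(log_text: str) -> str:
    """Stand-in for the LLM-driven interpretation module. A real system
    would prompt an LLM with multi-source logs and system indicators;
    a keyword heuristic keeps this sketch self-contained."""
    text = log_text.lower()
    if "out of memory" in text or "oom" in text:
        return "oom"
    if "timeout" in text or "unreachable" in text:
        return "network_partition"
    if "no space left" in text:
        return "disk_full"
    return "unknown"

class RecoveryPolicy:
    """Tabular Q-learning over (fault mode, action) pairs: a toy
    placeholder for the paper's DRL recovery strategy optimizer."""
    def __init__(self, alpha: float = 0.1, epsilon: float = 0.2):
        self.q = defaultdict(float)   # Q-value per (fault, action)
        self.alpha = alpha            # learning rate
        self.epsilon = epsilon        # exploration probability

    def select_action(self, fault: str) -> str:
        if random.random() < self.epsilon:          # explore
            return random.choice(RECOVERY_ACTIONS)
        return max(RECOVERY_ACTIONS, key=lambda a: self.q[(fault, a)])

    def update(self, fault: str, action: str, reward: float) -> None:
        key = (fault, action)                       # one-step bandit update
        self.q[key] += self.alpha * (reward - self.q[key])

def simulate_recovery(fault: str, action: str) -> float:
    """Hypothetical fault-injection environment: reward is the negated
    recovery time, with an assumed best action per fault mode."""
    best = {"oom": "restart_pod", "network_partition": "failover",
            "disk_full": "clear_cache", "unknown": "scale_out"}
    mean_time = 10.0 if action == best[fault] else 40.0
    return -random.gauss(mean_time, 2.0)

if __name__ == "__main__":
    policy = RecoveryPolicy()
    logs = ["OOM killer invoked on worker-3",
            "connection timeout: replica node unreachable",
            "write failed: no space left on device",
            "unexpected latency spike in scheduler"]
    for _ in range(2000):
        fault = llm_classify_fault(random.choice(logs))
        action = policy.select_action(fault)
        policy.update(fault, action, simulate_recovery(fault, action))
    policy.epsilon = 0.0  # act greedily when reporting the learned mapping
    for fault in FAULT_MODES:
        print(f"{fault} -> {policy.select_action(fault)}")
```

The paper's system goes further: the LLM also abstracts the action space, and a memory-guided meta-controller with experience replay and prompt fine-tuning handles adaptation to new failure modes; none of that is modeled in this sketch.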
Related papers
- Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
- Model-free Reinforcement Learning for Model-based Control: Towards Safe, Interpretable and Sample-efficient Agents [6.9290255098776425]
This work introduces model-based agents as a compelling alternative for control policy approximation. These models can encode prior system knowledge to inform, constrain, and aid in explaining the agent's decisions. We outline the benefits and challenges of learning model-based agents.
arXiv Detail & Related papers (2025-07-17T18:59:54Z)
- Leveraging Large Language Model for Intelligent Log Processing and Autonomous Debugging in Cloud AI Platforms [1.819979627431298]
This paper proposes an intelligent log processing and autonomous debugging framework based on a Large Language Model (LLM), named Intelligent Debugger (LLM-ID). Experiments on a cloud platform log dataset show that LLM-ID improves fault localization accuracy by 16.2%, significantly outperforming current mainstream methods.
arXiv Detail & Related papers (2025-06-22T04:58:37Z)
- World Models as Reference Trajectories for Rapid Motor Adaptation [0.0]
Reflexive World Models (RWM) is a dual control framework that uses world model predictions as implicit reference trajectories for rapid adaptation. Our method separates the control problem into long-term reward optimization through reinforcement learning and robust motor execution.
arXiv Detail & Related papers (2025-05-21T14:46:41Z)
- Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing [1.819979627431298]
We propose a novel AI framework based on Large Language Models (LLMs) for intelligent fault detection and self-healing mechanisms in cloud systems. The proposed model significantly outperforms traditional fault detection systems in terms of fault detection accuracy, system downtime reduction, and recovery speed.
arXiv Detail & Related papers (2025-05-16T23:02:57Z)
- Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments [5.853391005435494]
This study proposes a novel adaptive fault tolerance mechanism to ensure the reliability and availability of large-scale language models in cloud computing scenarios. It builds upon known fault-tolerant mechanisms, such as checkpointing, redundancy, and state transposition, introducing dynamic resource allocation and failure prediction based on real-time performance metrics.
arXiv Detail & Related papers (2025-03-15T18:45:33Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning [38.68600863590734]
We propose a novel framework, Variational Subgoal-Conditioned Reinforcement Learning (VSC-RL). VSC-RL reformulates the decision-making problem as a variational subgoal-conditioned RL problem with a newly derived optimization objective, the Subgoal Evidence Lower BOund. We theoretically and empirically demonstrate that VSC-RL improves learning efficiency without compromising performance guarantees.
arXiv Detail & Related papers (2025-02-11T20:57:46Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- Deep autoregressive density nets vs neural ensembles for model-based offline reinforcement learning [2.9158689853305693]
We consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts.
This approach is vulnerable to exploiting model errors, which can lead to catastrophic failures on the real system.
We show that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark.
arXiv Detail & Related papers (2024-02-05T10:18:15Z)
- Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for mean-field control (MFC).
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z)
- Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems [71.14339738190202]
Democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems.
Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper.
The proposed algorithms demonstrate better generalization performance of agents' learning models compared to conventional federated learning (FL) algorithms.
arXiv Detail & Related papers (2020-07-07T08:34:48Z)
- Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement Learning [91.13113161754022]
We introduce timing-based adversarial strategies against a DRL-based navigation system by jamming physical noise patterns into selected time frames.
Our experimental results show that the adversarial timing attacks can lead to a significant performance drop.
arXiv Detail & Related papers (2020-02-20T21:39:25Z)
- Information Theoretic Model Predictive Q-Learning [64.74041985237105]
We present a novel theoretical connection between information-theoretic model predictive control (MPC) and entropy-regularized reinforcement learning (RL).
We develop a Q-learning algorithm that can leverage biased models.
arXiv Detail & Related papers (2019-12-31T00:29:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.