Bio-inspired Agentic Self-healing Framework for Resilient Distributed Computing Continuum Systems
- URL: http://arxiv.org/abs/2601.00339v1
- Date: Thu, 01 Jan 2026 13:30:38 GMT
- Title: Bio-inspired Agentic Self-healing Framework for Resilient Distributed Computing Continuum Systems
- Authors: Alaa Saleh, Praveen Kumar Donta, Roberto Morabito, Sasu Tarkoma, Anders Lindgren, Qiyang Zhang, Schahram Dustdar Susanna Pirttikangas, Lauri Lovén,
- Abstract summary: ReCiSt is a bio-inspired agentic self-healing framework designed to achieve resilience in Distributed Computing Continuum Systems (DCCS)<n>ReCiSt reconstructs the biological phases of Hemostasis, Inflammation, Proliferation, and Remodeling into the computational layers Containment, Diagnosis, Meta-Cognitive, and Knowledge for DCCS.<n>These four layers perform autonomous fault isolation, causal diagnosis, adaptive recovery, and long-term knowledge consolidation through Language Model (LM)-powered agents.
- Score: 4.003029907200818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human biological systems sustain life through extraordinary resilience, continually detecting damage, orchestrating targeted responses, and restoring function through self-healing. Inspired by these capabilities, this paper introduces ReCiSt, a bio-inspired agentic self-healing framework designed to achieve resilience in Distributed Computing Continuum Systems (DCCS). Modern DCCS integrate heterogeneous computing resources, ranging from resource-constrained IoT devices to high-performance cloud infrastructures, and their inherent complexity, mobility, and dynamic operating conditions expose them to frequent faults that disrupt service continuity. These challenges underscore the need for scalable, adaptive, and self-regulated resilience strategies. ReCiSt reconstructs the biological phases of Hemostasis, Inflammation, Proliferation, and Remodeling into the computational layers Containment, Diagnosis, Meta-Cognitive, and Knowledge for DCCS. These four layers perform autonomous fault isolation, causal diagnosis, adaptive recovery, and long-term knowledge consolidation through Language Model (LM)-powered agents. These agents interpret heterogeneous logs, infer root causes, refine reasoning pathways, and reconfigure resources with minimal human intervention. The proposed ReCiSt framework is evaluated on public fault datasets using multiple LMs, and no baseline comparison is included due to the scarcity of similar approaches. Nevertheless, our results, evaluated under different LMs, confirm ReCiSt's self-healing capabilities within tens of seconds with minimum of 10% of agent CPU usage. Our results also demonstrated depth of analysis to over come uncertainties and amount of micro-agents invoked to achieve resilience.
Related papers
- ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation [72.78362530982109]
ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, is a framework that decouples exploration from commitment.<n>We show that naive LLM-based simulators struggle to capture rare but high-impact failure modes.<n>We introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions.
arXiv Detail & Related papers (2026-02-02T06:33:22Z) - Multi-Agent Collaborative Intrusion Detection for Low-Altitude Economy IoT: An LLM-Enhanced Agentic AI Framework [60.72591149679355]
The rapid expansion of low-altitude economy Internet of Things (LAE-IoT) networks has created unprecedented security challenges.<n>Traditional intrusion detection systems fail to tackle the unique characteristics of aerial IoT environments.<n>We introduce a large language model (LLM)-enabled agentic AI framework for enhancing intrusion detection in LAE-IoT networks.
arXiv Detail & Related papers (2026-01-25T12:47:25Z) - AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization [28.032865974227875]
RetroSum is a framework that unifies a retrospective summarization mechanism with an evolving experience strategy.<n>It achieves performance gains of up to 29.16% over competitive baselines, while significantly decreasing total interaction errors by up to 92.3%.
arXiv Detail & Related papers (2026-01-20T12:48:04Z) - Logic-Driven Semantic Communication for Resilient Multi-Agent Systems [26.964933264412412]
6G networks are accelerating autonomy and intelligence in large-scale, decentralized multi-agent systems.<n>This article proposes a formal definition of MAS resilience grounded in two complementary dimensions.<n>We design an agent architecture and develop decentralized algorithms to achieve both epistemic and action resilience.
arXiv Detail & Related papers (2026-01-11T00:54:09Z) - SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent for translating high-level research objectives into standardized experimental configurations.<n>An Experiment Manager orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback.<n>Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z) - Reinforcement Learning for Self-Healing Material Systems [0.40334315349753025]
This study frames the self-healing process as a Reinforcement Learning problem within a Markov Decision Process (MDP), enabling agents to autonomously derive optimal policies.<n>A comparative evaluation of discrete-action (Q-learning, DQN) and continuous-action (TD3) agents in a simulation environment revealed that RL controllers significantly outperform baselines, achieving near-complete material recovery.
arXiv Detail & Related papers (2025-11-24T03:42:00Z) - Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution [76.66229730098759]
In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models.<n>We propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution.<n>We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert.
arXiv Detail & Related papers (2025-11-20T04:11:44Z) - The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems [22.357102759752234]
This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse multi-agent systems.<n>Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, and (2) a multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents' states evolve.
arXiv Detail & Related papers (2025-11-19T20:39:19Z) - Resilient by Design -- Active Inference for Distributed Continuum Intelligence [6.2095604527397485]
This paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems.<n>PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free energy principle, and (iii) autonomously healing issues through active inference.
arXiv Detail & Related papers (2025-11-10T15:30:44Z) - Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents.<n>ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies.<n>Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z) - An Intelligent Fault Self-Healing Mechanism for Cloud AI Systems via Integration of Large Language Models and Deep Reinforcement Learning [1.1149781202731994]
We propose an Intelligent Fault Self-Healing Mechanism (IFSHM) that integrates Large Language Model (LLM) and Deep Reinforcement Learning (DRL)<n>IFSHM aims to realize a fault recovery framework with semantic understanding and policy optimization capabilities in cloud AI systems.<n> Experimental results on the cloud fault injection platform show that compared with the existing DRL and rule methods, the IFSHM framework shortens the system recovery time by 37% with unknown fault scenarios.
arXiv Detail & Related papers (2025-06-09T04:14:05Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [86.76714527437383]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.<n>We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.<n>Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - Cooperative Resilience in Artificial Intelligence Multiagent Systems [2.0608564715600273]
This paper proposes a clear definition of cooperative resilience' and a methodology for its quantitative measurement.
The results highlight the crucial role of resilience metrics in analyzing how the collective system prepares for, resists, recovers from, sustains well-being, and transforms in the face of disruptions.
arXiv Detail & Related papers (2024-09-20T03:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.