Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments
- URL: http://arxiv.org/abs/2503.12228v1
- Date: Sat, 15 Mar 2025 18:45:33 GMT
- Title: Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments
- Authors: Yihong Jin, Ze Yang, Xinhe Xu, Yihan Zhang, Shuyang Ji
- Abstract summary: This study proposes a novel adaptive fault tolerance mechanism to ensure the reliability and availability of large-scale language models in cloud computing scenarios. It builds upon known fault-tolerant mechanisms, such as checkpointing, redundancy, and state transposition, introducing dynamic resource allocation and prediction of failure based on real-time performance metrics.
- Score: 5.853391005435494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid evolution of Large Language Models (LLMs) and their large-scale experimentation in cloud-computing environments, guaranteeing their security and efficiency under failure has become a central challenge. To ensure the reliability and availability of large-scale language models in cloud computing scenarios marked by frequent resource failures, network problems, and computational overheads, this study proposes a novel adaptive fault tolerance mechanism. It builds upon known fault-tolerant mechanisms, such as checkpointing, redundancy, and state transposition, and introduces dynamic resource allocation and failure prediction based on real-time performance metrics. The hybrid model integrates a data-driven, deep learning-based anomaly detection technique and highlights the role of cloud orchestration middleware in the predictive prevention of system failures. Additionally, the model integrates adaptive checkpointing and recovery strategies that adjust dynamically to load and system state, minimizing both the impact on model performance and downtime. The experimental results demonstrate that the designed model considerably enhances fault tolerance in large-scale cloud environments, decreases system downtime by $\mathbf{30\%}$, and achieves better model availability than the classical fault tolerance mechanism.
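The abstract does not say how the checkpoint interval is adapted, so the following is only an illustrative sketch, not the authors' method. It assumes Young's first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures), and then adjusts its inputs using two hypothetical signals: a `failure_risk` score in [0, 1] (standing in for an anomaly detector's output) and a `load` value in [0, 1]. The linear scaling factors and the clamping bounds are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def adaptive_interval(
    checkpoint_cost_s: float,
    baseline_mtbf_s: float,
    failure_risk: float,
    load: float,
    min_interval_s: float = 60.0,
    max_interval_s: float = 3600.0,
) -> float:
    """Choose a checkpoint interval from predicted failure risk and current load.

    failure_risk and load are hypothetical inputs in [0, 1]; they are not
    quantities defined in the paper's abstract.
    """
    if not 0.0 <= failure_risk <= 1.0:
        raise ValueError("failure_risk must be in [0, 1]")
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be in [0, 1]")

    # Assumption: a higher predicted risk shortens the expected time to
    # failure linearly, down to 10% of the baseline at risk = 1.
    effective_mtbf_s = baseline_mtbf_s * (1.0 - 0.9 * failure_risk)

    # Assumption: writing a checkpoint costs more under load, up to 2x
    # the idle cost when the system is saturated.
    effective_cost_s = checkpoint_cost_s * (1.0 + load)

    interval_s = young_interval(effective_cost_s, effective_mtbf_s)

    # Clamp so the interval never becomes impractically short or long.
    return max(min_interval_s, min(max_interval_s, interval_s))
```

Under these assumptions, a rising risk score shortens the interval (checkpoint more often before a likely failure), while rising load lengthens it (each checkpoint is more expensive), which mirrors the trade-off the abstract describes between recovery time and impact on model performance.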
Related papers
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment.
We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
- Improving $(α, f)$-Byzantine Resilience in Federated Learning via layerwise aggregation and cosine distance [7.8973037023478785]
Federated Learning (FL) is a potential solution to data privacy challenges in distributed machine learning.
FL systems remain vulnerable to Byzantine attacks, where malicious nodes contribute corrupted model updates.
This paper introduces Layerwise Cosine Aggregation, a novel aggregation scheme designed to enhance robustness of these rules in high-dimensional settings.
arXiv Detail & Related papers (2025-03-27T08:07:39Z)
- A Structured Reasoning Framework for Unbalanced Data Classification Using Probabilistic Models [1.6951945839990796]
The paper studies a Markov network model for unbalanced data, aiming to solve the problems of classification bias and insufficient minority class recognition ability.
The experimental results show that the Markov network performs well in indicators such as weighted accuracy, F1 score, and AUC-ROC.
Future research can focus on efficient model training, structural optimization, and deep learning integration in large-scale unbalanced data environments.
arXiv Detail & Related papers (2025-02-05T17:20:47Z)
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning [55.5715496559514]
LoRA Slow Cascade Learning (LoRASC) is an innovative technique designed to enhance LoRA's expressiveness and generalization capabilities.
Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model's ability to capture complex patterns.
arXiv Detail & Related papers (2024-07-01T17:28:59Z)
- Towards Robust and Efficient Cloud-Edge Elastic Model Adaptation via Selective Entropy Distillation [56.79064699832383]
We establish a Cloud-Edge Elastic Model Adaptation (CEMA) paradigm in which the edge models only need to perform forward propagation.
In our CEMA, to reduce the communication burden, we devise two criteria to exclude unnecessary samples from uploading to the cloud.
arXiv Detail & Related papers (2024-02-27T08:47:19Z)
- Deep autoregressive density nets vs neural ensembles for model-based offline reinforcement learning [2.9158689853305693]
We consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts.
This approach is vulnerable to exploiting model errors which can lead to catastrophic failures on the real system.
We show that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark.
arXiv Detail & Related papers (2024-02-05T10:18:15Z)
- Three-Stage Adjusted Regression Forecasting (TSARF) for Software Defect Prediction [5.826476252191368]
Nonhomogeneous Poisson process (NHPP) SRGM are the most commonly employed models.
Increased model complexity presents a challenge in identifying robust and computationally efficient algorithms.
arXiv Detail & Related papers (2024-01-31T02:19:35Z)
- Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision.
Model compression detrimentally impacts the performance of visual prompting-based transfer.
However, negative effects on calibration are not present when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z)
- DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model [12.335763358698564]
We propose DeepFT to proactively avoid system overloads and their adverse effects.
DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system.
It offers a highly scalable solution as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts.
arXiv Detail & Related papers (2022-12-02T16:51:58Z)
- On the Generalization and Adaption Performance of Causal Models [99.64022680811281]
Differentiable causal discovery has proposed to factorize the data generating process into a set of modules.
We study the generalization and adaption performance of such modular neural causal models.
Our analysis shows that the modular neural causal models outperform other models on both zero and few-shot adaptation in low data regimes.
arXiv Detail & Related papers (2022-06-09T17:12:32Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- Bootstrapped model learning and error correction for planning with uncertainty in model-based RL [1.370633147306388]
A natural aim is to learn a model that reflects accurately the dynamics of the environment.
This paper explores the problem of model misspecification through uncertainty-aware reinforcement learning agents.
We propose a bootstrapped multi-headed neural network that learns the distribution of future states and rewards.
arXiv Detail & Related papers (2020-04-15T15:41:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.