Serial Parallel Reliability Redundancy Allocation Optimization for Energy Efficient and Fault Tolerant Cloud Computing
- URL: http://arxiv.org/abs/2404.03665v1
- Date: Fri, 16 Feb 2024 16:46:10 GMT
- Title: Serial Parallel Reliability Redundancy Allocation Optimization for Energy Efficient and Fault Tolerant Cloud Computing
- Authors: Gutha Jaya Krishna
- Abstract summary: Serial-parallel redundancy is a reliable way to ensure that services and systems remain available in cloud computing.
When an error occurs, the inactive copy can step in as a backup right away.
This approach is called parallel redundancy, otherwise known as active-active redundancy.
- Score: 2.61072980439312
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Serial-parallel redundancy is a reliable way to ensure that services and systems remain available in cloud computing. The method involves making copies of the same system or program, with only one remaining active. When an error occurs, the inactive copy can step in as a backup right away, which provides continuous performance and uninterrupted operation. This approach is called parallel redundancy, otherwise known as active-active redundancy, and it is an exceptional strategy. It creates duplicates of a system or service that all run at once. Doing so increases fault tolerance, since if one copy fails, the workload can be distributed across any replica that is functioning properly. Reliability allocation depends on the features of a system and on the availability and fault tolerance required of it. Serial or parallel redundancy can be applied to increase the dependability of systems and services. To demonstrate how well this concept works, we studied fixed serial-parallel reliability redundancy allocation problems and then used an innovative hybrid optimization technique to find the best possible allocation for peak dependability. We then compared our findings against other research.
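The allocation problem the abstract describes can be illustrated with a minimal sketch. It assumes the standard textbook model in which subsystems are connected in series, each subsystem holds n_i identical active copies in parallel, and failures are independent, so the system reliability is the product over subsystems of 1 - (1 - r_i)^n_i. The function names, the cost and budget parameters, and the exhaustive search are illustrative assumptions; the abstract does not specify the paper's hybrid optimization technique, and brute force is used here only as a stand-in for small instances.

```python
from itertools import product
from math import prod


def parallel_reliability(r: float, n: int) -> float:
    """Reliability of n identical, independent copies running in parallel.

    The group fails only if every copy fails.
    """
    return 1.0 - (1.0 - r) ** n


def system_reliability(reliabilities, allocation) -> float:
    """Reliability of subsystems in series, each with allocation[i] parallel copies.

    A series system works only if every subsystem works.
    """
    return prod(
        parallel_reliability(r, n) for r, n in zip(reliabilities, allocation)
    )


def best_allocation(reliabilities, costs, budget, max_copies=4):
    """Exhaustive search over redundancy levels under a cost budget.

    Hypothetical stand-in for the paper's hybrid optimizer, which the
    abstract does not describe. Returns (reliability, allocation).
    """
    best_r, best_alloc = 0.0, None
    levels = range(1, max_copies + 1)
    for alloc in product(levels, repeat=len(reliabilities)):
        total_cost = sum(c * n for c, n in zip(costs, alloc))
        if total_cost > budget:
            continue  # allocation violates the cost constraint
        r = system_reliability(reliabilities, alloc)
        if r > best_r:
            best_r, best_alloc = r, alloc
    return best_r, best_alloc


if __name__ == "__main__":
    # Two subsystems with made-up component reliabilities and unit costs.
    r, alloc = best_allocation([0.9, 0.8], costs=[1, 1], budget=3)
    print(alloc, round(r, 4))
```

With a budget of three components, the search spends the spare copy on the weaker subsystem, which is the intuition behind redundancy allocation: extra copies pay off most where component reliability is lowest. The search space grows exponentially with the number of subsystems, which is why metaheuristic or hybrid optimizers are typically used for realistic instances.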
Related papers
- Walrus: An Efficient Decentralized Storage Network [6.053171723478456]
Walrus is a novel decentralized blob storage system that addresses limitations through multiple technical innovations.
RedStuff is a two-dimensional erasure coding protocol that achieves high security with only 4.5x replication factor.
Walrus also introduces a novel multi-stage epoch change protocol that efficiently handles storage node churn.
arXiv Detail & Related papers (2025-05-08T16:06:41Z) - RL-TIME: Reinforcement Learning-based Task Replication in Multicore Embedded Systems [6.184592401883041]
Task replication can improve reliability by duplicating a task's execution to handle transient and permanent faults.
Existing design-time methods typically choose the number of replicas based on worst-case conditions.
We present RL-TIME, a reinforcement learning-based approach that dynamically decides the number of replicas according to actual system conditions.
arXiv Detail & Related papers (2025-03-16T22:31:25Z) - Employing Software Diversity in Cloud Microservices to Engineer Reliable and Performant Systems [2.412158290827225]
This work proposes employing software diversity to enhance system reliability and performance simultaneously.
A cornerstone of our work is the derivation of a reliability metric.
The goal is to maintain a higher replica count for more reliable versions while preserving the diversity of versions as much as possible.
arXiv Detail & Related papers (2024-07-10T00:34:39Z) - Digital Twin-Assisted Data-Driven Optimization for Reliable Edge Caching in Wireless Networks [60.54852710216738]
We introduce a novel digital twin-assisted optimization framework, called D-REC, to ensure reliable caching in nextG wireless networks.
By incorporating reliability modules into a constrained decision process, D-REC can adaptively adjust actions, rewards, and states to comply with advantageous constraints.
arXiv Detail & Related papers (2024-06-29T02:40:28Z) - Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training [0.0]
In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure.
Our failure recovery strategies include traditional checkpointing, chain replication, and a novel stateless parameter server approach.
arXiv Detail & Related papers (2024-06-08T18:31:56Z) - A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks [1.3398445165628463]
This paper provides a comprehensive analysis of fault recovery performance, stability, and recovery time in a cloud-native environment.
Our results indicate that Flink is the most stable and offers some of the best fault recovery.
Kafka Streams shows suitable fault recovery performance and stability, but with higher event latency.
arXiv Detail & Related papers (2024-04-09T10:49:23Z) - On the Role of Server Momentum in Federated Learning [85.54616432098706]
We propose a general framework for server momentum, that (a) covers a large class of momentum schemes that are unexplored in federated learning (FL)
We provide rigorous convergence analysis for the proposed framework.
arXiv Detail & Related papers (2023-12-19T23:56:49Z) - Iterative Sketching for Secure Coded Regression [66.53950020718021]
We propose methods for speeding up distributed linear regression.
Specifically, we randomly rotate the basis of the system of equations and then subsample blocks, to simultaneously secure the information and reduce the dimension of the regression problem.
arXiv Detail & Related papers (2023-08-08T11:10:42Z) - Dual Generator Offline Reinforcement Learning [90.05278061564198]
In offline RL, constraining the learned policy to remain close to the data is essential.
In practice, GAN-based offline RL methods have not performed as well as alternative approaches.
We show that not only does having two generators enable an effective GAN-based offline RL method, but also approximates a support constraint.
arXiv Detail & Related papers (2022-11-02T20:25:18Z) - Learning Mean-Field Control for Delayed Information Load Balancing in Large Queuing Systems [26.405495663998828]
In this work, we consider a multi-agent load balancing system, with delayed information, consisting of many clients (load balancers) and many parallel queues.
We apply policy gradient reinforcement learning algorithms to find an optimal load balancing solution.
Our approach is scalable but also shows good performance when compared to the state-of-the-art power-of-d variant of the Join-the-Shortest-Queue (JSQ)
arXiv Detail & Related papers (2022-08-09T13:47:19Z) - Layer-Wise Partitioning and Merging for Efficient and Scalable Deep Learning [16.38731019298993]
We have proposed a novel layer-wise partitioning and merging, forward and backward pass parallel framework to provide better training performance.
The experimental evaluation on real use cases shows that the proposed method outperforms the state-of-the-art approaches in terms of training speed.
arXiv Detail & Related papers (2022-07-22T11:47:34Z) - An Efficient Asynchronous Method for Integrating Evolutionary and Gradient-based Policy Search [76.73477450555046]
We introduce an Asynchronous Evolution Strategy-Reinforcement Learning (AES-RL) that maximizes the parallel efficiency of ES and integrates it with policy gradient methods.
Specifically, we propose 1) a novel framework to merge ES and DRL asynchronously and 2) various asynchronous update methods that can take full advantage of asynchronism, ES, and DRL.
arXiv Detail & Related papers (2020-12-10T02:30:48Z) - Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.