Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training
- URL: http://arxiv.org/abs/2406.05546v1
- Date: Sat, 8 Jun 2024 18:31:56 GMT
- Title: Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training
- Authors: Ray Cao, Sherry Luo, Steve Gan, Sujeeth Jinesh
- Abstract summary: In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure.
Our failure recovery strategies include traditional checkpointing, chain replication, and a novel stateless parameter server approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure using various parameter server configurations. Our failure recovery strategies include traditional checkpointing, chain replication (which ensures a backup server takes over in case of failure), and a novel stateless parameter server approach. In the stateless approach, workers continue generating gradient updates even if the parameter server is down, applying these updates once the server is back online. We compare these techniques to a standard checkpointing approach, where the training job is resumed from the latest checkpoint. To assess the resilience and performance of each configuration, we intentionally killed the parameter server during training for each experiment. Our experimental results indicate that the stateless parameter server approach continues to train towards convergence and improves accuracy by as much as 10% in the face of a failure despite using stale weights and gradients. The chain replication and checkpointing techniques demonstrate convergence but suffer from setbacks in accuracy due to restarting from old checkpoints. These results suggest that allowing workers to continue generating updates during server downtime and applying these updates later can effectively improve hardware utilization. Furthermore, despite higher resource usage, the stateless parameter server method incurs similar monetary costs in terms of hardware usage compared to standard checkpointing methods due to the pricing structure of common cloud providers.
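The stateless idea described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation: the classes `ParameterServer` and `Worker`, the function `train`, the scalar quadratic loss, and the learning rate are all hypothetical choices made only to show a worker buffering gradients computed from stale weights while the server is down, then applying them once it returns.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParameterServer:
    """Toy server holding one scalar weight (hypothetical, for illustration)."""
    weight: float = 0.0
    lr: float = 0.1
    up: bool = True

    def apply(self, grad: float) -> None:
        # Reject updates while the server is "killed".
        if not self.up:
            raise ConnectionError("parameter server is down")
        self.weight -= self.lr * grad


@dataclass
class Worker:
    """Worker that caches the last weight it pulled and buffers unsent gradients."""
    target: float
    cached_weight: float = 0.0
    pending: List[float] = field(default_factory=list)

    def step(self, server: ParameterServer) -> None:
        if server.up:
            # Pull fresh weights only when the server is reachable.
            self.cached_weight = server.weight
        # Gradient of 0.5 * (w - target)^2 at the (possibly stale) cached weight.
        self.pending.append(self.cached_weight - self.target)
        if server.up:
            self.flush(server)

    def flush(self, server: ParameterServer) -> None:
        # Apply every buffered gradient in the order it was produced.
        while self.pending:
            server.apply(self.pending.pop(0))


def train(steps: int = 30, down_from: int = 10, down_until: int = 15,
          target: float = 3.0) -> float:
    """Run the toy loop; the server is down for steps in [down_from, down_until)."""
    server = ParameterServer()
    worker = Worker(target=target)
    for t in range(steps):
        server.up = not (down_from <= t < down_until)
        worker.step(server)
    return server.weight


if __name__ == "__main__":
    print(f"final weight with a simulated outage: {train():.3f}")
```

In this toy setting the buffered gradients are all computed at the same stale weight, so a long outage effectively applies one large step when the server returns. With a small learning rate and a short outage the weight still moves toward the target, but a sufficiently long outage could overshoot; the sketch makes no claim about how the paper handles that case.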
Related papers
- Tracezip: Efficient Distributed Tracing via Trace Compression [26.353398496686854]
Distributed tracing serves as a building block in the monitoring and testing of cloud service systems.
Head-based sampling indiscriminately selects requests to trace when they enter the system, which may miss critical events.
Tail-based sampling traces all requests and selectively persists the edge-case traces.
We propose Tracezip to enhance the efficiency of distributed tracing via trace compression.
arXiv Detail & Related papers (2025-02-10T10:13:57Z)
- Blockchain-enabled Trustworthy Federated Unlearning [50.01101423318312]
Federated unlearning is a promising paradigm for protecting the data ownership of distributed clients.
Existing works require central servers to retain the historical model parameters from distributed clients.
This paper proposes a new blockchain-enabled trustworthy federated unlearning framework.
arXiv Detail & Related papers (2024-01-29T07:04:48Z)
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay the data of experienced tasks when learning new tasks.
However, this is often impractical given memory constraints or data privacy issues.
As a replacement, data-free data replay methods are proposed by inverting samples from the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
- Better Generative Replay for Continual Federated Learning [20.57194599280318]
Federated learning is a technique that enables a centralized server to learn from distributed clients via communications.
In this paper, we introduce the problem of continual federated learning, where clients incrementally learn new tasks and history data cannot be stored.
We propose our FedCIL model with two simple but effective solutions: model consolidation and consistency enforcement.
arXiv Detail & Related papers (2023-02-25T06:26:56Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- Acceleration of Federated Learning with Alleviated Forgetting in Local Training [61.231021417674235]
Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy.
We propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage.
Our experiments demonstrate that FedReg significantly improves the convergence rate of FL, especially when the neural network architecture is deep.
arXiv Detail & Related papers (2022-03-05T02:31:32Z)
- Byzantine-robust Federated Learning through Spatial-temporal Analysis of Local Model Updates [6.758334200305236]
Federated Learning (FL) enables multiple distributed clients (e.g., mobile devices) to collaboratively train a centralized model while keeping the training data locally on the client.
In this paper, we propose to mitigate these failures and attacks from a spatial-temporal perspective.
Specifically, we use a clustering-based method to detect and exclude incorrect updates by leveraging their geometric properties in the parameter space.
arXiv Detail & Related papers (2021-07-03T18:48:11Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- Dynamic Parameter Allocation in Parameter Servers [74.250687861348]
We propose to integrate dynamic parameter allocation into parameter servers and describe an efficient implementation of such a parameter server, called Lapse.
We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
arXiv Detail & Related papers (2020-02-03T11:37:54Z)