Related papers: Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training

Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training

URL: http://arxiv.org/abs/2406.05546v1
Date: Sat, 8 Jun 2024 18:31:56 GMT
Title: Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training
Authors: Ray Cao, Sherry Luo, Steve Gan, Sujeeth Jinesh,
Abstract summary: In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure. Our failure recovery strategies include traditional checkpointing, chain replication, and a novel stateless parameter server approach.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this study, we explore the impact of relaxing data consistency in parallel machine learning training during a failure using various parameter server configurations. Our failure recovery strategies include traditional checkpointing, chain replication (which ensures a backup server takes over in case of failure), and a novel stateless parameter server approach. In the stateless approach, workers continue generating gradient updates even if the parameter server is down, applying these updates once the server is back online. We compare these techniques to a standard checkpointing approach, where the training job is resumed from the latest checkpoint. To assess the resilience and performance of each configuration, we intentionally killed the parameter server during training for each experiment. Our experiment results indicate that the stateless parameter server approach continues to train towards convergence and improves accuracy as much as 10\% in the face of a failure despite using stale weights and gradients. The chain replication and checkpointing techniques demonstrate convergence but suffer from setbacks in accuracy due to restarting from old checkpoints. These results suggest that allowing workers to continue generating updates during server downtime and applying these updates later can effectively improve hardware utilization. Furthermore, despite higher resource usage, the stateless parameter server method incurs similar monetary costs in terms of hardware usage compared to standard checkpointing methods due to the pricing structure of common cloud providers.

Related papers

Tracezip: Efficient Distributed Tracing via Trace Compression [26.353398496686854]
Distributed tracing serves as a fundamental building block in the monitoring and testing of cloud service systems. Head-based sampling indiscriminately selects requests to trace when they enter the system, which may miss critical events. tail-based sampling first captures all requests and then selectively persists the edge-case traces. We propose Tracezip to enhance the efficiency of distributed tracing via trace compression.
arXiv Detail & Related papers (2025-02-10T10:13:57Z)
Blockchain-enabled Trustworthy Federated Unlearning [50.01101423318312]
Federated unlearning is a promising paradigm for protecting the data ownership of distributed clients. Existing works require central servers to retain the historical model parameters from distributed clients. This paper proposes a new blockchain-enabled trustworthy federated unlearning framework.
arXiv Detail & Related papers (2024-01-29T07:04:48Z)
Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks. To mitigate the problem, a line of methods propose to replay the data of experienced tasks when learning new tasks. However, it is not expected in practice considering the memory constraint or data privacy issue. As a replacement, data-free data replay methods are proposed by inverting samples from the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention to collaboratively train a machine learning model without transferring raw data. FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training. We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
Better Generative Replay for Continual Federated Learning [20.57194599280318]
Federated learning is a technique that enables a centralized server to learn from distributed clients via communications. In this paper, we introduce the problem of continual federated learning, where clients incrementally learn new tasks and history data cannot be stored. We propose our FedCIL model with two simple but effective solutions: model consolidation and consistency enforcement.
arXiv Detail & Related papers (2023-02-25T06:26:56Z)
FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training a RL policy in an edge computing digital twin environment. We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
Improving the Robustness of Federated Learning for Severely Imbalanced Datasets [11.498089180181365]
Two common approaches to achieve this distributed learning is synchronous and asynchronous weight update. It has been seen that with an increasing number of worker nodes, the performance degrades drastically. This effect has been studied in the context of extreme imbalanced classification.
arXiv Detail & Related papers (2022-04-28T11:23:42Z)
Acceleration of Federated Learning with Alleviated Forgetting in Local Training [61.231021417674235]
Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy. We propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage. Our experiments demonstrate that FedReg not only significantly improves the convergence rate of FL, especially when the neural network architecture is deep.
arXiv Detail & Related papers (2022-03-05T02:31:32Z)
Byzantine-robust Federated Learning through Spatial-temporal Analysis of Local Model Updates [6.758334200305236]
Federated Learning (FL) enables multiple distributed clients (e.g., mobile devices) to collaboratively train a centralized model while keeping the training data locally on the client. In this paper, we propose to mitigate these failures and attacks from a spatial-temporal perspective. Specifically, we use a clustering-based method to detect and exclude incorrect updates by leveraging their geometric properties in the parameter space.
arXiv Detail & Related papers (2021-07-03T18:48:11Z)
Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients. However, low quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training. We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
Dynamic Parameter Allocation in Parameter Servers [74.250687861348]
We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse. We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
arXiv Detail & Related papers (2020-02-03T11:37:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.