TrainMover: An Interruption-Resilient and Reliable ML Training Runtime
- URL: http://arxiv.org/abs/2412.12636v2
- Date: Sat, 26 Apr 2025 13:44:28 GMT
- Title: TrainMover: An Interruption-Resilient and Reliable ML Training Runtime
- Authors: ChonLam Lao, Minlan Yu, Aditya Akella, Jiamin Cao, Yu Guan, Pengcheng Zhang, Zhilong Zheng, Yichi Xu, Ennan Zhai, Dennis Cai, Jiaqi Gao
- Abstract summary: TrainMover is a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing.
- Score: 16.38937239546935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpointing or runtime reconfiguration suffer from long downtimes, degraded performance, or undesired changes to training strategies. We present TrainMover, a resilient runtime that leverages standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces two key techniques: two-phase, delta-based communication group setups and communication-free sandboxed shadow iterations. Our evaluation shows that TrainMover consistently achieves second-level downtime across all evaluated models during migration, maintaining 99% training efficiency during periodic 10-minute rebalancing. We also demonstrate the effectiveness of TrainMover in handling various interruptions.
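To make the first of the two techniques more concrete, below is a minimal, hypothetical sketch of the delta idea behind a two-phase communication group setup: when a standby machine replaces a failed one, only the communication groups that contain the failed rank need to be rebuilt, while every other group is left untouched. The group layout, helper names, and rank arithmetic are illustrative assumptions, not TrainMover's code; the sandboxed shadow iterations are orthogonal and not modeled here.

```python
# Conceptual sketch of a delta-based communication group update when a
# standby machine replaces a failed one. NOT TrainMover's implementation
# or API: real systems would rebuild NCCL communicators, not Python sets.

def build_groups(ranks, dp_size, pp_size):
    """Partition ranks into data-parallel and pipeline-parallel groups."""
    dp_groups = [frozenset(ranks[i::pp_size]) for i in range(pp_size)]
    pp_groups = [frozenset(ranks[i * pp_size:(i + 1) * pp_size])
                 for i in range(dp_size)]
    return set(dp_groups) | set(pp_groups)

def delta_update(old_groups, failed_rank, standby_rank):
    """Phase 1 (background): find only the groups containing the failed
    rank and prepare replacements with the standby rank substituted.
    Phase 2 (brief pause): swap in just those rebuilt groups; every other
    group keeps its existing communicator, which keeps downtime short."""
    affected = {g for g in old_groups if failed_rank in g}
    rebuilt = {frozenset(standby_rank if r == failed_rank else r for r in g)
               for g in affected}
    return (old_groups - affected) | rebuilt, affected

if __name__ == "__main__":
    groups = build_groups(list(range(8)), dp_size=2, pp_size=4)
    new_groups, affected = delta_update(groups, failed_rank=5, standby_rank=8)
    print(f"rebuilt {len(affected)} of {len(groups)} groups")  # rebuilt 2 of 8 groups
```

In this toy 2x4 layout, replacing one machine touches only its data-parallel and pipeline-parallel groups, which is the intuition behind rebuilding a delta instead of tearing down the whole job's communication state.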
Related papers
- Alchemist: Towards the Design of Efficient Online Continual Learning System [15.224901317189728]
We propose Alchemist, to the best of our knowledge, the first online continual learning system that efficiently reuses serving activations to increase training throughput.
Alchemist significantly increases training throughput by up to 1.72x, reduces memory usage during training by up to 47%, and supports up to 2x more training tokens.
arXiv Detail & Related papers (2025-03-03T00:14:34Z)
- Exploring the Benefit of Activation Sparsity in Pre-training [117.25661020250658]
We study how activation properties change during pre-training.
We propose Switchable Sparse-Dense Learning (SSD)
SSD achieves comparable performance with identical model size and reduces pre-training costs.
arXiv Detail & Related papers (2024-10-04T13:53:33Z)
- ProTrain: Efficient LLM Training via Memory-Aware Techniques [18.30799115938978]
This paper proposes ProTrain, a novel training system that balances memory usage and performance by coordinating memory, computation, and IO.
ProTrain improves training throughput by 1.43x to 2.71x compared to the SOTA training systems.
arXiv Detail & Related papers (2024-06-12T15:40:06Z)
- Unicron: Economizing Self-Healing LLM Training at Scale [43.59768821780751]
We introduce Unicron, a workload manager for efficient self-healing in large-scale language model training.
Unicron minimizes failure-related costs across multiple concurrent tasks within a cluster.
It demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods.
arXiv Detail & Related papers (2023-12-30T04:06:16Z)
- Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- TRANSOM: An Efficient Fault-Tolerant System for Training LLMs [7.831906758749453]
Large language models (LLMs) with hundreds of billions or trillions of parameters, exemplified by ChatGPT, have had a profound impact on various fields.
Training LLMs with super-large-scale parameters requires large high-performance GPU clusters and long training periods lasting for months.
To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system.
arXiv Detail & Related papers (2023-10-16T04:06:52Z)
- Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
- Curriculum-based Asymmetric Multi-task Reinforcement Learning [14.5357225087828]
We introduce CAMRL, the first curriculum-based asymmetric multi-task learning (AMTL) algorithm for dealing with multiple reinforcement learning (RL) tasks altogether.
To mitigate the negative influence of customizing the one-off training order in curriculum-based AMTL, CAMRL switches its training mode between parallel single-task RL and asymmetric multi-task RL (MTRL).
We have conducted experiments on a wide range of benchmarks in multi-task RL, covering Gym-minigrid, Meta-world, Atari video games, vision-based PyBullet tasks, and RLBench.
arXiv Detail & Related papers (2022-11-07T08:05:13Z)
- The Right to be Forgotten in Federated Learning: An Efficient Realization with Rapid Retraining [22.16510303054159]
We propose a rapid retraining approach to fully erase data samples from a trained FL model.
Our formal convergence and complexity analyses demonstrate that our design can preserve model utility with high efficiency.
arXiv Detail & Related papers (2022-03-14T17:22:40Z)
- Efficient Device Scheduling with Multi-Job Federated Learning [64.21733164243781]
We propose a novel multi-job Federated Learning framework to enable the parallel training process of multiple jobs.
We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost.
Our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).
arXiv Detail & Related papers (2021-12-11T08:05:11Z)
- Continuous Transition: Improving Sample Efficiency for Continuous Control Problems via MixUp [119.69304125647785]
This paper introduces a concise yet powerful method to construct Continuous Transition.
Specifically, we propose to synthesize new transitions for training by linearly interpolating the consecutive transitions.
To keep the constructed transitions authentic, we also develop a discriminator to guide the construction process automatically.
arXiv Detail & Related papers (2020-11-30T01:20:23Z)
- How Important is the Train-Validation Split in Meta-Learning? [155.5088631672781]
A common practice in meta-learning is to perform a train-validation split (the "train-val" method), where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split.
Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice.
We show that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.
arXiv Detail & Related papers (2020-10-12T16:48:42Z)
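As a concrete illustration of the comparison in the entry above (a hypothetical toy setup, not the paper's experimental protocol), the sketch below contrasts the two inner-loop objectives: "train-val" adapts on one split of a task's data and evaluates on the other, while "train-train" adapts and evaluates on the same data. Closed-form ridge regression stands in for the task learner.

```python
# Toy illustration (not the paper's code) of the train-val vs. train-train
# inner-loop objectives in meta-learning. Ridge regression is a stand-in
# task learner; the data and split sizes are arbitrary assumptions.
import numpy as np

def adapt(X, y, lam=0.1):
    """Inner-loop adaptation: closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def outer_loss(X, y, method):
    n = len(y) // 2
    if method == "train-val":
        w = adapt(X[:n], y[:n])                          # adapt on support split
        return float(np.mean((X[n:] @ w - y[n:]) ** 2))  # score on query split
    w = adapt(X, y)                                      # train-train: adapt on all data
    return float(np.mean((X @ w - y) ** 2))              # score on the same data

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=20)
print("train-val: ", outer_loss(X, y, "train-val"))
print("train-train:", outer_loss(X, y, "train-train"))
```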
This list is automatically generated from the titles and abstracts of the papers on this site.