Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
- URL: http://arxiv.org/abs/2406.18820v3
- Date: Fri, 04 Jul 2025 06:16:30 GMT
- Title: Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
- Authors: Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang
- Abstract summary: Universal Checkpointing (UCP) is a novel checkpointing system for deep neural network (DNN) training. UCP overcomes challenges in existing systems by decoupling checkpoint structure from parallel training strategies and hardware configurations. We present a pattern-based reconfiguration pipeline that enables automatic, flexible, and efficient mapping of checkpoint state to various parallelism strategies.
- Score: 16.04816181826873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural network (DNN) training continues to scale rapidly in terms of model size, data volume, and sequence length, to the point where multiple machines are required to fit large models for training. Different distributed and parallel training strategies have been developed to support large-scale DNN training by partitioning the training state across GPUs. However, existing DNN training systems provide very limited support for reconfiguring parallelism strategies in the middle of the training via checkpointing. This limitation arises because distributed checkpoints are tightly coupled to specific model parallelism and hardware configurations, preventing large-scale training jobs from efficiently adapting to hardware failures or resource elasticity. This paper presents Universal Checkpointing (UCP), a novel checkpointing system that enables flexible and efficient DNN training with reconfigurable parallelism. UCP overcomes challenges in existing systems by decoupling checkpoint structure from parallel training strategies and hardware configurations. In addition, we present a pattern-based reconfiguration pipeline that enables automatic, flexible, and efficient mapping of checkpoint state to various parallelism strategies. Evaluation on a range of DNN models, including state-of-the-art dense and sparse LLMs, shows that UCP enables reconfiguration for a broader set of widely used parallelism strategies than existing solutions while adding negligible reconfiguration cost. UCP has been successfully employed in real LLM training workloads, greatly enhancing their flexibility and resilience to dynamic hardware environments.
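To make the decoupling idea concrete, here is a minimal sketch of resharding through a parallelism-agnostic intermediate form, using NumPy arrays as stand-ins for model tensors. The function names and the column-wise sharding assumption are illustrative, not UCP's actual implementation.

```python
import numpy as np

def consolidate(shards):
    """Merge per-rank shards {param_name: [shard_0, ..., shard_{n-1}]}
    into one parallelism-agnostic ("universal") checkpoint."""
    return {name: np.concatenate(parts, axis=-1) for name, parts in shards.items()}

def reshard(universal, new_world_size):
    """Re-split the universal checkpoint for a different tensor-parallel degree."""
    return [
        {name: np.array_split(tensor, new_world_size, axis=-1)[rank]
         for name, tensor in universal.items()}
        for rank in range(new_world_size)
    ]

# Example: a checkpoint saved with TP=4 is reloaded with TP=2.
shards = {"mlp.weight": [np.full((8, 4), r, dtype=np.float32) for r in range(4)]}
new_shards = reshard(consolidate(shards), new_world_size=2)
assert new_shards[0]["mlp.weight"].shape == (8, 8)
```

The point of the intermediate ("universal") form is that save-time and load-time parallelism never have to match: the same consolidated state can be re-split for any new world size.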
Related papers
- Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging [2.9761595094633435]
Checkpoint merging is a technique for combining multiple model snapshots into a single superior model.
This paper explores checkpoint merging in the context of parameter-efficient fine-tuning.
We propose Metrics-Weighted Averaging (MWA) to merge model checkpoints by weighting their parameters according to performance metrics.
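A minimal sketch of the idea, assuming each checkpoint is a dict of NumPy arrays and `metrics` holds one validation score per checkpoint (higher is better); the names and the score-proportional weighting below are illustrative:

```python
import numpy as np

def metrics_weighted_average(checkpoints, metrics):
    # Normalize the per-checkpoint scores into mixing weights that sum to 1.
    weights = np.asarray(metrics, dtype=np.float64)
    weights /= weights.sum()
    # Merge parameter-by-parameter as a convex combination of the checkpoints.
    return {name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
            for name in checkpoints[0]}

ckpts = [{"w": np.full(3, 1.0)}, {"w": np.full(3, 3.0)}]
merged = metrics_weighted_average(ckpts, metrics=[0.25, 0.75])
assert np.allclose(merged["w"], 2.5)  # 0.25*1 + 0.75*3
```

With uniform metrics this reduces to plain checkpoint averaging.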
arXiv Detail & Related papers (2025-04-23T05:11:21Z) - Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval [62.904384887568284]
Asymmetric retrieval is a typical scenario in real-world retrieval systems. We propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity.
arXiv Detail & Related papers (2025-04-16T08:59:47Z) - Test-Time Alignment via Hypothesis Reweighting [56.71167047381817]
Large pretrained models often struggle with underspecified tasks.
We propose a novel framework to address the challenge of aligning models to test-time user intent.
arXiv Detail & Related papers (2024-12-11T23:02:26Z) - If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs [48.95875673503714]
We study merging "generalist" models trained on many tasks.
Our algorithm tunes the weight of each checkpoint in a linear combination, resulting in an optimal model.
Good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
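A hedged sketch of tuning the linear-combination weights against a validation score; the paper's actual optimizer may differ, and random search over the simplex is just one simple stand-in:

```python
import numpy as np

def merge(checkpoints, weights):
    # Linear combination of checkpoints, parameter by parameter.
    return {k: sum(w * c[k] for w, c in zip(weights, checkpoints))
            for k in checkpoints[0]}

def tune_merge(checkpoints, validate, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(trials):
        w = rng.dirichlet(np.ones(len(checkpoints)))  # random point on the simplex
        score = validate(merge(checkpoints, w))
        if score > best_score:
            best_w, best_score = w, score
    return merge(checkpoints, best_w), best_w
```

The observation above, that good merges keep nearly all checkpoints at non-zero weight, would show up here as `best_w` having few entries near zero.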
arXiv Detail & Related papers (2024-12-05T13:12:51Z) - Self-Supervised Any-Point Tracking by Contrastive Random Walks [17.50529887238381]
We train a global matching transformer to find cycle consistent tracks through video via contrastive random walks.
Our method achieves strong performance on the TAP-Vid benchmarks, outperforming previous self-supervised tracking methods.
arXiv Detail & Related papers (2024-09-24T17:59:56Z) - MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training [4.4345088842995395]
We propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems.
MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts.
We build MoC-System upon the Megatron-DeepSpeed framework, achieving up to a 98.9% reduction in overhead for each checkpointing process.
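A minimal sketch of what a PEC-style save could look like: every checkpoint writes the dense (non-expert) state in full plus only a rotating subset of experts, so each expert is persisted every few checkpoints. The state-dict naming convention is an assumption for illustration:

```python
def partial_expert_checkpoint(state, step, num_experts, experts_per_save):
    # Rotate which experts get saved so all of them are covered over time.
    start = (step * experts_per_save) % num_experts
    keep = {(start + i) % num_experts for i in range(experts_per_save)}

    def expert_id(name):
        # Assumed convention: expert params are named "...expert_<id>...".
        return int(name.split("expert_")[1].split(".")[0])

    return {name: tensor for name, tensor in state.items()
            if "expert_" not in name or expert_id(name) in keep}
```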
arXiv Detail & Related papers (2024-08-08T08:40:15Z) - ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development [9.13331802151585]
ByteCheckpoint is an industrial-grade checkpointing system for large-scale LFM training.
ByteCheckpoint significantly reduces checkpoint stalls, achieving an average reduction of 54.20x.
For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
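Asynchronous persistence is one common way such systems hide checkpoint stalls: the training loop pays only for an in-memory snapshot, and serialization happens on a background thread. A minimal sketch with illustrative names, not ByteCheckpoint's actual API:

```python
import copy
import pickle
import threading

def async_save(state, path):
    snapshot = copy.deepcopy(state)  # the only stall: an in-memory copy
    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)  # slow I/O runs off the critical path
    thread = threading.Thread(target=_write, daemon=True)
    thread.start()
    return thread  # join() before the next save (or at exit) for durability
```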
arXiv Detail & Related papers (2024-07-29T16:18:20Z) - Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens of the original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z) - ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
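A minimal sketch of the model-swapping idea: a model too large for host memory is executed by streaming one layer's weights at a time from disk during the forward pass. The loader and the linear+ReLU layers are illustrative stand-ins, not ATOM's actual interfaces:

```python
import numpy as np

def streamed_forward(layer_files, x, load=np.load):
    for path in layer_files:            # weights live on disk, not in RAM
        weights = load(path)            # swap one layer in
        x = np.maximum(x @ weights, 0)  # apply it (here: linear + ReLU)
        del weights                     # release it before loading the next
    return x
```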
arXiv Detail & Related papers (2024-03-15T17:43:43Z) - Submodel Partitioning in Hierarchical Federated Learning: Algorithm Design and Convergence Analysis [15.311309249848739]
Hierarchical federated learning (HFL) has demonstrated promising scalability advantages over traditional "star-topology" architecture-based federated learning (FL).
In this paper, we propose hierarchical independent submodel training (HIST) over constrained Internet of Things (IoT) devices.
The key idea behind HIST is a hierarchical version of model partitioning, where we partition the global model into disjoint submodels in each round and distribute them across different cells.
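As a concrete illustration of the partitioning step, here is a minimal sketch that splits a global parameter list into disjoint submodels, one per cell, rotating the assignment across rounds; the round-robin scheme is an illustrative choice, not necessarily the paper's:

```python
def partition_submodels(param_names, num_cells, round_idx):
    # Disjoint assignment: every parameter goes to exactly one cell this round.
    assignment = {cell: [] for cell in range(num_cells)}
    for i, name in enumerate(param_names):
        # Shift by round_idx so each cell eventually trains every submodel.
        assignment[(i + round_idx) % num_cells].append(name)
    return assignment
```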
arXiv Detail & Related papers (2023-10-27T04:42:59Z) - DiffComplete: Diffusion-based Generative 3D Shape Completion [114.43353365917015]
We introduce a new diffusion-based approach for shape completion on 3D range scans.
We strike a balance between realism, multi-modality, and high fidelity.
DiffComplete sets new state-of-the-art performance on two large-scale 3D shape completion benchmarks.
arXiv Detail & Related papers (2023-06-28T16:07:36Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
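A hedged sketch of stage (1): score every location in every frame against the query feature and take a per-frame argmax, independently of the other frames. Shapes and the plain dot-product score are simplifications of the actual model:

```python
import numpy as np

def per_frame_matches(frame_feats, query_feat):
    """frame_feats: [T, H, W, C]; query_feat: [C] -> (y, x) per frame as [T, 2]."""
    T, H, W, C = frame_feats.shape
    scores = frame_feats.reshape(T, H * W, C) @ query_feat  # [T, H*W] similarities
    flat = scores.argmax(axis=1)                            # best location per frame
    return np.stack([flat // W, flat % W], axis=1)          # back to (y, x) coords
```

Stage (2) would then refine this coarse per-frame trajectory using local correlations.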
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - Efficient Implementation of a Multi-Layer Gradient-Free Online-Trainable Spiking Neural Network on FPGA [0.31498833540989407]
ODESA is the first network to have end-to-end multi-layer online local supervised training without using gradients.
This research shows that the network architecture and the online training of weights and thresholds can be implemented efficiently on a large scale in hardware.
arXiv Detail & Related papers (2023-05-31T00:34:15Z) - Free Lunch: Robust Cross-Lingual Transfer via Model Checkpoint Averaging [60.79382212029304]
Massively multilingual language models have displayed strong performance in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer setups.
We propose a simple and effective method that averages different checkpoints (i.e., model snapshots) during task fine-tuning.
arXiv Detail & Related papers (2023-05-26T11:24:32Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks [27.634123904734615]
We build an automatic parallelism framework named TAPAS that eliminates redundant search efforts. TAPAS employs a divide-and-conquer approach that efficiently folds the search space by identifying unique substructures. Our evaluations demonstrate that TAPAS outperforms state-of-the-art automatic parallelism frameworks by up to $160\times$ in search speed.
arXiv Detail & Related papers (2023-02-01T05:22:28Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses.
We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
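A minimal sketch of the kind of search a fast simulator enables: enumerate (data, tensor, pipeline) parallel degrees that factor the device count and keep the cheapest simulated plan. The cost model here is a placeholder assumption, standing in for DistIR's simulator:

```python
from itertools import product

def grid_search(world_size, simulate_cost):
    best = None
    for dp, tp, pp in product(range(1, world_size + 1), repeat=3):
        if dp * tp * pp != world_size:
            continue  # keep only valid factorizations of the device count
        cost = simulate_cost(dp, tp, pp)
        if best is None or cost < best[0]:
            best = (cost, (dp, tp, pp))
    return best

# e.g. grid_search(8, lambda dp, tp, pp: 1.0 / dp + 0.3 * tp + 0.5 * pp)
```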
arXiv Detail & Related papers (2021-11-09T21:32:51Z) - On Model Calibration for Long-Tailed Object Detection and Instance
Segmentation [56.82077636126353]
We propose NorCal, Normalized Calibration for long-tailed object detection and instance segmentation.
We show that separately handling the background class and normalizing the scores over classes for each proposal are keys to achieving superior performance.
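A hedged sketch of this kind of post-hoc calibration: foreground scores are down-weighted by training frequency and renormalized among themselves, while the background score stays outside that normalization. Inputs are assumed to be softmax probabilities with background last, and the exponent `gamma` is an assumed hyperparameter:

```python
import numpy as np

def calibrate(scores, class_counts, gamma=0.5):
    """scores: [K+1] per-proposal probabilities, last entry = background."""
    fg, bg = scores[:-1], scores[-1]
    # Penalize head classes in proportion to their training frequency.
    calibrated = fg / (np.asarray(class_counts, dtype=np.float64) ** gamma)
    # Renormalize foreground mass only; background is handled separately.
    calibrated = calibrated / calibrated.sum() * (1.0 - bg)
    return np.append(calibrated, bg)
```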
arXiv Detail & Related papers (2021-07-05T17:57:20Z) - Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
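A minimal sketch of truncated layer-wise backpropagation: each block trains against its own auxiliary head, and activations are detached between blocks so no global backward pass links them, which is what lets the blocks step in parallel on different devices. The tiny architecture and losses are illustrative:

```python
import torch
import torch.nn as nn

blocks = [nn.Linear(32, 32), nn.Linear(32, 32)]
heads = [nn.Linear(32, 10) for _ in blocks]  # one local auxiliary classifier each
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, heads)]

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
for block, head, opt in zip(blocks, heads, opts):
    h_out = torch.relu(block(x))
    loss = nn.functional.cross_entropy(head(h_out), y)  # local, truncated backprop
    opt.zero_grad(); loss.backward(); opt.step()
    x = h_out.detach()  # cut the graph: no gradient flows between blocks
```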
arXiv Detail & Related papers (2020-12-07T16:38:45Z) - Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models [5.604501524927757]
We present Check-N-Run, a scalable checkpointing system for training large machine learning models at Facebook.
Check-N-Run uses two primary techniques to address the size and bandwidth challenges.
These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models.
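A hedged sketch of two levers of this kind as they might apply to a recommendation model's embedding tables: persist only rows touched since the last checkpoint, and quantize what is written. The uint8 min-max scheme and row-tracking interface below are illustrative assumptions, not Check-N-Run's actual design:

```python
import numpy as np

def incremental_checkpoint(embedding_table, touched_rows):
    rows = np.fromiter(sorted(touched_rows), dtype=np.int64)
    chunk = embedding_table[rows]  # persist only rows updated since last save
    lo, hi = float(chunk.min()), float(chunk.max())
    # Min-max quantization to uint8: ~4x smaller than fp32 on the wire.
    q = np.round((chunk - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    return {"rows": rows, "q": q, "lo": lo, "hi": hi}
```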
arXiv Detail & Related papers (2020-10-17T00:45:55Z) - UniT: Unified Knowledge Transfer for Any-shot Object Detection and
Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision.
arXiv Detail & Related papers (2020-06-12T22:45:47Z) - A Linear Algebraic Approach to Model Parallelism in Deep Learning [0.0]
Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity.
We propose a linear-algebraic approach to model parallelism in deep learning, which allows parallel distribution of any tensor in the DNN.
We build distributed DNN layers using these parallel primitives, composed with sequential layer implementations, and demonstrate their application by building and training a distributed DNN using DistDL, a PyTorch and MPI-based distributed deep learning toolkit.
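To ground the idea, here is a minimal sketch of one such parallel primitive: a weight matrix row-partitioned across simulated workers, each computing a partial product, with a sum-reduce standing in for the communication step. The list-of-workers simulation is an illustrative substitute for DistDL's MPI-backed primitives:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))  # full weight, shape [in_features, out_features]
x = rng.standard_normal(16)       # input activation

W_shards = np.split(W, 4, axis=0)  # each of 4 workers owns a block of input rows
x_shards = np.split(x, 4)
partials = [xs @ Ws for xs, Ws in zip(x_shards, W_shards)]  # local matmuls
y = np.sum(partials, axis=0)       # sum-reduce plays the role of communication

assert np.allclose(y, x @ W)       # matches the undistributed computation
```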
arXiv Detail & Related papers (2020-06-04T19:38:05Z) - Large-Scale Gradient-Free Deep Learning with Recursive Local
Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.