Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
- URL: http://arxiv.org/abs/2406.18820v2
- Date: Fri, 28 Jun 2024 02:33:11 GMT
- Title: Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
- Authors: Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang
- Abstract summary: Existing checkpointing approaches seem ill-suited for distributed training.
We propose Universal Checkpointing, a technique that enables efficient checkpoint creation.
Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures.
- Score: 16.04816181826873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategy and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.
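The save-distributed / load-consolidated life cycle described in the abstract can be illustrated with a short, self-contained sketch. The snippet below is not the DeepSpeed Universal Checkpointing API; `save_rank_shard`, `consolidate`, and `reslice` are hypothetical helpers, and the fragment metadata is reduced to a single concatenation axis, whereas the real universal checkpoint format records a much richer mapping of parameter fragments to training ranks.
```python
# Minimal sketch of the universal-checkpoint idea, assuming hypothetical helper names:
# 1) each rank saves only the shard it owns (fast, distributed save),
# 2) an offline pass consolidates the shards into one full tensor per parameter,
# 3) any new parallelism degree re-slices the consolidated tensors on load.
import os
import torch

def save_rank_shard(shard: dict, rank: int, path: str) -> None:
    """Distributed save: each rank writes only the parameter fragments it owns."""
    os.makedirs(path, exist_ok=True)
    torch.save({name: t.cpu() for name, t in shard.items()},
               f"{path}/shard_rank{rank}.pt")

def consolidate(path: str, world_size: int, shard_dim: int = 0) -> dict:
    """Offline conversion into the consolidated (universal) representation:
    one full tensor per parameter. Here the fragment metadata is simply the
    concatenation order along `shard_dim`."""
    shards = [torch.load(f"{path}/shard_rank{r}.pt") for r in range(world_size)]
    return {name: torch.cat([s[name] for s in shards], dim=shard_dim)
            for name in shards[0]}

def reslice(universal: dict, new_world_size: int, rank: int,
            shard_dim: int = 0) -> dict:
    """Load under an arbitrary new model-parallel degree by re-slicing each
    consolidated parameter for the requesting rank."""
    return {name: torch.chunk(t, new_world_size, dim=shard_dim)[rank].contiguous()
            for name, t in universal.items()}

# Example: a checkpoint written by 4 tensor-parallel ranks is resumed on 2 ranks.
full_model = {"linear.weight": torch.randn(8, 4)}
for r in range(4):                                  # pretend each rank held 1/4 of the rows
    save_rank_shard(reslice(full_model, 4, r), r, "/tmp/ucp_demo")
universal = consolidate("/tmp/ucp_demo", world_size=4)
rank0_params = reslice(universal, new_world_size=2, rank=0)  # 4 of the 8 rows
```
The point of the split is that the expensive consolidation happens offline, outside the training critical path, so saving remains as fast as a purely distributed checkpoint while loading can target any parallelism configuration.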
Related papers
- Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging [2.9761595094633435]
Checkpoint merging is a technique for combining multiple model snapshots into a single superior model.
This paper explores checkpoint merging in the context of parameter-efficient fine-tuning.
We propose Metrics-Weighted Averaging (MWA) to merge model checkpoints by weighting their parameters according to performance metrics.
arXiv Detail & Related papers (2025-04-23T05:11:21Z) - Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval [62.904384887568284]
Asymmetric retrieval is a typical scenario in real-world retrieval systems.
We propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity.
arXiv Detail & Related papers (2025-04-16T08:59:47Z) - Test-Time Alignment via Hypothesis Reweighting [56.71167047381817]
Large pretrained models often struggle with underspecified tasks.
We propose a novel framework to address the challenge of aligning models to test-time user intent.
arXiv Detail & Related papers (2024-12-11T23:02:26Z) - If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs [48.95875673503714]
We study merging "generalist" models trained on many tasks.
Our algorithm tunes the weight of each checkpoint in a linear combination, resulting in an optimal model.
Good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
arXiv Detail & Related papers (2024-12-05T13:12:51Z) - Self-Supervised Any-Point Tracking by Contrastive Random Walks [17.50529887238381]
We train a global matching transformer to find cycle consistent tracks through video via contrastive random walks.
Our method achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised tracking methods.
arXiv Detail & Related papers (2024-09-24T17:59:56Z) - MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training [4.4345088842995395]
We propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems.
MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts.
We build MoC-System upon the Megatron-DeepSpeed framework, achieving up to a 98.9% reduction in overhead for each checkpointing process.
arXiv Detail & Related papers (2024-08-08T08:40:15Z) - ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development [9.13331802151585]
ByteCheckpoint is an industrial-grade checkpointing system for large-scale LFM training.
ByteCheckpoint significantly reduces checkpoint stalls, achieving an average reduction of 54.20x.
For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
arXiv Detail & Related papers (2024-07-29T16:18:20Z) - Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z) - ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping, and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z) - Submodel Partitioning in Hierarchical Federated Learning: Algorithm
Design and Convergence Analysis [15.311309249848739]
Hierarchical federated learning (HFL) has demonstrated promising scalability advantages over the traditional "star-topology" architecture-based federated learning (FL).
In this paper, we propose hierarchical independent submodel training (HIST) over constrained Internet of Things (IoT) networks.
The key idea behind HIST is a hierarchical version of model partitioning: in each round we partition the global model into disjoint submodels and distribute them across different cells.
arXiv Detail & Related papers (2023-10-27T04:42:59Z) - DiffComplete: Diffusion-based Generative 3D Shape Completion [114.43353365917015]
We introduce a new diffusion-based approach for shape completion on 3D range scans.
We strike a balance between realism, multi-modality, and high fidelity.
DiffComplete sets a new SOTA performance on two large-scale 3D shape completion benchmarks.
arXiv Detail & Related papers (2023-06-28T16:07:36Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - Efficient Implementation of a Multi-Layer Gradient-Free Online-Trainable
Spiking Neural Network on FPGA [0.31498833540989407]
ODESA is the first network to have end-to-end multi-layer online local supervised training without using gradients.
This research shows that the network architecture and the online training of weights and thresholds can be implemented efficiently on a large scale in hardware.
arXiv Detail & Related papers (2023-05-31T00:34:15Z) - Free Lunch: Robust Cross-Lingual Transfer via Model Checkpoint Averaging [60.79382212029304]
Massively multilingual language models have displayed strong performance in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer setups.
We propose a simple and effective method that averages different checkpoints (i.e., model snapshots) during task fine-tuning.
arXiv Detail & Related papers (2023-05-26T11:24:32Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks [27.634123904734615]
We build an automatic parallelism framework named TAPAS that eliminates redundant search efforts.
TAPAS employs a divide-and-conquer approach that efficiently folds the search space by identifying unique substructures.
Our evaluations demonstrate that TAPAS outperforms state-of-the-art automatic parallelism frameworks by up to $160\times$ in search speed.
arXiv Detail & Related papers (2023-02-01T05:22:28Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - DistIR: An Intermediate Representation and Simulator for Efficient
Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses.
We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
arXiv Detail & Related papers (2021-11-09T21:32:51Z) - On Model Calibration for Long-Tailed Object Detection and Instance
Segmentation [56.82077636126353]
We propose NorCal, Normalized Calibration for long-tailed object detection and instance segmentation.
We show that separately handling the background class and normalizing the scores over classes for each proposal are keys to achieving superior performance.
arXiv Detail & Related papers (2021-07-05T17:57:20Z) - Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z) - Check-N-Run: A Checkpointing System for Training Deep Learning
Recommendation Models [5.604501524927757]
We present Check-N-Run, a scalable checkpointing system for training large machine learning models at Facebook.
Check-N-Run uses two primary techniques to address the size and bandwidth challenges.
These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models.
arXiv Detail & Related papers (2020-10-17T00:45:55Z) - UniT: Unified Knowledge Transfer for Any-shot Object Detection and
Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision.
arXiv Detail & Related papers (2020-06-12T22:45:47Z) - A Linear Algebraic Approach to Model Parallelism in Deep Learning [0.0]
Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity.
We propose a linear-algebraic approach to model parallelism in deep learning, which allows parallel distribution of any tensor in the DNN.
We build distributed DNN layers using these parallel primitives, composed with sequential layer implementations, and demonstrate their application by building and training a distributed DNN using DistDL, a PyTorch and MPI-based distributed deep learning toolkit.
arXiv Detail & Related papers (2020-06-04T19:38:05Z) - Large-Scale Gradient-Free Deep Learning with Recursive Local
Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.