ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
- URL: http://arxiv.org/abs/2407.20143v2
- Date: Thu, 10 Oct 2024 12:29:09 GMT
- Title: ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
- Authors: Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu,
- Abstract summary: ByteCheckpoint is an industrial-grade checkpointing system for large-scale LFM training.
ByteCheckpoint significantly reduces checkpoint stalls, achieving an average reduction of 54.20x.
For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
- Score: 9.13331802151585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism to another. In production, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high-performance checkpointing system is needed to enable efficient checkpoint management at scale. This paper presents ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint employs a parallelism-agnostic checkpoint representation that enables efficient load-time checkpoint resharding. ByteCheckpoint advocates a generic checkpoint saving/loading workflow to accommodate multiple training frameworks and support different storage backends. To ensure high I/O efficiency, we take a full-stack approach to optimize saving/loading plan generation, critical stages of checkpointing pipelines, and irregular tensor processing required by resharding. To guarantee the scalability of ByteCheckpoint in large-scale training, we enhance the storage system to efficiently handle high volumes of checkpointing I/O requests, devise communication optimizations within the checkpointing workflow, and introduce a suite of monitoring tools to analyze performance and detect bottlenecks. Compared to existing open-source checkpointing systems [40, 46], ByteCheckpoint significantly reduces runtime checkpoint stalls, achieving an average reduction of 54.20x. For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
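The abstract does not include code; as a rough sketch of what a parallelism-agnostic checkpoint representation with load-time resharding can look like, each rank could save its shard together with metadata locating that shard in the global tensor, so a loader can reassemble and re-slice for a different parallelism. The names (ShardMeta, save_shard, reshard_load) are illustrative, not ByteCheckpoint's API.
```python
# Illustrative sketch only: a parallelism-agnostic shard format in which each
# rank records where its slice lives in the global tensor, enabling load-time
# resharding. Not ByteCheckpoint's actual API.
from dataclasses import dataclass
import torch

@dataclass
class ShardMeta:
    name: str            # fully qualified tensor name, e.g. "layers.0.attn.weight"
    global_shape: tuple  # shape of the full, unsharded tensor
    offset: tuple        # start index of this shard along each dimension

def save_shard(tensor_slice: torch.Tensor, meta: ShardMeta, path: str) -> None:
    # Each rank persists only its own slice plus placement metadata; no rank
    # needs to know the parallelism configuration used by any other rank.
    torch.save({"meta": meta, "data": tensor_slice}, path)

def reshard_load(shard_paths, name: str, wanted_slice) -> torch.Tensor:
    """Reassemble tensor `name` from saved shards and return the slice this
    rank needs under the *new* parallelism. A real system would read only the
    shards overlapping `wanted_slice`; this sketch reads everything."""
    shards = [torch.load(p) for p in shard_paths]
    shards = [s for s in shards if s["meta"].name == name]
    full = torch.empty(shards[0]["meta"].global_shape, dtype=shards[0]["data"].dtype)
    for s in shards:
        idx = tuple(slice(o, o + n) for o, n in zip(s["meta"].offset, s["data"].shape))
        full[idx] = s["data"]
    return full[wanted_slice]
```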
Related papers
- Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging [2.9761595094633435]
Checkpoint merging is a technique for combining multiple model snapshots into a single superior model.
This paper explores checkpoint merging in the context of parameter-efficient fine-tuning.
We propose Metrics-Weighted Averaging (MWA) to merge model checkpoints by weighting their parameters according to performance metrics.
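As an illustration of the idea (not the paper's implementation, and with an assumed normalization of the metric scores into weights), MWA can be sketched as a weighted sum over checkpoint state dicts:
```python
# Sketch of metrics-weighted averaging (MWA). The normalization of raw metric
# scores into weights is an assumption, not necessarily the paper's scheme.
import torch

def metrics_weighted_average(state_dicts, metrics):
    """Merge checkpoints by averaging parameters, weighted by each
    checkpoint's performance metric (higher metric -> larger weight)."""
    total = float(sum(metrics))
    weights = [m / total for m in metrics]
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged
```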
arXiv Detail & Related papers (2025-04-23T05:11:21Z)
- Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training [16.04816181826873]
Existing checkpointing approaches seem ill-suited for distributed training.
We propose Universal Checkpointing, a technique that enables efficient checkpoint creation.
Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures.
arXiv Detail & Related papers (2024-06-27T01:28:30Z)
- FastPersist: Accelerating Model Checkpointing in Deep Learning [21.308403847800573]
We propose FastPersist to accelerate checkpoint creation in Deep Learning (DL) training.
FastPersist combines three novel techniques: (i) optimizations for faster checkpoint writes to persistent storage, (ii) efficient write parallelism across the storage devices available in training environments, and (iii) overlapping checkpointing with independent training computations.
Our evaluation shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.
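A common way to realize the overlap in (iii), shown here as a hedged sketch rather than FastPersist's implementation, is to snapshot the model state to host memory on the critical path and hand the slow write to a background thread:
```python
# Sketch of overlapping checkpoint persistence with training (an assumed
# realization of technique (iii), not FastPersist's code): snapshot parameters
# to host memory quickly, then write to storage in a background thread while
# subsequent training iterations proceed.
import threading
import torch

def async_checkpoint(model: torch.nn.Module, path: str) -> threading.Thread:
    # Fast, blocking part: copy the state dict to CPU memory.
    cpu_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    # Slow part: persist to storage off the training critical path.
    writer = threading.Thread(target=torch.save, args=(cpu_state, path))
    writer.start()
    return writer  # call writer.join() before the next save or at shutdown
```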
arXiv Detail & Related papers (2024-06-19T18:31:23Z)
- ServerlessLLM: Low-Latency Serverless Inference for Large Language Models [14.754839787728912]
ServerlessLLM is a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs).
By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage.
Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems.
arXiv Detail & Related papers (2024-01-25T17:55:07Z)
- Free Lunch: Robust Cross-Lingual Transfer via Model Checkpoint Averaging [60.79382212029304]
Massively multilingual language models have displayed strong performance in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer setups.
We propose a simple and effective method that averages different checkpoints (i.e., model snapshots) during task fine-tuning.
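A minimal sketch of the general recipe (which snapshots to average, and when, is the paper's contribution and is omitted here): keep a running average of the snapshots taken during task fine-tuning and evaluate with the averaged weights.
```python
# Minimal running average over fine-tuning snapshots (a sketch of the general
# idea only; the paper's selection of snapshots is not reproduced here).
import torch

class CheckpointAverager:
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, state_dict):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.detach().float().clone() for k, v in state_dict.items()}
        else:
            for k, v in state_dict.items():
                # Incremental mean: avg += (x - avg) / n
                self.avg[k] += (v.detach().float() - self.avg[k]) / self.n

    def averaged_state_dict(self):
        return self.avg
```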
arXiv Detail & Related papers (2023-05-26T11:24:32Z)
- PointFlowHop: Green and Interpretable Scene Flow Estimation from Consecutive Point Clouds [49.7285297470392]
An efficient 3D scene flow estimation method called PointFlowHop is proposed in this work.
PointFlowHop takes two consecutive point clouds and determines the 3D flow vectors for every point in the first point cloud.
It decomposes the scene flow estimation task into a set of subtasks, including ego-motion compensation, object association and object-wise motion estimation.
arXiv Detail & Related papers (2023-02-27T23:06:01Z)
- Asyncval: A Toolkit for Asynchronously Validating Dense Retriever Checkpoints during Training [26.053028706793587]
A simple strategy for validating deep learning checkpoints is to add validation loops that execute during training.
Validating dense retriever (DR) checkpoints is not as trivial, and adding validation loops is not efficient.
We propose Asyncval: a Python-based toolkit for efficiently validating DR checkpoints during training.
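A hypothetical sketch of the pattern Asyncval automates (not its actual API): a separate process watches the checkpoint directory and evaluates new checkpoints as they appear, so the training loop never blocks on validation.
```python
# Hypothetical sketch of asynchronous checkpoint validation (not Asyncval's
# actual API): poll the checkpoint directory from a separate process and
# evaluate each new checkpoint with a user-supplied evaluation function.
import glob
import time

def watch_and_validate(ckpt_dir: str, evaluate, poll_secs: int = 60) -> None:
    seen = set()
    while True:
        for path in sorted(glob.glob(f"{ckpt_dir}/*.pt")):
            if path not in seen:
                seen.add(path)
                score = evaluate(path)  # e.g. builds an index and runs retrieval metrics
                print(f"{path}: {score:.4f}")
        time.sleep(poll_secs)
```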
arXiv Detail & Related papers (2022-02-25T06:07:58Z)
- An Adaptive Framework for Learning Unsupervised Depth Completion [59.17364202590475]
We present a method to infer a dense depth map from a color image and associated sparse depth measurements.
We show that regularization and co-visibility are related via the fitness of the model to data and can be unified into a single framework.
arXiv Detail & Related papers (2021-06-06T02:27:55Z)
- Parameter-Efficient Transfer Learning with Diff Pruning [108.03864629388404]
diff pruning is a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework.
We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark.
arXiv Detail & Related papers (2020-12-14T12:34:01Z)
- Ranking Neural Checkpoints [57.27352551718646]
This paper is concerned with ranking pre-trained deep neural networks (DNNs) for the transfer learning to a downstream task.
We establish a neural checkpoint ranking benchmark (NeuCRaB) and study some intuitive ranking measures.
Our results suggest that the linear separability of the features extracted by the checkpoints is a strong indicator of transferability.
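That signal lends itself to a simple probe; the sketch below is an illustration of the idea, not NeuCRaB's exact protocol, and assumes features have already been extracted by the checkpoint's encoder.
```python
# Illustrative linear-separability probe (not NeuCRaB's exact measure): score a
# checkpoint by how well a linear classifier separates the features it extracts
# for the downstream task. Higher accuracy suggests better transferability.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_separability_score(features: np.ndarray, labels: np.ndarray) -> float:
    """features: (n_samples, dim) extracted by the checkpoint's encoder;
    labels: (n_samples,) downstream task labels."""
    clf = LogisticRegression(max_iter=1000)
    return float(np.mean(cross_val_score(clf, features, labels, cv=3)))
```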
arXiv Detail & Related papers (2020-11-23T04:05:46Z)
- Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models [5.604501524927757]
We present Check-N-Run, a scalable checkpointing system for training large machine learning models at Facebook.
Check-N-Run uses two primary techniques to address the size and bandwidth challenges.
These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models.
arXiv Detail & Related papers (2020-10-17T00:45:55Z)
- On Efficient Constructions of Checkpoints [21.965296582303115]
We propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint).
LC-Checkpoint simultaneously maximizes the compression rate and optimizes the recovery speed.
Our experiments show that LC-Checkpoint achieves a compression rate of up to 28x and a recovery speedup of up to 5.77x over a state-of-the-art algorithm (SCAR).
arXiv Detail & Related papers (2020-09-28T01:20:15Z)
- Tracking Performance of Online Stochastic Learners [57.14673504239551]
Online algorithms are popular in large-scale learning settings due to their ability to compute updates on the fly, without the need to store and process data in large batches.
When a constant step-size is used, these algorithms also have the ability to adapt to drifts in problem parameters, such as data or model properties, and track the optimal solution with reasonable accuracy.
We establish a link between steady-state performance derived under stationarity assumptions and the tracking performance of online learners under random walk models.
arXiv Detail & Related papers (2020-04-04T14:16:27Z)
- Key Points Estimation and Point Instance Segmentation Approach for Lane Detection [65.37887088194022]
We propose a traffic line detection method called Point Instance Network (PINet).
The PINet includes several stacked hourglass networks that are trained simultaneously.
PINet achieves competitive accuracy and false positive rates on the TuSimple and CULane datasets.
arXiv Detail & Related papers (2020-02-16T15:51:30Z)