Check-N-Run: A Checkpointing System for Training Deep Learning
Recommendation Models
- URL: http://arxiv.org/abs/2010.08679v2
- Date: Tue, 4 May 2021 17:36:01 GMT
- Title: Check-N-Run: A Checkpointing System for Training Deep Learning
Recommendation Models
- Authors: Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere,
Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, Murali
Annavaram
- Abstract summary: We present Check-N-Run, a scalable checkpointing system for training large machine learning models at Facebook.
Check-N-Run uses two primary techniques to address the size and bandwidth challenges.
These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models.
- Score: 5.604501524927757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Checkpoints play an important role in training long-running machine learning
(ML) models. A checkpoint takes a snapshot of an ML model and stores it in
non-volatile memory so that training can recover from failures and keep making
rapid progress. In addition, checkpoints are used for online training to
improve inference prediction accuracy with continuous learning. Given the large
and ever increasing model sizes, checkpoint frequency is often bottlenecked by
the storage write bandwidth and capacity. When checkpoints are maintained on
remote storage, as is the case with many industrial settings, they are also
bottlenecked by network bandwidth. We present Check-N-Run, a scalable
checkpointing system for training large ML models at Facebook. While
Check-N-Run is applicable to long-running ML jobs, we focus on checkpointing
recommendation models, which are currently the largest ML models, reaching
terabytes in size. Check-N-Run uses two primary techniques to address the size and
bandwidth challenges. First, it applies incremental checkpointing, which tracks
and checkpoints the modified part of the model. Incremental checkpointing is
particularly valuable in the context of recommendation models where only a
fraction of the model (stored as embedding tables) is updated on each
iteration. Second, Check-N-Run leverages quantization techniques to
significantly reduce the checkpoint size, without degrading training accuracy.
These techniques allow Check-N-Run to reduce the required write bandwidth by
6-17x and the required capacity by 2.5-8x on real-world models at Facebook, and
thereby significantly improve checkpoint capabilities while reducing the total
cost of ownership.
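To make the two techniques concrete, here is a minimal Python sketch of the idea. It is not Check-N-Run's actual implementation: the class name, the per-row dirty-set tracking, and the uniform 8-bit quantization scheme are illustrative assumptions. The sketch collects only the embedding rows modified since the last checkpoint and quantizes them before they would be written to storage.
```python
import numpy as np


class IncrementalEmbeddingCheckpointer:
    """Sketch of incremental + quantized checkpointing for one embedding table.

    Hypothetical helper, not Check-N-Run's API: it tracks which rows were
    updated since the last checkpoint and saves only those rows as uint8.
    """

    def __init__(self, table: np.ndarray):
        self.table = table            # (num_rows, dim) float32 embedding table
        self.dirty_rows = set()       # row ids touched since the last checkpoint

    def record_update(self, row_id: int) -> None:
        # Call this whenever a training step writes to an embedding row.
        self.dirty_rows.add(row_id)

    def checkpoint(self) -> dict:
        # Incremental part: gather only the modified rows.
        rows = np.array(sorted(self.dirty_rows), dtype=np.int64)
        chunk = self.table[rows]

        # Quantization part: per-row uniform 8-bit quantization, keeping the
        # scale and offset needed to dequantize at recovery time.
        lo = chunk.min(axis=1, keepdims=True)
        hi = chunk.max(axis=1, keepdims=True)
        scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
        quantized = np.round((chunk - lo) / scale).astype(np.uint8)

        self.dirty_rows.clear()
        # A real system would serialize this dict to (remote) storage.
        return {"rows": rows, "q": quantized, "scale": scale, "offset": lo}

    @staticmethod
    def restore(table: np.ndarray, ckpt: dict) -> None:
        # Dequantize the saved rows and write them back into a full-size table.
        table[ckpt["rows"]] = ckpt["q"].astype(np.float32) * ckpt["scale"] + ckpt["offset"]
```
Restoring a model would replay such incremental snapshots on top of the last full checkpoint; the bandwidth and capacity savings quoted in the abstract come from writing only the modified rows and from the compact uint8 encoding.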
Related papers
- Test-Time Training Done Right [61.8429380523577]
Test-Time Training (TTT) models context by adapting part of the model's weights (referred to as fast weights) during inference. Existing TTT methods struggled to show effectiveness in handling long-context data. We develop Large Chunk Test-Time Training (LaCT), which improves hardware utilization by orders of magnitude.
arXiv Detail & Related papers (2025-05-29T17:50:34Z) - Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging [2.9761595094633435]
Checkpoint merging is a technique for combining multiple model snapshots into a single superior model.
This paper explores checkpoint merging in the context of parameter-efficient fine-tuning.
We propose Metrics-Weighted Averaging (MWA) to merge model checkpoints by weighting their parameters according to performance metrics (a minimal sketch of this idea follows the related-papers list).
arXiv Detail & Related papers (2025-04-23T05:11:21Z) - 2 OLMo 2 Furious [126.72656187302502]
OLMo 2 includes dense autoregressive models with improved architecture and training recipe.
Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124.
Our fully open OLMo 2-Instruct models are competitive with, or surpass, open-weight-only models of comparable size.
arXiv Detail & Related papers (2024-12-31T21:55:10Z) - ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development [9.13331802151585]
ByteCheckpoint is an industrial-grade checkpointing system for large-scale LFM training.
ByteCheckpoint significantly reduces checkpoint stalls, achieving an average reduction of 54.20x.
For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
arXiv Detail & Related papers (2024-07-29T16:18:20Z) - Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training [16.04816181826873]
Existing checkpointing approaches are ill-suited for distributed training.
We propose Universal Checkpointing, a technique that enables efficient checkpoint creation.
Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures.
arXiv Detail & Related papers (2024-06-27T01:28:30Z) - ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking [39.02269424136506]
Large language models (LLM) have recently attracted significant attention in the field of artificial intelligence.
We propose a novel Extreme Checkpoint Compression (ExCP) framework, which significantly reduces the required storage of training checkpoints.
We extensively evaluate our proposed ExCP framework on several models ranging from 410M to 7B parameters and demonstrate significant storage reduction while maintaining strong performance.
arXiv Detail & Related papers (2024-06-17T06:47:29Z) - RepCNN: Micro-sized, Mighty Models for Wakeword Detection [3.4888176891918654]
Always-on machine learning models require a very low memory and compute footprint.
We show that a small convolutional model can be better trained by first refactoring its computation into a larger multi-branched architecture.
We show that our always-on wake-word detector model, RepCNN, provides a good trade-off between latency and accuracy during inference.
arXiv Detail & Related papers (2024-06-04T16:14:19Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models [64.49254199311137]
We propose a novel Instance-aware Dynamic Prompt Tuning (IDPT) strategy for pre-trained point cloud models.
The essence of IDPT is to develop a dynamic prompt generation module to perceive semantic prior features of each point cloud instance.
In experiments, IDPT outperforms full fine-tuning in most tasks with a mere 7% of the trainable parameters.
arXiv Detail & Related papers (2023-04-14T16:03:09Z) - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
arXiv Detail & Related papers (2022-12-09T18:57:37Z) - Fast Yet Effective Machine Unlearning [6.884272840652062]
We introduce a novel machine unlearning framework with error-maximizing noise generation and impair-repair based weight manipulation.
We show excellent unlearning while substantially retaining the overall model accuracy.
This work is an important step towards fast and easy implementation of unlearning in deep networks.
arXiv Detail & Related papers (2021-11-17T07:29:24Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE by reusing pre-trained models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - Towards Efficient Post-training Quantization of Pre-trained Language
Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs and propose module-wise reconstruction error minimization (MREM), an efficient solution that narrows the accuracy gap between PTQ and quantization-aware training (QAT).
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)