FastPersist: Accelerating Model Checkpointing in Deep Learning
- URL: http://arxiv.org/abs/2406.13768v1
- Date: Wed, 19 Jun 2024 18:31:23 GMT
- Title: FastPersist: Accelerating Model Checkpointing in Deep Learning
- Authors: Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He
- Abstract summary: We propose FastPersist to accelerate checkpoint creation in Deep Learning (DL) training.
FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the available SSDs in training environments, and (iii) overlapping checkpointing with independent training computations.
Our evaluation shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.
- Score: 21.308403847800573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model checkpoints are critical Deep Learning (DL) artifacts that enable fault tolerance for training and downstream applications, such as inference. However, writing checkpoints to persistent storage, and other I/O aspects of DL training, are mostly ignored by compute-focused optimization efforts for faster training of rapidly growing models and datasets. Towards addressing this imbalance, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the available SSDs in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation using real world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.
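The last two techniques (write parallelism and overlap with training) lend themselves to a short illustration. Below is a minimal sketch, assuming PyTorch with torch.distributed already initialized; the round-robin sharding, file naming, plain background thread, and use of torch.save are illustrative simplifications, not FastPersist's actual NVMe-level implementation.

```python
import os
import threading
import torch
import torch.distributed as dist

def _write_shard(shard, path):
    # A production implementation would issue NVMe-friendly I/O (aligned,
    # unbuffered writes); torch.save keeps the sketch short.
    torch.save(shard, path)

def checkpoint_async(state_dict, ckpt_dir, step):
    """Persist this rank's slice of the checkpoint on a background thread."""
    rank, world = dist.get_rank(), dist.get_world_size()
    keys = sorted(state_dict.keys())
    # Round-robin assignment: each data-parallel rank writes a disjoint subset of
    # tensors, so aggregate write bandwidth scales with the SSDs across ranks.
    shard = {k: state_dict[k].detach().cpu()
             for i, k in enumerate(keys) if i % world == rank}
    path = os.path.join(ckpt_dir, f"step{step}-rank{rank:05d}.pt")
    writer = threading.Thread(target=_write_shard, args=(shard, path))
    writer.start()
    return writer  # join (or poll) before reusing buffers or checkpointing again
```

In this view, only the device-to-host copy sits on the training-critical path; the disk write proceeds concurrently with the following iterations, which is what makes frequent (even per-iteration) checkpointing plausible.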
Related papers
- A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks [81.2624272756733]
In dense retrieval, deep encoders provide embeddings for both inputs and targets.
We train a small parametric corrector network that adjusts stale cached target embeddings.
Our approach matches state-of-the-art results even when no target embedding updates are made during training.
arXiv Detail & Related papers (2024-09-03T13:29:13Z)
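A hedged sketch of the corrector idea summarized above: a small residual MLP maps a stale cached target embedding toward what the current encoder would produce, so cached targets can keep being reused during training. The module name, layer sizes, residual form, and MSE training signal are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class Corrector(nn.Module):
    """Adjusts a stale cached embedding toward the current encoder's output."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, stale_embedding):
        # Residual correction keeps the output close to the cached embedding.
        return stale_embedding + self.net(stale_embedding)

# Illustrative training signal: recompute fresh embeddings for a small subsample of
# targets each step and minimize nn.functional.mse_loss(corrector(stale), fresh).
```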
- DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models [3.3484462092188005]
We introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and state shards remain immutable for extended periods of time.
The results show up to 48x faster checkpointing and 2.2x faster end-to-end training compared with state-of-the-art checkpointing approaches.
arXiv Detail & Related papers (2024-06-15T18:30:40Z)
- Token-wise Influential Training Data Retrieval for Large Language Models [8.42342318438945]
RapidIn is a framework adapted to Large Language Models for estimating the influence of each training data sample.
RapidIn efficiently traverses the cached gradients to estimate the influence within minutes, achieving over a 6,326x speedup.
arXiv Detail & Related papers (2024-05-20T01:57:34Z)
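The cached-gradient traversal described above can be pictured with the generic gradient-similarity view of influence estimation: flatten and cache per-example training gradients once, then score every training example against a query gradient with dot products. This sketch omits RapidIn's compression and parallel traversal; the function names are illustrative.

```python
import numpy as np

def cache_gradients(per_example_grads):
    # per_example_grads: iterable of flattened 1-D gradients, one per training example.
    return np.stack([g.astype(np.float32) for g in per_example_grads])

def influence_scores(cached_grads, query_grad):
    # Influence of each cached training example approximated by gradient alignment.
    return cached_grads @ query_grad.astype(np.float32)

# Usage: ranking = np.argsort(-influence_scores(cached, test_grad)) orders the
# training examples by their estimated influence on the test example.
```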
- FREE: Faster and Better Data-Free Meta-Learning [77.90126669914324]
Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of pre-trained models without requiring the original data.
We introduce the Faster and Better Data-Free Meta-Learning framework, which contains: (i) a meta-generator for rapidly recovering training tasks from pre-trained models; and (ii) a meta-learner for generalizing to new unseen tasks.
arXiv Detail & Related papers (2024-05-02T03:43:19Z)
- Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting growing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally relies on a parameter server and a large number of edge devices throughout model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
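As a rough illustration of the "sparsification and quantization" in the TEASQ-Fed summary above, the sketch below compresses a client update with top-k selection followed by 8-bit uniform quantization before upload; the paper's actual scheme and hyperparameters may differ.

```python
import numpy as np

def compress_update(update, k):
    idx = np.argpartition(np.abs(update), -k)[-k:]          # top-k sparsification
    vals = update[idx]
    scale = max(float(np.abs(vals).max()), 1e-12) / 127.0   # 8-bit range
    quantized = np.round(vals / scale).astype(np.int8)      # uniform quantization
    return idx, quantized, scale

def decompress_update(idx, quantized, scale, size):
    out = np.zeros(size, dtype=np.float32)
    out[idx] = quantized.astype(np.float32) * scale
    return out
```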
- Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning which is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase the system throughput, they introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
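A toy sketch of the approximate-key caching idea summarized above: inputs are mapped to a coarse quantized key, so sufficiently similar inputs hit the cache and skip the expensive inference call, trading accuracy for throughput. The grid quantizer, class name, and unbounded dictionary (no LRU eviction) are simplifying assumptions.

```python
import numpy as np

class ApproxKeyCache:
    def __init__(self, model_fn, grid=0.1):
        self.model_fn = model_fn   # the expensive DL inference call
        self.grid = grid           # coarser grid => more approximate hits, more error
        self.store = {}

    def _key(self, x):
        return tuple(np.round(np.asarray(x, dtype=np.float32) / self.grid).astype(int).tolist())

    def predict(self, x):
        k = self._key(x)
        if k not in self.store:           # miss: run the model and cache the output
            self.store[k] = self.model_fn(x)
        return self.store[k]              # hit: reuse a nearby input's output
```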
- MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge [72.16021611888165]
This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices.
The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S).
Our results suggest that unforgettable examples can be identified in-situ even during the dynamic exploration of sparsity masks.
arXiv Detail & Related papers (2021-10-26T21:15:17Z)
- Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training [0.5219568203653523]
We develop a sparse DNN training accelerator that produces pruned models with the same accuracy as dense models without first training, then pruning, and finally retraining, a dense model.
Compared to training the equivalent unpruned models using a state-of-the-art DNN accelerator without sparse training support, Procrustes consumes up to 3.26x less energy and offers up to 4x speedup across a range of models, while pruning weights by an order of magnitude and maintaining unpruned accuracy.
arXiv Detail & Related papers (2020-09-23T07:39:55Z)
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
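Based on the summary above, a hedged sketch of a SliceOut-style layer: during training only a random contiguous slice of hidden units (and the matching weight rows/columns) is used, so the reduced computation stays dense and contiguous in memory; at test time the full layer runs. The rescaling, layer shape, and keep fraction are illustrative assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceOutMLP(nn.Module):
    def __init__(self, d_in, d_hidden, keep_frac=0.75):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_in)
        self.keep_frac = keep_frac

    def forward(self, x):
        if not self.training:
            return self.w2(torch.relu(self.w1(x)))            # full width at test time
        h = self.w1.out_features
        keep = max(1, int(h * self.keep_frac))
        start = torch.randint(0, h - keep + 1, (1,)).item()
        sl = slice(start, start + keep)
        # Contiguous slice of the hidden units: smaller, still-dense matmuls.
        hidden = torch.relu(F.linear(x, self.w1.weight[sl], self.w1.bias[sl])) / self.keep_frac
        return F.linear(hidden, self.w2.weight[:, sl], self.w2.bias)
```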
- Optimizing Memory-Access Patterns for Deep Learning Accelerators [6.931196464448543]
Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost.
Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads.
It is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory.
This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses.
arXiv Detail & Related papers (2020-02-27T05:06:19Z)