DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
- URL: http://arxiv.org/abs/2203.15980v1
- Date: Wed, 30 Mar 2022 01:40:25 GMT
- Title: DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
- Authors: Yu Tang, Chenyu Wang, Yufan Zhang, Yuliang Liu, Xingcheng Zhang, Linbo
Qiao, Zhiquan Lai, Dongsheng Li
- Abstract summary: We propose a novel scheduler named DELTA for tensor swapping and tensor recomputation.
We show that DELTA saves 40%-70% of GPU memory, surpassing the state-of-the-art method by a large margin.
- Score: 29.804356645683463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The further development of deep neural networks is hampered by limited GPU memory, so optimizing the use of GPU memory resources is in high demand. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as an emerging domain, several challenges remain: 1) the efficiency of recomputation is limited for both static and dynamic methods; 2) swapping requires offloading parameters manually, which incurs a great time cost; 3) there is currently no dynamic, fine-grained method that combines tensor swapping with tensor recomputation. To remedy these issues, we propose a novel scheduler manager named DELTA (Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to build a reasonable dynamic runtime scheduler that combines tensor swapping and tensor recomputation without user oversight. In DELTA, we propose a filter algorithm to select the optimal tensors to be released from GPU memory and present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately used to hide the time cost of swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method by a large margin, but also achieves convergence comparable to the baseline with an acceptable time overhead. Moreover, DELTA achieves a 2.04$\times$ larger maximum batch size when training ResNet-50 and a 2.25$\times$ larger one when training ResNet-101, compared with the baseline. Finally, comparisons between the swapping cost and the recomputation cost in our experiments demonstrate the importance of a reasonable dynamic scheduler over tensor swapping and tensor recomputation, which refutes the argument in some related work that swapping should be the first and best choice.
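For intuition, below is a minimal, hypothetical sketch in Python of the kind of per-tensor decision logic the abstract describes: a filter step that picks which live tensors to release when memory runs short, and a director step that chooses between swapping and recomputation for each of them. All names (`TensorInfo`, `filter_candidates`, `choose_action`) and the cost heuristics are illustrative assumptions, not DELTA's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    SWAP = auto()        # offload to host memory now, prefetch back before next use
    RECOMPUTE = auto()   # drop now, regenerate from saved inputs when needed


@dataclass
class TensorInfo:
    """Per-tensor bookkeeping a dynamic scheduler might track (illustrative fields)."""
    name: str
    size_bytes: int        # GPU memory freed if this tensor is released
    recompute_time: float  # estimated seconds to regenerate it from its parents
    next_use: int          # how many operators ahead the tensor is read again


def swap_time(t: TensorInfo, pcie_bandwidth: float = 16e9) -> float:
    """Estimated round-trip cost of offloading the tensor and prefetching it back."""
    return 2 * t.size_bytes / pcie_bandwidth


def filter_candidates(live: list[TensorInfo], bytes_needed: int) -> list[TensorInfo]:
    """Filter step (sketch): prefer large tensors that will not be read again soon,
    stopping as soon as enough memory has been covered."""
    ranked = sorted(live, key=lambda t: t.size_bytes * t.next_use, reverse=True)
    picked, freed = [], 0
    for t in ranked:
        if freed >= bytes_needed:
            break
        picked.append(t)
        freed += t.size_bytes
    return picked


def choose_action(t: TensorInfo) -> Action:
    """Director step (sketch): release each selected tensor by whichever of
    swapping or recomputation is estimated to be cheaper for it."""
    return Action.SWAP if swap_time(t) < t.recompute_time else Action.RECOMPUTE


if __name__ == "__main__":
    live = [
        TensorInfo("conv1_out", 256 << 20, recompute_time=0.004, next_use=40),
        TensorInfo("fc_out", 8 << 20, recompute_time=0.0001, next_use=2),
    ]
    for t in filter_candidates(live, bytes_needed=200 << 20):
        print(t.name, choose_action(t).name)
```

In a real runtime, the swap branch would issue the copy on a separate CUDA stream (e.g. a non-blocking host transfer in PyTorch) and schedule the prefetch early enough to overlap with computation, in the spirit of the prefetching and overlapping discussed in the abstract.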
Related papers
- FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers [6.194917248699324]
This paper proposes a new technique for deep learning compilers called FTuner.
Experiments show that the FTuner can achieve comparable operators and end-to-end performance to vendor libraries.
arXiv Detail & Related papers (2024-07-31T08:05:33Z)
- CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization [10.319009303849109]
Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models costs massive GPU resources and computing time.
CoMERA achieves end-to-end rank-adaptive tensor-compressed training via a multi-objective optimization formulation.
CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training.
arXiv Detail & Related papers (2024-05-23T09:52:15Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- Coop: Memory is not a Commodity [0.9667631210393929]
Tensor rematerialization allows the training of deep neural networks (DNNs) under limited memory budgets.
We propose to evict tensors within a sliding window to ensure all evictions are contiguous and are immediately used.
We also propose cheap tensor partitioning and recomputable in-place to further reduce the rematerialization cost.
arXiv Detail & Related papers (2023-11-01T15:35:51Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on the over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
- Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination [11.173092834726528]
It is desirable to directly train a compact neural network from scratch with low memory and low computational cost.
Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks.
This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks.
arXiv Detail & Related papers (2020-10-17T01:23:26Z)
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall and significantly reduces the resulting energy consumption and CO2 emissions.
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)