DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training
- URL: http://arxiv.org/abs/2202.13808v1
- Date: Mon, 28 Feb 2022 14:12:00 GMT
- Title: DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training
- Authors: Joya Chen, Kai Xu, Yifei Cheng, Angela Yao
- Abstract summary: A standard hardware bottleneck when training deep neural networks is GPU memory.
We propose a novel method to reduce this footprint by selecting and caching only part of the intermediate tensors needed for gradient computation.
Experiments show that we can drop up to 90% of the elements of the intermediate tensors in convolutional and fully-connected layers, saving 20% GPU memory during training.
- Score: 29.02792751614279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A standard hardware bottleneck when training deep neural networks is GPU
memory. The bulk of memory is occupied by caching intermediate tensors for
gradient computation in the backward pass. We propose a novel method to reduce
this footprint by selecting and caching part of intermediate tensors for
gradient computation. Our Intermediate Tensor Drop method (DropIT) adaptively
drops components of the intermediate tensors and recovers sparsified tensors
from the remaining elements in the backward pass to compute the gradient.
Experiments show that we can drop up to 90% of the elements of the intermediate
tensors in convolutional and fully-connected layers, saving 20% GPU memory
during training while achieving higher test accuracy for standard backbones
such as ResNet and Vision Transformer. Our code is available at
https://github.com/ChenJoya/dropit.
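The abstract does not spell out the selection rule, so the following is only a minimal sketch of the idea for a fully-connected layer, assuming a top-k-by-magnitude criterion and a hypothetical `DropITLinearFn` name; the official implementation at the repository above may differ (e.g., adaptive selection and genuinely sparse storage).

```python
import torch

class DropITLinearFn(torch.autograd.Function):
    """Linear layer that caches only the largest-magnitude fraction of the
    input activations for the backward pass (illustrative sketch only)."""

    @staticmethod
    def forward(ctx, x, weight, bias, keep_ratio):
        y = x @ weight.t() + bias
        # Keep only the top-k elements of x (by magnitude); zero out the rest.
        # A real implementation would store values/indices sparsely to
        # actually save memory rather than keeping a zeroed dense tensor.
        k = max(1, int(keep_ratio * x.numel()))
        flat = x.flatten().abs()
        threshold = torch.kthvalue(flat, flat.numel() - k + 1).values
        x_sparse = torch.where(x.abs() >= threshold, x, torch.zeros_like(x))
        ctx.save_for_backward(x_sparse, weight)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        x_sparse, weight = ctx.saved_tensors
        grad_x = grad_out @ weight            # does not need the cached input
        grad_w = grad_out.t() @ x_sparse      # approximated with the sparsified input
        grad_b = grad_out.sum(dim=0)
        return grad_x, grad_w, grad_b, None
```

A call would look like `DropITLinearFn.apply(x, weight, bias, 0.1)` to keep roughly 10% of the input elements for the weight-gradient computation.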
Related papers
- Inverted Activations: Reducing Memory Footprint in Neural Network Training [5.070981175240306]
A significant challenge in neural network training is the memory footprint associated with activation tensors.
We propose a modification to the handling of activation tensors in pointwise nonlinearity layers.
We show that our method significantly reduces memory usage without affecting training accuracy or computational performance.
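The summary does not detail the mechanism; one plausible reading, suggested by the title, is to cache the nonlinearity's output (which the next layer often stores anyway) and invert the function in the backward pass instead of caching its input. A hypothetical sketch for LeakyReLU, which is invertible:

```python
import torch

class InvertedLeakyReLU(torch.autograd.Function):
    """LeakyReLU that saves its output and reconstructs the input by
    inverting the function in backward, so the input tensor itself does not
    have to be cached. Illustrative sketch only, not the paper's method."""

    @staticmethod
    def forward(ctx, x, slope):
        y = torch.where(x >= 0, x, slope * x)
        ctx.save_for_backward(y)
        ctx.slope = slope
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        slope = ctx.slope
        x = torch.where(y >= 0, y, y / slope)   # invert the nonlinearity
        grad_x = torch.where(x >= 0, grad_out, slope * grad_out)
        return grad_x, None

# usage: y = InvertedLeakyReLU.apply(x, 0.2)
```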
arXiv Detail & Related papers (2024-07-22T11:11:17Z)
- NeuroFlux: Memory-Efficient CNN Training Using Adaptive Local Learning [2.61072980439312]
Convolutional Neural Network (CNN) training in resource-constrained mobile and edge environments is an open challenge.
Backpropagation is the standard approach adopted, but it is GPU memory intensive due to its strong inter-layer dependencies.
We introduce NeuroFlux, a novel CNN training system tailored for memory-constrained scenarios.
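The summary gives no implementation details; as a generic illustration of local learning (not NeuroFlux itself), each block below is trained against its own auxiliary head and detached from its successor, so no backward graph ever spans the whole network:

```python
import torch
from torch import nn

# Hypothetical block-wise local learning: the backward graph of one block
# never reaches into the previous block, so cached activations stay small.
blocks = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()),
])
heads = nn.ModuleList([
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)),
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10)),
])
opt = torch.optim.SGD(list(blocks.parameters()) + list(heads.parameters()), lr=0.1)

def local_train_step(x, target):
    opt.zero_grad()
    h = x
    for block, head in zip(blocks, heads):
        h = block(h)
        loss = nn.functional.cross_entropy(head(h), target)
        loss.backward()          # frees this block's graph immediately
        h = h.detach()           # the next block starts a fresh, small graph
    opt.step()

local_train_step(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
```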
arXiv Detail & Related papers (2024-02-21T21:33:07Z)
- Coop: Memory is not a Commodity [0.9667631210393929]
Tensor rematerialization allows the training of deep neural networks (DNNs) under limited memory budgets.
We propose to evict tensors within a sliding window to ensure all evictions are contiguous and are immediately used.
We also propose cheap tensor partitioning and recomputable in-place operations to further reduce the rematerialization cost.
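Coop's sliding-window eviction policy is beyond a short sketch, but the underlying rematerialization mechanism can be illustrated with PyTorch's stock gradient checkpointing, which likewise discards activations after the forward pass and recomputes them on demand in the backward pass:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# Activations inside the checkpointed segments are not cached; they are
# recomputed during backward. Coop's contribution is *which* tensors to
# evict (sliding window, partitioning); this only shows the mechanism.
model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU())
                        for _ in range(8)])
x = torch.randn(32, 512, requires_grad=True)

out = checkpoint_sequential(model, 4, x)   # 4 checkpointed segments
out.sum().backward()
```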
arXiv Detail & Related papers (2023-11-01T15:35:51Z)
- Tensor Completion via Leverage Sampling and Tensor QR Decomposition for Network Latency Estimation [2.982069479212266]
Large-scale network latency estimation requires a lot of computing time.
We propose a new method that is much faster and maintains high accuracy.
Numerical experiments show that our method is faster than state-of-the-art algorithms while maintaining satisfactory accuracy.
arXiv Detail & Related papers (2023-06-27T07:21:26Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
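For context, a plain column-row sampling (CRS) estimator of a matrix product is sketched below; it is unbiased because each sampled column-row outer product is reweighted by the inverse of its sampling probability. WTA-CRS additionally keeps the heaviest column-row pairs deterministically, which is not shown here:

```python
import torch

def crs_matmul(A, B, s):
    """Unbiased column-row sampling estimate of A @ B from s sampled
    column-row pairs (plain CRS baseline, not WTA-CRS itself)."""
    # Sampling probabilities proportional to column/row norms reduce variance.
    p = A.norm(dim=0) * B.norm(dim=1)
    p = p / p.sum()
    idx = torch.multinomial(p, s, replacement=True)
    scale = 1.0 / (s * p[idx])                 # importance weights
    return (A[:, idx] * scale) @ B[idx, :]

A, B = torch.randn(64, 1024), torch.randn(1024, 32)
approx = crs_matmul(A, B, s=256)
exact = A @ B
```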
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can cut the memory footprint during training roughly in half.
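A minimal sketch of the general idea for a fully-connected layer, assuming simple per-tensor symmetric int8 quantization (the authors' implementation targets Transformer-specific activations and may quantize differently):

```python
import torch

class LowPrecisionCacheLinear(torch.autograd.Function):
    """Forward uses the exact activation; only an 8-bit quantized copy is
    cached for the backward pass. Illustrative sketch only."""

    @staticmethod
    def forward(ctx, x, weight, bias):
        y = x @ weight.t() + bias
        # Per-tensor symmetric int8 quantization of the cached activation.
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        ctx.save_for_backward((x / scale).round().to(torch.int8), weight, scale)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        x_q, weight, scale = ctx.saved_tensors
        x_hat = x_q.float() * scale            # dequantize the cached copy
        grad_x = grad_out @ weight
        grad_w = grad_out.t() @ x_hat
        grad_b = grad_out.sum(dim=0)
        return grad_x, grad_w, grad_b
```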
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at only a fraction of their entries.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z)
- Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and exploit certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
- Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination [11.173092834726528]
It is desirable to directly train a compact neural network from scratch with low memory and low computational cost.
Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks.
This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks.
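As a stripped-down illustration of low-rank factorized training (omitting the paper's tensorized formats and Bayesian automatic rank determination), a linear layer can be parameterized directly by its factors and trained end to end:

```python
import torch
from torch import nn

class LowRankLinear(nn.Module):
    """Linear layer with W factored as U @ V (rank r), trained end to end.
    Only the basic low-rank idea; names and initialization are assumptions."""

    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Stores rank * (in + out) parameters instead of in * out.
        return x @ self.V.t() @ self.U.t() + self.bias

layer = LowRankLinear(1024, 1024, rank=32)   # ~65k params vs ~1.05M dense
y = layer(torch.randn(8, 1024))
```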
arXiv Detail & Related papers (2020-10-17T01:23:26Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We find that the accuracy decline of integer-only networks is due to activation quantization, and address it by replacing the conventional ReLU with a Bounded ReLU.
Our integer networks achieve performance equivalent to the corresponding full-precision (FPN) networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
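A Bounded ReLU simply clips activations to a fixed interval so their dynamic range is known ahead of quantization; a minimal version is sketched below (the bound value is an assumption, as in the common ReLU6):

```python
import torch
from torch import nn

class BoundedReLU(nn.Module):
    """Clips activations to [0, bound], keeping their range fixed so that
    uniform activation quantization is well behaved."""

    def __init__(self, bound=6.0):
        super().__init__()
        self.bound = bound

    def forward(self, x):
        return x.clamp(min=0.0, max=self.bound)

act = BoundedReLU(bound=6.0)
y = act(torch.randn(4, 8))
```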
arXiv Detail & Related papers (2020-06-21T08:23:03Z)