Stochastic Backpropagation: A Memory Efficient Strategy for Training
Video Models
- URL: http://arxiv.org/abs/2203.16755v1
- Date: Thu, 31 Mar 2022 02:24:53 GMT
- Title: Stochastic Backpropagation: A Memory Efficient Strategy for Training
Video Models
- Authors: Feng Cheng, Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Li, Wei
Xia
- Abstract summary: We propose a memory efficient method, named Stochastic Backpropagation (SBP), for training deep neural networks on videos.
Experiments show that SBP can be applied to a wide range of models for video tasks, leading to up to 80.0% GPU memory saving and 10% training speedup with less than 1% accuracy drop on action recognition and temporal action detection.
- Score: 42.31924917984774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a memory efficient method, named Stochastic Backpropagation (SBP),
for training deep neural networks on videos. It is based on the finding that
gradients from an incomplete backward pass can still effectively
train the models with minimal accuracy loss, which we attribute to the high
redundancy of video. SBP keeps all forward paths but randomly and independently
removes the backward paths for each network layer in each training step. It
reduces the GPU memory cost by eliminating the need to cache activation values
corresponding to the dropped backward paths, whose amount can be controlled by
an adjustable keep-ratio. Experiments show that SBP can be applied to a wide
range of models for video tasks, leading to up to 80.0% GPU memory saving and
10% training speedup with less than 1% accuracy drop on action recognition and
temporal action detection.
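As a rough illustration of the mechanism described above, the sketch below implements a PyTorch-style frame-wise linear layer that runs the forward pass on every frame but caches activations for only a keep_ratio fraction of randomly chosen frames and propagates gradients only through them. The (batch, frames, channels) layout, the name _SBPMatmul, and the simple gradient rescaling are illustrative assumptions, not the authors' implementation.

    import torch


    class _SBPMatmul(torch.autograd.Function):
        # Forward uses every frame; backward only sees a keep_ratio fraction,
        # so only that fraction of the input activations needs to be cached.

        @staticmethod
        def forward(ctx, x, weight, keep_ratio):
            # x: (B, T, C_in) per-frame features; weight: (C_out, C_in)
            out = x @ weight.t()                      # full forward over all frames
            T = x.shape[1]
            keep = max(1, int(T * keep_ratio))
            idx = torch.randperm(T, device=x.device)[:keep]
            ctx.save_for_backward(x[:, idx].contiguous(), weight, idx)
            ctx.num_frames = T
            return out

        @staticmethod
        def backward(ctx, grad_out):
            x_kept, weight, idx = ctx.saved_tensors
            T, keep = ctx.num_frames, idx.numel()
            g_kept = grad_out[:, idx]                 # gradients only for kept frames
            scale = T / keep                          # one simple rescaling choice
            grad_w = scale * torch.einsum('bto,bti->oi', g_kept, x_kept)
            grad_x = grad_out.new_zeros(grad_out.shape[0], T, weight.shape[1])
            grad_x[:, idx] = g_kept @ weight          # dropped frames get no gradient
            return grad_x, grad_w, None


    # Toy usage: cache activations for only ~25% of the 16 frames.
    x = torch.randn(2, 16, 64, requires_grad=True)
    w = torch.randn(32, 64, requires_grad=True)
    _SBPMatmul.apply(x, w, 0.25).sum().backward()

In a deep video model, wrapping each layer this way is what yields the memory saving, since every layer's cached activations shrink in proportion to the keep-ratio.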
Related papers
- Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR), in which we fine-tune only a fraction of the total blocks of a pre-trained model.
Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x (a minimal block-freezing sketch appears after this list).
arXiv Detail & Related papers (2024-03-25T08:41:01Z)
- DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation [13.768426626459558]
We propose Dropping Backward Propagation (DropBP) to reduce computational costs and activation memory while maintaining accuracy.
DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules.
It can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5x, and enable training with a sequence length 6.2x larger on a single NVIDIA-A100 GPU (a rough sketch of the layer-dropping idea appears after this list).
arXiv Detail & Related papers (2024-02-27T14:51:11Z)
- Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization [65.33914980022303]
Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content.
Most methods can only train on pre-extracted features without optimizing them for the localization problem.
We propose a novel end-to-end method Re2TAL, which rewires pretrained video backbones for reversible TAL.
arXiv Detail & Related papers (2022-11-25T12:17:30Z)
- An In-depth Study of Stochastic Backpropagation [44.953669040828345]
We study Stochastic Backpropagation (SBP) when training deep neural networks for standard image classification and object detection tasks.
During backward propagation, SBP calculates gradients by only using a subset of feature maps to save the GPU memory and computational cost.
Experiments on image classification and object detection show that SBP can save up to 40% of GPU memory with less than 1% accuracy drop.
arXiv Detail & Related papers (2022-09-30T23:05:06Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training (a sketch of this caching scheme appears after this list).
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove the channels with the smallest importance.
GDP can be plugged in before convolutional layers, without bells and whistles, to control whether each channel is on or off.
Experiments conducted on the CIFAR-10 and ImageNet datasets show that the proposed GDP achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-09-06T03:17:10Z)
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
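For the Block Selective Reprogramming entry above, the snippet below is a minimal sketch of the block-freezing part of the idea, assuming a generic stack of blocks; the paper's actual block-selection criterion and reprogramming details are not reproduced here.

    import torch
    from torch import nn


    def train_only(blocks: nn.ModuleList, trainable_ids):
        # Freeze every block except the selected ones; frozen blocks keep no
        # gradients or optimizer state, which is where the memory saving comes from.
        for i, block in enumerate(blocks):
            block.requires_grad_(i in trainable_ids)


    # Toy usage: a stack of 12 blocks, of which only 3 are fine-tuned.
    blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(12)])
    train_only(blocks, trainable_ids={3, 7, 11})
    optimizer = torch.optim.SGD([p for p in blocks.parameters() if p.requires_grad], lr=1e-2)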
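For the DropBP entry above, the sketch below drops a block from the backward pass while keeping its forward computation, assuming a residual architecture; the wrapper name DropBPBlock and the uniform drop probability are illustrative choices rather than the authors' code.

    import torch
    from torch import nn


    class DropBPBlock(nn.Module):
        # Residual wrapper whose backward pass is skipped with probability drop_prob.
        # The forward output is unchanged; when dropped, the sub-layer runs under
        # torch.no_grad(), so no activations are cached and its parameters receive
        # no gradient this step.

        def __init__(self, layer: nn.Module, drop_prob: float = 0.5):
            super().__init__()
            self.layer = layer
            self.drop_prob = drop_prob

        def forward(self, x):
            if self.training and torch.rand(()).item() < self.drop_prob:
                with torch.no_grad():
                    residual = self.layer(x)   # no backward graph built here
                return x + residual            # gradients still flow via the skip path
            return x + self.layer(x)           # normal forward and backward


    # Toy usage: four blocks, each dropped from the backward pass half the time.
    model = nn.Sequential(*[DropBPBlock(nn.Linear(64, 64), drop_prob=0.5) for _ in range(4)])
    x = torch.randn(8, 64, requires_grad=True)
    model(x).sum().backward()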
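For the Mesa entry above, the sketch below caches activations at reduced precision for the backward pass while computing the forward pass exactly; fp16 is used here only for simplicity, since the abstract specifies just "low-precision", and the name HalfCachedLinear is an illustrative assumption.

    import torch


    class HalfCachedLinear(torch.autograd.Function):
        # Exact fp32 forward computation, but only an fp16 copy of the input
        # activation is kept for the backward pass.

        @staticmethod
        def forward(ctx, x, weight):
            out = x @ weight.t()                     # exact forward
            ctx.save_for_backward(x.half(), weight)  # cache activations at half precision
            return out

        @staticmethod
        def backward(ctx, grad_out):
            x_half, weight = ctx.saved_tensors
            x = x_half.float()                       # dequantize the cached copy
            grad_x = grad_out @ weight               # (N, C_out) @ (C_out, C_in)
            grad_w = grad_out.t() @ x                # (C_out, N) @ (N, C_in)
            return grad_x, grad_w


    # Toy usage: the weight gradient is computed from the fp16-cached activations.
    x = torch.randn(8, 64, requires_grad=True)
    w = torch.randn(32, 64, requires_grad=True)
    HalfCachedLinear.apply(x, w).sum().backward()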
This list is automatically generated from the titles and abstracts of the papers on this site.