Improving compute efficacy frontiers with SliceOut
- URL: http://arxiv.org/abs/2007.10909v2
- Date: Wed, 31 Mar 2021 23:06:40 GMT
- Title: Improving compute efficacy frontiers with SliceOut
- Authors: Pascal Notin, Aidan N. Gomez, Joanna Yoo, Yarin Gal
- Abstract summary: We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
- Score: 31.864949424541344
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pushing forward the compute efficacy frontier in deep learning is critical
for tasks that require frequent model re-training or workloads that entail
training a large number of models. We introduce SliceOut -- a dropout-inspired
scheme designed to take advantage of GPU memory layout to train deep learning
models faster without impacting final test accuracy. By dropping contiguous
sets of units at random, our method realises training speedups through (1) fast
memory access and matrix multiplication of smaller tensors, and (2) memory
savings by avoiding allocating memory to zero units in weight gradients and
activations. At test time, turning off SliceOut performs an implicit ensembling
across a linear number of architectures that preserves test accuracy. We
demonstrate 10-40% speedups and memory reduction with Wide ResNets,
EfficientNets, and Transformer models, with minimal to no loss in accuracy.
This leads to faster processing of large computational workloads overall, and
significantly reduces the resulting energy consumption and CO2 emissions.
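Conceptually, SliceOut drops a contiguous block of units rather than scattered ones, so the surviving weights form a smaller dense tensor that can be multiplied directly. Below is a minimal single-layer sketch of that idea; the class name, layer sizes, keep probability, and inverted-dropout rescaling are illustrative assumptions, not the authors' exact scheme.

```python
import torch
import torch.nn as nn

class SliceOutLinear(nn.Module):
    """Linear layer that, during training, keeps only a random contiguous
    slice of its output units, so the matmul runs on a smaller tensor."""

    def __init__(self, in_features, out_features, keep_prob=0.8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.keep_prob = keep_prob

    def forward(self, x):
        if not self.training:
            # At test time SliceOut is turned off: use the full layer.
            return x @ self.weight.t() + self.bias
        out_features = self.weight.shape[0]
        keep = max(1, int(out_features * self.keep_prob))
        start = torch.randint(0, out_features - keep + 1, (1,)).item()
        # Contiguous slice: smaller weight/bias views, smaller matmul,
        # and no memory allocated for the dropped units.
        w = self.weight[start:start + keep]
        b = self.bias[start:start + keep]
        # Inverted-dropout-style rescaling to keep activations on scale.
        return (x @ w.t() + b) / self.keep_prob

layer = SliceOutLinear(256, 512, keep_prob=0.75)
h = layer(torch.randn(8, 256))                      # training: shape (8, 384)
layer.eval(); h_full = layer(torch.randn(8, 256))   # eval: shape (8, 512)
```

In a full network, each downstream layer would slice the matching input rows of its own weight matrix so every intermediate tensor stays small; the sketch only illustrates the contiguous-slice mechanism for one layer.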
Related papers
- Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR), in which we fine-tune only a fraction of the total blocks of a pre-trained model.
Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x.
arXiv Detail & Related papers (2024-03-25T08:41:01Z)
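For the block-selective fine-tuning idea above, a hedged sketch of the generic mechanism, freezing everything except a chosen subset of blocks (the helper name, the `model.blocks` attribute, and the hard-coded block indices are placeholders; the paper's actual block-selection strategy is not reproduced here):

```python
import torch.nn as nn

def freeze_all_but_selected_blocks(model: nn.Module, blocks: nn.ModuleList,
                                   selected: list[int]) -> None:
    """Fine-tune only the chosen blocks: everything else gets no gradients."""
    for p in model.parameters():
        p.requires_grad = False                 # freeze the whole model
    for i in selected:
        for p in blocks[i].parameters():
            p.requires_grad = True              # re-enable the selected blocks

# Example with a hypothetical ViT-style model exposing `model.blocks`:
# freeze_all_but_selected_blocks(model, model.blocks, selected=[9, 10, 11])
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```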
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
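For intuition on the column-row sampling above, here is a plain uniform-sampling estimator of a matrix product; it is unbiased but is only a simplified stand-in, not the winner-take-all construction the paper proposes:

```python
import torch

def sampled_matmul(A: torch.Tensor, B: torch.Tensor, k: int) -> torch.Tensor:
    """Unbiased estimate of A @ B using k sampled column-row pairs.

    A is (m, n), B is (n, p). Sampling columns of A and matching rows of B
    uniformly and rescaling by n / k keeps the estimator unbiased:
    E[estimate] = A @ B.
    """
    n = A.shape[1]
    idx = torch.randint(0, n, (k,))
    return (A[:, idx] @ B[idx, :]) * (n / k)

A, B = torch.randn(64, 1024), torch.randn(1024, 32)
approx = sampled_matmul(A, B, k=256)   # lower memory/compute, higher variance
exact = A @ B
```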
- Rediscovering Hashed Random Projections for Efficient Quantization of Contextualized Sentence Embeddings [113.38884267189871]
Training and inference on edge devices often require an efficient setup due to computational limitations.
Pre-computing data representations and caching them on a server can mitigate extensive edge device computation.
We propose a simple yet effective approach that uses random hyperplane projections.
We show that the embeddings remain effective for training models across various English and German sentence classification tasks, retaining 94-99% of their floating-point performance.
arXiv Detail & Related papers (2023-03-13T10:53:00Z)
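A minimal sketch of hashing embeddings with random hyperplane projections, as described above; the projection width, seed, and bit packing are illustrative choices rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_BITS = 768, 1024
PLANES = rng.normal(size=(DIM, N_BITS))      # fixed random hyperplanes

def hash_embeddings(x: np.ndarray) -> np.ndarray:
    """Project embeddings onto the fixed hyperplanes and keep only the signs.

    x: (batch, DIM) float embeddings -> (batch, N_BITS // 8) uint8 codes.
    """
    bits = (x @ PLANES) > 0                  # one bit per hyperplane
    return np.packbits(bits, axis=1)         # pack 8 bits per byte

emb = rng.normal(size=(16, DIM)).astype(np.float32)
codes = hash_embeddings(emb)                 # shape (16, 128): 128 bytes each
```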
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline that aims to reduce the large training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
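As background for the re-parameterization above, a hedged sketch of the general principle, folding a parallel 3x3 + 1x1 training-time block into a single 3x3 convolution; this is a generic RepVGG-style fusion, not OREPA's specific online procedure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusableBlock(nn.Module):
    """Training-time block: parallel 3x3 and 1x1 branches added together."""

    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)

    @torch.no_grad()
    def fuse(self) -> nn.Conv2d:
        """Fold both branches into a single 3x3 convolution."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels,
                          3, padding=1)
        # Pad the 1x1 kernel to 3x3 (value in the centre) and sum the kernels.
        k1 = F.pad(self.conv1.weight, [1, 1, 1, 1])
        fused.weight.copy_(self.conv3.weight + k1)
        fused.bias.copy_(self.conv3.bias + self.conv1.bias)
        return fused

block = FusableBlock(8).eval()
x = torch.randn(2, 8, 16, 16)
assert torch.allclose(block(x), block.fuse()(x), atol=1e-5)
```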
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100, and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
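A simplified sketch of the underlying trick above, caching only a low-precision copy of the activation for the backward pass while keeping the forward pass exact; the toy linear op and uniform per-tensor int8 quantization are assumptions, not Mesa's actual scheme:

```python
import torch

class LowMemLinear(torch.autograd.Function):
    """Exact forward; saves an int8 copy of the input for the backward pass."""

    @staticmethod
    def forward(ctx, x, weight):
        out = x @ weight.t()                       # exact forward computation
        scale = x.abs().amax().clamp(min=1e-8) / 127.0
        x_q = (x / scale).round().clamp(-127, 127).to(torch.int8)
        ctx.save_for_backward(x_q, weight, scale)  # ~4x less activation memory
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_q, weight, scale = ctx.saved_tensors
        x_hat = x_q.float() * scale                # dequantized approximation
        grad_x = grad_out @ weight                 # exact gradient w.r.t. x
        grad_w = grad_out.t() @ x_hat              # approximate w.r.t. weight
        return grad_x, grad_w

x = torch.randn(32, 128, requires_grad=True)
w = torch.randn(64, 128, requires_grad=True)
LowMemLinear.apply(x, w).sum().backward()          # grads for both x and w
```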
- Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models [0.0]
We analyse the shortest possible training time for different configurations of distributed training.
We introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time by half.
arXiv Detail & Related papers (2021-06-04T19:21:49Z)
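For reference on the accumulation half of the idea above, a plain gradient-accumulation loop, the standard baseline the paper builds on; the layered variant and the modular pipeline schedule are not reproduced here:

```python
def train_epoch(model, loader, optimizer, criterion, accum_steps: int = 4):
    """Accumulate gradients over several micro-batches before each update,
    emulating a larger batch without the corresponding activation memory."""
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = criterion(model(x), y) / accum_steps   # average over micro-batches
        loss.backward()                               # gradients add up in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()                          # one update per accum_steps
            optimizer.zero_grad()
```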
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.