Efficient Backpropagation with Variance-Controlled Adaptive Sampling
- URL: http://arxiv.org/abs/2402.17227v1
- Date: Tue, 27 Feb 2024 05:40:36 GMT
- Title: Efficient Backpropagation with Variance-Controlled Adaptive Sampling
- Authors: Ziteng Wang, Jianfei Chen, Jun Zhu
- Abstract summary: Sampling-based algorithms, which eliminate "unimportant" computations during forward and/or back propagation (BP), offer potential solutions to accelerate neural network training.
We introduce a variance-controlled adaptive sampling (VCAS) method designed to accelerate BP.
VCAS preserves the original training loss trajectory and validation accuracy with up to a 73.87% FLOPs reduction in BP and a 49.58% FLOPs reduction for the whole training process.
- Score: 32.297478086982466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sampling-based algorithms, which eliminate "unimportant" computations
during forward and/or back propagation (BP), offer potential solutions to
accelerate neural network training. However, since sampling introduces
approximations to training, such algorithms may not consistently maintain
accuracy across various tasks. In this work, we introduce a variance-controlled
adaptive sampling (VCAS) method designed to accelerate BP. VCAS computes an
unbiased stochastic gradient with fine-grained layerwise importance sampling in
the data dimension for activation gradient calculation and leverage score
sampling in the token dimension for weight gradient calculation. To preserve
accuracy, we
control the additional variance by learning the sample ratio jointly with model
parameters during training. We assessed VCAS on multiple fine-tuning and
pre-training tasks in both vision and natural language domains. On all the
tasks, VCAS can preserve the original training loss trajectory and validation
accuracy, with up to a 73.87% FLOPs reduction in BP and a 49.58% FLOPs reduction
for the whole training process. The implementation is available at
https://github.com/thu-ml/VCAS .
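To make the mechanism concrete, below is a minimal NumPy sketch of two ingredients the abstract describes: unbiased importance sampling of rows (samples) in the backward pass, and a controller that adapts the sample ratio to keep the extra variance bounded. The function names and the controller rule are illustrative assumptions, not the VCAS code released at the repository above.

```python
import numpy as np

def sample_rows_unbiased(grad_out, keep_ratio, rng):
    """Keep rows (samples) of an activation gradient with probability
    proportional to their L2 norm and rescale kept rows by 1/p, so the
    sparse estimate is unbiased: E[sparse] == grad_out. Illustrative only."""
    n = grad_out.shape[0]
    scores = np.linalg.norm(grad_out, axis=1) + 1e-12
    p = np.minimum(1.0, keep_ratio * n * scores / scores.sum())  # per-row keep prob
    mask = rng.random(n) < p
    sparse = np.zeros_like(grad_out)
    sparse[mask] = grad_out[mask] / p[mask, None]                # rescale for unbiasedness
    return sparse, mask

def update_keep_ratio(keep_ratio, extra_var, grad_var, tau=0.025, step=1.05):
    """Crude variance controller (an assumed rule, not the paper's exact one):
    grow the sample ratio when the sampling variance exceeds a fraction tau of
    the gradient variance, shrink it otherwise."""
    if extra_var > tau * grad_var:
        return min(1.0, keep_ratio * step)
    return max(0.01, keep_ratio / step)

rng = np.random.default_rng(0)
g = rng.standard_normal((256, 768)) * rng.random((256, 1))  # toy activation gradients
sparse_g, mask = sample_rows_unbiased(g, keep_ratio=0.3, rng=rng)
print("kept rows:", int(mask.sum()), "of", g.shape[0])
```

In the actual method the sampling is applied layerwise during BP and is complemented by leverage score sampling over tokens for the weight gradients; this sketch covers only the data-dimension step.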
Related papers
- On the Convergence of Loss and Uncertainty-based Active Learning Algorithms [3.506897386829711]
We investigate the convergence rates and data sample sizes required for training a machine learning model using a stochastic gradient descent (SGD) algorithm.
We present convergence results for linear classifiers and linearly separable datasets using squared hinge loss and similar training loss functions.
arXiv Detail & Related papers (2023-12-21T15:22:07Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
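For the sparse fine-tuning idea above (SIFT), here is a toy sketch of a gradient-based sparse update, assuming a top-k magnitude criterion and a dense NumPy layout; the names and the selection rule are assumptions for illustration, not the SIFT implementation.

```python
import numpy as np

def sparse_increment_step(w, grad, delta, lr=1e-3, density=0.01):
    """Toy gradient-based sparse update: only the largest-magnitude gradient
    entries are updated, accumulated in a sparse increment `delta` on top of
    frozen pre-trained weights `w`. Illustrative; the actual SIFT algorithm
    differs in how it selects and stores the sparse indices."""
    k = max(1, int(density * grad.size))
    flat = np.abs(grad).ravel()
    idx = np.argpartition(flat, -k)[-k:]          # indices of top-k |grad|
    upd = np.zeros_like(grad).ravel()
    upd[idx] = grad.ravel()[idx]
    delta -= lr * upd.reshape(grad.shape)         # only the increment is trained
    return w + delta, delta                       # effective weights, increment

w = np.random.randn(768, 768) * 0.02              # "pre-trained" weights (frozen)
delta = np.zeros_like(w)
grad = np.random.randn(*w.shape)                  # stand-in for a backprop gradient
w_eff, delta = sparse_increment_step(w, grad, delta)
print("nonzero increments:", np.count_nonzero(delta))
```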
- KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks.
We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process.
Our method can reduce total training time by up to 22% while impacting accuracy by only 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z)
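A minimal sketch of the sample-hiding idea described above, assuming per-sample losses from the previous epoch as the importance signal; the actual KAKURENBO criterion and its adaptive hiding schedule differ.

```python
import numpy as np

def select_visible_samples(losses, hide_fraction=0.2):
    """Return indices of samples to train on this epoch and indices to hide.
    "Importance" is approximated here by the per-sample loss tracked in the
    previous epoch; low-loss samples are skipped. Assumption-laden sketch."""
    n_hide = int(hide_fraction * len(losses))
    order = np.argsort(losses)           # ascending: smallest loss first
    hidden = order[:n_hide]              # least-important samples are skipped
    visible = order[n_hide:]
    return visible, hidden

losses = np.random.rand(50_000)          # stand-in for tracked per-sample losses
visible, hidden = select_visible_samples(losses, hide_fraction=0.2)
print(f"training on {len(visible)} samples, hiding {len(hidden)}")
```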
- A Novel Adaptive Causal Sampling Method for Physics-Informed Neural Networks [35.25394937917774]
Physics-Informed Neural Networks (PINNs) have become an attractive machine learning method for obtaining solutions of partial differential equations (PDEs).
We introduce temporal causality into adaptive sampling and propose a novel adaptive causal sampling method to improve the performance and efficiency of PINNs.
We demonstrate that by utilizing such a relatively simple sampling method, prediction performance can be improved by up to two orders of magnitude compared with state-of-the-art results.
arXiv Detail & Related papers (2022-10-24T01:51:08Z)
- An In-depth Study of Stochastic Backpropagation [44.953669040828345]
We study Stochastic Backpropagation (SBP) when training deep neural networks for standard image classification and object detection tasks.
During backward propagation, SBP calculates gradients by only using a subset of feature maps to save the GPU memory and computational cost.
Experiments on image classification and object detection show that SBP can save up to 40% of GPU memory with less than 1% accuracy degradation.
arXiv Detail & Related papers (2022-09-30T23:05:06Z)
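The gist of Stochastic Backpropagation above can be illustrated with a small NumPy example: the weight gradient of a linear layer is estimated from a random subset of positions and rescaled so it is unbiased in expectation. This is a sketch under simplifying assumptions, not the paper's exact scheme.

```python
import numpy as np

def sbp_weight_grad(x, grad_out, keep_prob=0.5, rng=None):
    """Estimate a linear layer's weight gradient from a random subset of
    rows (e.g. spatial positions / tokens), rescaled by 1/keep_prob so the
    estimate is unbiased in expectation. Rough sketch of the SBP idea."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape[0]) < keep_prob
    # Full gradient would be grad_out.T @ x over all rows; use kept rows only.
    return grad_out[mask].T @ x[mask] / keep_prob

rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 256))        # activations (positions x features)
g = rng.standard_normal((4096, 128))        # upstream gradients
approx = sbp_weight_grad(x, g, keep_prob=0.5, rng=rng)
exact = g.T @ x
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```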
- Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
arXiv Detail & Related papers (2022-07-16T03:09:30Z)
- Efficient training of physics-informed neural networks via importance sampling [2.9005223064604078]
Physics-Informed Neural Networks (PINNs) are a class of deep neural networks that are trained to compute the response of systems governed by partial differential equations (PDEs).
We show that an importance sampling approach will improve the convergence behavior of PINNs training.
arXiv Detail & Related papers (2021-04-26T02:45:10Z)
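For the importance-sampling idea above, a short sketch that draws collocation points in proportion to their current PDE residual magnitude; the residuals are placeholders and the weighting scheme is an assumption for illustration, not the paper's algorithm.

```python
import numpy as np

def sample_collocation_points(points, residuals, batch_size, rng):
    """Draw a training batch of collocation points with probability
    proportional to |PDE residual|, plus importance weights 1/(N*p) so the
    weighted batch loss remains an unbiased estimate of the full loss.
    The residuals are placeholders for values produced by the network."""
    probs = np.abs(residuals) + 1e-12
    probs = probs / probs.sum()
    idx = rng.choice(len(points), size=batch_size, replace=True, p=probs)
    weights = 1.0 / (probs[idx] * len(points))
    return points[idx], weights

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(10_000, 2))   # (x, t) collocation points
res = rng.standard_normal(10_000) ** 2          # stand-in residual magnitudes
batch, w = sample_collocation_points(pts, res, batch_size=256, rng=rng)
print(batch.shape, w.mean())
```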
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
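A compact sketch of the per-sample weighting described above: harder (higher-loss) samples in a mini-batch receive larger weights via a softmax at a temperature, and the weighted loss is fed to ordinary momentum SGD. The exact ABSGD formulation and its combination with robust losses may differ; treat this as illustrative.

```python
import numpy as np

def absgd_weights(losses, lam=1.0):
    """Per-sample weights within a mini-batch, larger for higher-loss samples,
    via a softmax at temperature lam, normalized to mean weight 1. The weights
    are treated as constants (no gradient flows through them)."""
    z = (losses - losses.max()) / lam          # shift for numerical stability
    w = np.exp(z)
    return w * len(losses) / w.sum()

def weighted_batch_loss(losses, lam=1.0):
    """Scalar loss whose gradient (with the weights held fixed) is the
    importance-weighted average of per-sample gradients; plug into momentum SGD."""
    return np.mean(absgd_weights(losses, lam) * losses)

losses = np.array([0.1, 0.2, 2.5, 0.05])       # toy per-sample losses
print(absgd_weights(losses, lam=0.5))
print(weighted_batch_loss(losses, lam=0.5))
```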
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation [72.40827239394565]
We propose to compute features only at sparsely sampled locations.
We then densely reconstruct the feature map with an efficient procedure.
The presented network is experimentally shown to save substantial computation while maintaining accuracy over a variety of computer vision tasks.
arXiv Detail & Related papers (2020-03-19T15:36:31Z)
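A bare-bones sketch of the sample-then-interpolate idea above: an expensive per-pixel feature is computed only at randomly chosen locations and the dense map is filled in by nearest-neighbor interpolation. The paper learns where to sample and uses a more careful interpolation; everything here is a simplifying assumption.

```python
import numpy as np

def sparse_then_interpolate(image, feature_fn, sample_rate=0.1, rng=None):
    """Evaluate an expensive per-pixel feature only at randomly sampled
    locations, then densely reconstruct by nearest-neighbor interpolation.
    Stand-in for stochastic feature sampling + interpolation, demo only."""
    rng = rng or np.random.default_rng()
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)        # (h*w, 2)
    k = max(1, int(sample_rate * h * w))
    idx = rng.choice(h * w, size=k, replace=False)             # sampled pixels
    sampled_vals = feature_fn(image.ravel()[idx])              # compute only here
    # Nearest sampled location for every pixel (brute force, fine for a demo).
    d2 = ((coords[:, None, :] - coords[idx][None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    return sampled_vals[nearest].reshape(h, w)

rng = np.random.default_rng(0)
img = rng.random((32, 32))
dense = sparse_then_interpolate(img, feature_fn=np.sqrt, rng=rng)
print(dense.shape)
```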