Efficient Backpropagation with Variance-Controlled Adaptive Sampling
- URL: http://arxiv.org/abs/2402.17227v1
- Date: Tue, 27 Feb 2024 05:40:36 GMT
- Title: Efficient Backpropagation with Variance-Controlled Adaptive Sampling
- Authors: Ziteng Wang, Jianfei Chen, Jun Zhu
- Abstract summary: Sampling-based algorithms, which eliminate "unimportant" computations during forward and/or back propagation (BP), offer potential solutions to accelerate neural network training.
We introduce a variance-controlled adaptive sampling (VCAS) method designed to accelerate BP.
VCAS can preserve the original training loss trajectory and validation accuracy with up to a 73.87% FLOPs reduction of BP and a 49.58% FLOPs reduction of the whole training process.
- Score: 32.297478086982466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sampling-based algorithms, which eliminate "unimportant" computations
during forward and/or back propagation (BP), offer potential solutions to
accelerate neural network training. However, since sampling introduces
approximations to training, such algorithms may not consistently maintain
accuracy across various tasks. In this work, we introduce a variance-controlled
adaptive sampling (VCAS) method designed to accelerate BP. VCAS computes an
unbiased stochastic gradient with fine-grained layerwise importance sampling in
the data dimension for activation gradient calculation and leverage score sampling
in the token dimension for weight gradient calculation. To preserve accuracy, we
control the additional variance by learning the sample ratio jointly with model
parameters during training. We assessed VCAS on multiple fine-tuning and
pre-training tasks in both vision and natural language domains. On all the
tasks, VCAS can preserve the original training loss trajectory and validation
accuracy with up to a 73.87% FLOPs reduction of BP and a 49.58% FLOPs reduction
of the whole training process. The implementation is available at
https://github.com/thu-ml/VCAS .
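As a rough illustration of the sampling idea described in the abstract (a minimal sketch, not the authors' VCAS implementation; the function name sample_activation_grad and the probability scheme are assumptions), the snippet below keeps rows of an activation gradient in the data (batch) dimension with probability proportional to their norm and rescales kept rows by 1/p, so the sampled gradient matches the full gradient in expectation:

```python
# Illustrative sketch only -- not the VCAS code from the paper.
import torch

def sample_activation_grad(grad_out: torch.Tensor, keep_ratio: float):
    """grad_out: [batch, ...] activation gradient; keep_ratio in (0, 1]."""
    batch = grad_out.shape[0]
    row_norms = grad_out.flatten(1).norm(dim=1)
    # Keep probability proportional to row norm, scaled so the expected number
    # of kept rows is roughly keep_ratio * batch, and capped at 1.
    probs = (row_norms / row_norms.sum().clamp(min=1e-12)) * keep_ratio * batch
    probs = probs.clamp(max=1.0)
    keep = torch.bernoulli(probs).bool()
    sampled = torch.zeros_like(grad_out)
    # Unbiasedness: rescale every kept row by 1 / (its keep probability).
    scale = (1.0 / probs[keep]).view(-1, *([1] * (grad_out.dim() - 1)))
    sampled[keep] = grad_out[keep] * scale
    return sampled, keep

# Usage: pass the sampled gradient to a custom backward pass and skip the
# activation-gradient computation for rows where keep is False.
```

In VCAS the sample ratio itself is learned jointly with the model parameters to control the added variance; here it is a fixed argument for simplicity.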
Related papers
- SeWA: Selective Weight Average via Probabilistic Masking [51.015724517293236]
We show that only a few points are needed to achieve better and faster convergence.
We transform the discrete selection problem into a continuous subset optimization framework.
We derive SeWA's stability bounds, which are sharper than those obtained under both convex and non-convex settings.
arXiv Detail & Related papers (2025-02-14T12:35:21Z) - Exploring Variance Reduction in Importance Sampling for Efficient DNN Training [1.7767466724342067]
This paper proposes a method for estimating variance reduction during deep neural network (DNN) training using only minibatches sampled under importance sampling.
An absolute metric to quantify the efficiency of importance sampling is also introduced, along with an algorithm for real-time estimation of importance scores based on moving gradient statistics.
arXiv Detail & Related papers (2025-01-23T00:43:34Z) - The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.
We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.
We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z) - On the Convergence of Loss and Uncertainty-based Active Learning Algorithms [3.506897386829711]
- On the Convergence of Loss and Uncertainty-based Active Learning Algorithms [3.506897386829711]
We investigate the convergence rates and data sample sizes required for training a machine learning model using a stochastic gradient descent (SGD) algorithm.
We present convergence results for linear classifiers and linearly separable datasets using squared hinge loss and similar training loss functions.
arXiv Detail & Related papers (2023-12-21T15:22:07Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Online Importance Sampling for Stochastic Gradient Optimization [33.42221341526944]
- Online Importance Sampling for Stochastic Gradient Optimization [33.42221341526944]
We propose a practical algorithm that efficiently computes data importance on-the-fly during training.
We also introduce a novel metric based on the derivative of the loss w.r.t. the network output, designed for mini-batch importance sampling.
arXiv Detail & Related papers (2023-11-24T13:21:35Z) - KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
- KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks.
We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process.
Our method can reduce total training time by up to 22% while impacting accuracy by only 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z) - An In-depth Study of Stochastic Backpropagation [44.953669040828345]
- An In-depth Study of Stochastic Backpropagation [44.953669040828345]
We study Stochastic Backpropagation (SBP) when training deep neural networks for standard image classification and object detection tasks.
During backward propagation, SBP calculates gradients by only using a subset of feature maps to save the GPU memory and computational cost.
Experiments on image classification and object detection show that SBP can save up to 40% of GPU memory with less than 1% accuracy loss.
arXiv Detail & Related papers (2022-09-30T23:05:06Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Predicting Training Time Without Training [120.92623395389255]
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.