A Negative Result on Gradient Matching for Selective Backprop
- URL: http://arxiv.org/abs/2312.05021v1
- Date: Fri, 8 Dec 2023 13:03:10 GMT
- Title: A Negative Result on Gradient Matching for Selective Backprop
- Authors: Lukas Balles, Cedric Archambeau, Giovanni Zappella
- Abstract summary: With increasing model and dataset sizes, training deep neural networks becomes a massive computational burden.
One approach to speed up the training process is Selective Backprop.
We build on this approach by choosing the (weighted) subset which best matches the mean gradient over the entire minibatch.
We find that both the loss-based as well as the gradient-matching strategy fail to consistently outperform the random baseline.
- Score: 8.463693396893731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With increasing scale in model and dataset size, the training of deep neural
networks becomes a massive computational burden. One approach to speed up the
training process is Selective Backprop. For this approach, we perform a forward
pass to obtain a loss value for each data point in a minibatch. The backward
pass is then restricted to a subset of that minibatch, prioritizing high-loss
examples. We build on this approach, but seek to improve the subset selection
mechanism by choosing the (weighted) subset which best matches the mean
gradient over the entire minibatch. We use the gradients w.r.t. the model's
last layer as a cheap proxy, resulting in virtually no overhead in addition to
the forward pass. At the same time, for our experiments we add a simple random
selection baseline which has been absent from prior work. Surprisingly, we find
that both the loss-based as well as the gradient-matching strategy fail to
consistently outperform the random baseline.
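The abstract describes the procedure only at a high level. Below is a minimal, hypothetical PyTorch sketch of what one gradient-matching Selective Backprop step could look like, assuming cross-entropy classification and using the per-example gradient with respect to the logits as the cheap last-layer proxy; the greedy least-squares matching and all function names are illustrative assumptions, not the authors' implementation.
```python
# Hypothetical sketch of one gradient-matching Selective Backprop step.
import torch
import torch.nn.functional as F


def last_layer_grad_proxy(logits, labels):
    # Per-example cross-entropy gradient w.r.t. the logits, shape (B, C),
    # used here as a stand-in for the "last-layer gradient" proxy.
    return logits.softmax(dim=-1) - F.one_hot(labels, logits.shape[-1]).float()


def select_matching_subset(proxy, k):
    # Greedily pick k examples and least-squares weights w so that
    # sum_i w_i * g_i approximates the minibatch mean gradient.
    target = proxy.mean(dim=0)
    selected, residual = [], target.clone()
    for _ in range(k):
        scores = proxy @ residual             # correlation with the residual
        if selected:
            scores[selected] = -float("inf")  # do not pick an example twice
        selected.append(int(scores.argmax()))
        G = proxy[selected]                   # (|S|, C)
        w = torch.linalg.lstsq(G.T, target.unsqueeze(1)).solution.squeeze(1)
        residual = target - G.T @ w
    return selected, w


def selective_backprop_step(model, optimizer, x, y, k):
    # Forward pass over the full minibatch; no graph is needed for selection.
    with torch.no_grad():
        proxy = last_layer_grad_proxy(model(x), y)
    idx, w = select_matching_subset(proxy, k)
    # Backward pass restricted to the selected, weighted subset. (A real
    # implementation would reuse the stored activations instead of redoing
    # the forward pass on the subset.)
    optimizer.zero_grad()
    per_example = F.cross_entropy(model(x[idx]), y[idx], reduction="none")
    (w * per_example).sum().backward()
    optimizer.step()
```
Swapping `select_matching_subset` for `k` uniformly drawn indices with equal weights gives the random-selection baseline the paper finds hard to beat; picking the top-`k` per-example losses instead recovers the loss-based variant.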
Related papers
- Data Pruning via Moving-one-Sample-out [61.45441981346064]
We propose a novel data-pruning approach called moving-one-sample-out (MoSo).
MoSo aims to identify and remove the least informative samples from the training set.
Experimental results demonstrate that MoSo effectively mitigates severe performance degradation at high pruning ratios.
arXiv Detail & Related papers (2023-10-23T08:00:03Z)
- Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models, early dropout: dropout is applied only during the initial phases of training and turned off afterwards, as sketched below.
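As a rough, hypothetical illustration of this early-dropout schedule (the model, data, and cut-off epoch below are placeholders, not the paper's setup):
```python
# Minimal illustration of early dropout: the dropout rate is non-zero only
# for the first few epochs, then set to zero for the rest of training.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Dropout(p=0.1), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(64, 1, 28, 28)                 # dummy batch
y = torch.randint(0, 10, (64,))

drop_epochs, num_epochs = 10, 100              # assumed schedule
for epoch in range(num_epochs):
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = 0.1 if epoch < drop_epochs else 0.0  # early-dropout switch
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    opt.step()
```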
arXiv Detail & Related papers (2023-03-02T18:59:15Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- SIMPLE: A Gradient Estimator for $k$-Subset Sampling [42.38652558807518]
In this work, we fall back to discrete $k$-subset sampling on the forward pass.
We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators.
Empirical results show improved performance on learning to explain and sparse linear regression.
arXiv Detail & Related papers (2022-10-04T22:33:16Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- MBGDT: Robust Mini-Batch Gradient Descent [4.141960931064351]
We introduce a new method built on a base learner, such as Bayesian regression or gradient descent, to address the model's vulnerability.
Because mini-batch gradient descent allows for more robust convergence, we develop a method based on it, called Mini-Batch Gradient Descent with Trimming (MBGDT).
Our method shows state-of-the-art performance and greater robustness than several baselines when applied to our designed dataset.
arXiv Detail & Related papers (2022-06-14T19:52:23Z)
- Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent [25.589302381660453]
We show that any classifier trained using gradient descent with respect to the square loss will fail to achieve small test error in polynomial time.
For classification, we give a stronger result, namely that any statistical query (SQ) algorithm will fail to achieve small test error in polynomial time.
arXiv Detail & Related papers (2020-06-22T05:15:06Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)