Linear Gradient Prediction with Control Variates
- URL: http://arxiv.org/abs/2511.05187v1
- Date: Fri, 07 Nov 2025 12:09:48 GMT
- Title: Linear Gradient Prediction with Control Variates
- Authors: Kamil Ciosek, Nicolò Felicioni, Juan Elenter Litwin,
- Abstract summary: We propose a new way of training neural networks, with the goal of reducing training cost. Our method uses approximate predicted gradients instead of the full gradients that require an expensive backward pass. We empirically show the efficacy of the technique on a vision transformer classification task.
- Score: 5.907996850796288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new way of training neural networks, with the goal of reducing training cost. Our method uses approximate predicted gradients instead of the full gradients that require an expensive backward pass. We derive a control-variate-based technique that ensures our updates are unbiased estimates of the true gradient. Moreover, we propose a novel way to derive a predictor for the gradient inspired by the theory of the Neural Tangent Kernel. We empirically show the efficacy of the technique on a vision transformer classification task.
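The abstract states the key guarantee (unbiased updates via a control variate) without giving the construction. The general control-variate principle it builds on can be sketched in a few lines: subtract a cheap correlated predictor from the expensive quantity and add the predictor's known mean back, which leaves the estimate unbiased while reducing variance. The toy functions `f` and `h` below are illustrative assumptions, not the paper's gradient predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: f is the expensive quantity (analogue of the true
# gradient from a backward pass); h is a cheap correlated predictor
# whose expectation is known exactly (analogue of the predicted
# gradient).
def f(x):
    return np.exp(x)        # E[f(X)] = exp(0.5) for X ~ N(0, 1)

def h(x):
    return 1.0 + x          # E[h(X)] = 1.0, known in closed form

x = rng.standard_normal(100_000)

naive = f(x).mean()

# Control-variate estimator: unbiased for any coefficient c, since
# E[h(X) - 1] = 0; variance is minimized at c* = Cov(f, h) / Var(h).
c = np.cov(f(x), h(x))[0, 1] / h(x).var()
cv = (f(x) - c * (h(x) - 1.0)).mean()

true_mean = float(np.exp(0.5))
```

The correction term `c * (h(x) - 1.0)` has mean zero, so unbiasedness holds no matter how good or bad the predictor is; only the variance depends on how well `h` correlates with `f`.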
Related papers
- One-Step Forward and Backtrack: Overcoming Zig-Zagging in Loss-Aware
Quantization Training [12.400950982075948]
Weight quantization is an effective technique to compress deep neural networks for their deployment on edge devices with limited resources.
Traditional loss-aware quantization methods commonly use the quantized gradient to replace the full-precision gradient.
This paper proposes a one-step forward and backtrack way for loss-aware quantization to get more accurate and stable gradient direction.
arXiv Detail & Related papers (2024-01-30T05:42:54Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when "done right" -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Can Forward Gradient Match Backpropagation? [2.875726839945885]
Forward gradients have been shown to be usable for neural network training.
We propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks.
We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.
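The baseline that this paper improves on -- the forward gradient estimator of Baydin et al. -- can be sketched on a toy quadratic where the directional derivative (normally a forward-mode JVP) is available analytically. The objective and dimensionality below are illustrative assumptions; the paper's contribution is replacing the random direction with a guess from a local auxiliary loss, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is w, so the
# directional derivative along v (a forward-mode JVP) is w @ v.
w = rng.standard_normal(5)

def forward_gradient(w, v):
    jvp = w @ v             # one forward-mode pass, no backprop
    return jvp * v          # E_v[(grad . v) v] = grad for v ~ N(0, I)

# A single guess is very noisy; averaging many recovers the gradient
# in expectation, which is why the guess direction matters so much.
guesses = [forward_gradient(w, rng.standard_normal(5))
           for _ in range(20_000)]
est = np.mean(guesses, axis=0)
```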
arXiv Detail & Related papers (2023-06-12T08:53:41Z) - Angle based dynamic learning rate for gradient descent [2.5077510176642805]
We propose a novel yet simple approach to obtain an adaptive learning rate for gradient-based descent methods on classification tasks.
Instead of the traditional approach of selecting adaptive learning rates via the expectation of gradient-based terms, we use the angle between the current gradient and the new gradient.
We find that our method leads to the highest accuracy in most of the datasets.
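The abstract only states the idea (drive the learning rate from the angle between the current and new gradient); the cosine-to-multiplier mapping below is a hypothetical choice for illustration, not the paper's formula.

```python
import numpy as np

def angle_scaled_lr(base_lr, g_prev, g_curr, eps=1e-12):
    """Scale the step by how well successive gradients agree.

    The cosine of the angle between the previous and current gradient
    is mapped to a multiplier in [0, 1]: aligned gradients keep the
    full step, opposing gradients (likely overshoot) shrink it.
    """
    cos = g_prev @ g_curr / (
        np.linalg.norm(g_prev) * np.linalg.norm(g_curr) + eps)
    return base_lr * 0.5 * (1.0 + cos)
```

With this choice, perfectly aligned gradients keep the base learning rate, orthogonal gradients halve it, and reversed gradients shrink it to nearly zero.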
arXiv Detail & Related papers (2023-04-20T16:55:56Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and consistently achieves state-of-the-art results on other tasks, including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Backward Gradient Normalization in Deep Neural Networks [68.8204255655161]
We introduce a new technique for gradient normalization during neural network training.
The gradients are rescaled during the backward pass using normalization layers introduced at certain points within the network architecture.
Results on tests with very deep neural networks show that the new technique effectively controls the gradient norm.
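The mechanics of the rescaling step can be sketched with a single hypothetical normalization point; this shows the general idea (unit-rescaling the gradient where a normalization layer sits in the backward pass), not the paper's exact layer placement.

```python
import numpy as np

def normalize_backward(grad, eps=1e-12):
    """Rescale a gradient flowing backward through a normalization
    point to unit L2 norm, leaving its direction unchanged."""
    return grad / (np.linalg.norm(grad) + eps)

# A vanishing gradient deep in the network is restored to unit scale,
# while the forward computation is untouched.
tiny = np.full(4, 1e-8)
restored = normalize_backward(tiny)
```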
arXiv Detail & Related papers (2021-06-17T13:24:43Z) - A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z) - Gravilon: Applications of a New Gradient Descent Method to Machine
Learning [0.5352699766206809]
We provide a novel gradient descent algorithm, called Gravilon, that uses the geometry of the hypersurface to modify the length of the step in the direction of the gradient.
Using neural networks, we provide promising experimental results comparing the accuracy and efficiency of Gravilon against commonly used gradient descent algorithms on MNIST digit classification.
arXiv Detail & Related papers (2020-08-26T04:02:02Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Estimating Training Data Influence by Tracing Gradient Descent [21.94989239842377]
TracIn computes the influence of a training example on a prediction made by the model.
TracIn is simple to implement; all it needs is a working implementation of gradient descent, and it is agnostic to the loss function.
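TracIn's idealized first-order formula is well known: the influence of a training example z on a test example z' is the sum over checkpoints t of the learning rate times the inner product of the loss gradients of z and z' at that checkpoint. The gradient vectors and learning rates below are made-up numbers for illustration only.

```python
import numpy as np

def tracin_influence(lrs, train_grads, test_grads):
    """Idealized TracIn influence: sum over checkpoints of
    eta_t * <grad loss(w_t, z_train), grad loss(w_t, z_test)>."""
    return sum(eta * (gz @ gzp)
               for eta, gz, gzp in zip(lrs, train_grads, test_grads))

# Illustrative numbers: two checkpoints, 3-dimensional gradients.
lrs = [0.1, 0.05]
train_grads = [np.array([1.0, 0.0, 2.0]), np.array([0.5, 1.0, 0.0])]
test_grads = [np.array([2.0, 1.0, 0.0]), np.array([1.0, 1.0, 1.0])]
score = tracin_influence(lrs, train_grads, test_grads)
```

A positive score marks the training example as a proponent of the prediction (its gradient steps reduced the test loss), a negative score as an opponent.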
arXiv Detail & Related papers (2020-02-19T22:40:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.