Towards Scalable Backpropagation-Free Gradient Estimation
- URL: http://arxiv.org/abs/2511.03110v1
- Date: Wed, 05 Nov 2025 01:33:19 GMT
- Title: Towards Scalable Backpropagation-Free Gradient Estimation
- Authors: Daniel Wang, Evan Markou, Dylan Campbell
- Abstract summary: We introduce a gradient estimation approach that reduces both bias and variance by manipulating upstream Jacobian matrices when computing guess directions. It shows promising results and has the potential to scale to larger networks, indeed performing better as the network width is increased.
- Score: 19.04392409045641
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While backpropagation--reverse-mode automatic differentiation--has been extraordinarily successful in deep learning, it requires two passes (forward and backward) through the neural network and the storage of intermediate activations. Existing gradient estimation methods that instead use forward-mode automatic differentiation struggle to scale beyond small networks due to the high variance of the estimates. Efforts to mitigate this have so far introduced significant bias to the estimates, reducing their utility. We introduce a gradient estimation approach that reduces both bias and variance by manipulating upstream Jacobian matrices when computing guess directions. It shows promising results and has the potential to scale to larger networks, indeed performing better as the network width is increased. Our understanding of this method is facilitated by analyses of bias and variance, and their connection to the low-dimensional structure of neural network gradients.
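As a point of reference for the discussion above, the following sketch implements the basic forward-gradient estimator g_hat = (g . v) v with an isotropic random guess direction v on a toy quadratic objective (NumPy; all names are illustrative). It only illustrates the unbiased-but-high-variance behaviour the abstract refers to; the paper's Jacobian-manipulation step is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: L(w) = 0.5 * ||A w - b||^2, with a known analytic gradient.
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def loss_grad(w):
    """Analytic gradient of the toy quadratic loss (stands in for backprop)."""
    return A.T @ (A @ w - b)

def forward_gradient(w, rng):
    """Classic forward-gradient estimate: sample a random guess direction v,
    take the directional derivative (what a single forward-mode JVP pass
    would return), and scale v by it."""
    v = rng.standard_normal(w.shape)
    directional_derivative = loss_grad(w) @ v
    return directional_derivative * v

w = rng.standard_normal(10)
g_true = loss_grad(w)
estimates = np.stack([forward_gradient(w, rng) for _ in range(5000)])

# Unbiased (mean close to the true gradient) but high variance, which is the
# scaling obstacle the abstract describes.
print("norm of bias:      ", np.linalg.norm(estimates.mean(axis=0) - g_true))
print("mean squared error:", np.mean(np.sum((estimates - g_true) ** 2, axis=1)))
```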
Related papers
- Can Forward Gradient Match Backpropagation? [2.875726839945885]
Forward Gradients have been shown to be utilizable for neural network training.
We propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks.
We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.
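A minimal sketch of this idea, under illustrative assumptions: the guess direction comes from the gradient of a cheap proxy loss (standing in here for feedback from the paper's small local auxiliary networks), and a single forward-mode projection scales it.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

def true_grad(w):
    """Gradient of the 'expensive' global loss 0.5 * ||A w - b||^2."""
    return A.T @ (A @ w - b)

def auxiliary_grad(w):
    """Gradient of a cheap proxy loss (only the first 5 rows of A);
    an illustrative stand-in for a small, local auxiliary network."""
    return A[:5].T @ (A[:5] @ w - b[:5])

def guessed_forward_gradient(w, guess):
    """Forward gradient with an informed guess: scale the (normalised) guess
    by its directional derivative, obtainable with one forward-mode pass."""
    v = guess / (np.linalg.norm(guess) + 1e-12)
    return (true_grad(w) @ v) * v

w = rng.standard_normal(10)
g = true_grad(w)

random_est = guessed_forward_gradient(w, rng.standard_normal(10))
local_est = guessed_forward_gradient(w, auxiliary_grad(w))

# An informed guess is typically far better aligned with the true gradient
# than an isotropic random direction.
cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print("cosine(random-guess estimate, true grad):    ", cos(random_est, g))
print("cosine(local-loss-guess estimate, true grad):", cos(local_est, g))
```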
arXiv Detail & Related papers (2023-06-12T08:53:41Z) - Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks [3.04585143845864]
In deep linear networks, gradient descent implicitly regularizes toward low-rank solutions on matrix completion/factorization tasks.
We propose an explicit penalty to mirror this implicit bias which only takes effect with certain adaptive gradient generalizations.
This combination can enable a degenerate single-layer network to achieve low-rank approximations with generalization error comparable to deep linear networks.
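For illustration, the sketch below pairs a matrix-fitting objective with a nuclear-norm penalty, one standard explicit low-rank regulariser; this is not necessarily the paper's exact penalty or optimiser.

```python
import numpy as np

rng = np.random.default_rng(2)

# Ground-truth low-rank matrix to fit with a single-layer (linear) parameterisation.
M = rng.standard_normal((15, 2)) @ rng.standard_normal((2, 15))

lam, lr = 0.1, 0.01
W = 0.1 * rng.standard_normal((15, 15))

for step in range(2000):
    grad_fit = W - M                              # gradient of 0.5 * ||W - M||^2
    # (Sub)gradient of the nuclear norm ||W||_* (sum of singular values) is U @ Vt
    # for the thin SVD W = U diag(s) Vt.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    grad_nuc = U @ Vt
    W -= lr * (grad_fit + lam * grad_nuc)

print("fit error:", np.linalg.norm(W - M))
print("numerical rank of W:",
      int(np.sum(np.linalg.svd(W, compute_uv=False) > 1e-2)))
```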
arXiv Detail & Related papers (2023-06-01T04:47:17Z) - Efficient Estimation for Longitudinal Networks via Adaptive Merging [21.62069959992736]
We propose an efficient estimation framework for longitudinal networks, leveraging the strengths of adaptive network merging, tensor decomposition, and point processes.
It merges neighboring sparse networks so as to enlarge the number of observed edges and reduce estimation variance.
A projected descent algorithm is proposed to facilitate estimation, where an upper bound on the estimation error at each iteration is established.
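The following is a generic projected-descent template in the spirit of this summary, using a rank constraint as the example feasible set; it is illustrative, not the paper's estimator for longitudinal networks.

```python
import numpy as np

rng = np.random.default_rng(3)

# Observed noisy matrix whose underlying signal is low rank.
signal = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 20))
Y = signal + 0.1 * rng.standard_normal((20, 20))

def project_rank(X, r):
    """Projection onto matrices of rank at most r (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

X = np.zeros_like(Y)
lr, rank = 0.2, 3
for step in range(200):
    grad = X - Y                            # gradient of 0.5 * ||X - Y||^2
    X = project_rank(X - lr * grad, rank)   # descent step, then projection

print("relative error:", np.linalg.norm(X - signal) / np.linalg.norm(signal))
```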
arXiv Detail & Related papers (2022-11-15T03:17:11Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
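A single-linear-layer illustration of the variance gap between weight-space and activation-space perturbations; the local-loss machinery of the paper is omitted, and dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_out = 64, 8
x = rng.standard_normal(n_in)
y = rng.standard_normal(n_out)
W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)

# For one linear layer and loss 0.5 * ||W x - y||^2:
grad_z = W @ x - y            # gradient w.r.t. pre-activations z = W x
grad_W = np.outer(grad_z, x)  # true weight gradient

def weight_perturbation():
    """Guess direction sampled in weight space (dimension n_out * n_in)."""
    V = rng.standard_normal(W.shape)
    return np.sum(grad_W * V) * V

def activity_perturbation():
    """Guess direction sampled in activation space (dimension n_out only);
    the weight estimate is recovered through the outer product with x."""
    u = rng.standard_normal(n_out)
    return (grad_z @ u) * np.outer(u, x)

for name, fn in [("weight perturbation  ", weight_perturbation),
                 ("activity perturbation", activity_perturbation)]:
    est = np.stack([fn() for _ in range(2000)])
    err = np.mean(np.linalg.norm(est - grad_W, axis=(1, 2)) ** 2)
    print(f"{name}: mean squared error {err:.2f}")
```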
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - Low-memory stochastic backpropagation with multi-channel randomized trace estimation [6.985273194899884]
We propose to approximate the gradient of convolutional layers in neural networks with a multi-channel randomized trace estimation technique.
Compared to other methods, this approach is simple, amenable to analyses, and leads to a greatly reduced memory footprint.
We discuss the performance of networks trained with backpropagation and how the error can be controlled while minimizing memory usage and computational overhead.
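The sketch below shows the basic Hutchinson-style randomized trace estimator that this family of methods builds on; the paper's convolution-specific, multi-channel construction is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hutchinson-style randomized trace estimation:
#   trace(A) = E[z^T A z]  for probe vectors z with E[z z^T] = I.
n = 256
A = rng.standard_normal((n, n))
A = A @ A.T / n                      # symmetric PSD test matrix
true_trace = np.trace(A)

def hutchinson(A, num_probes, rng):
    """Average z^T A z over Rademacher probe vectors z."""
    z = rng.choice([-1.0, 1.0], size=(A.shape[0], num_probes))
    return np.mean(np.einsum('ij,ij->j', z, A @ z))

for k in (1, 4, 16, 64):
    est = hutchinson(A, k, rng)
    print(f"{k:3d} probes: estimate {est:10.2f}  (true {true_trace:.2f})")
```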
arXiv Detail & Related papers (2021-06-13T13:54:02Z) - Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning [97.28695683236981]
More gradient updates decrease the expressivity of the current value network.
We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings.
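A sketch of one way to quantify this loss of expressivity, via an effective-rank measure of a feature matrix; the threshold and the feature matrices below are illustrative rather than taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(6)

def effective_rank(features, delta=0.01):
    """Smallest k such that the top-k singular values capture a (1 - delta)
    fraction of the total singular-value mass of the feature matrix."""
    s = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Illustrative feature matrices (rows = states, columns = feature dimensions):
# a well-spread representation vs. a nearly collapsed one.
healthy = rng.standard_normal((512, 64))
collapsed = (rng.standard_normal((512, 3)) @ rng.standard_normal((3, 64))
             + 0.001 * rng.standard_normal((512, 64)))

print("effective rank (healthy):  ", effective_rank(healthy))
print("effective rank (collapsed):", effective_rank(collapsed))
```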
arXiv Detail & Related papers (2020-10-27T17:55:16Z) - Scaling Equilibrium Propagation to Deep ConvNets by Drastically Reducing its Gradient Estimator Bias [65.13042449121411]
In practice, training a network with the gradient estimates provided by Equilibrium Propagation (EP) does not scale to visual tasks harder than MNIST.
We show that a bias in the gradient estimate of EP, inherent in the use of finite nudging, is responsible for this phenomenon.
We apply these bias-reduction techniques to train an architecture with asymmetric forward and backward connections, yielding a 13.2% test error.
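As a loose analogy for the finite-nudging bias, the sketch below compares one-sided and symmetric finite differences of a scalar function in the nudging strength beta; EP's actual estimator operates on equilibria of an energy-based network, which is not modelled here.

```python
import numpy as np

# Analogy for EP's finite-nudging bias: estimating f'(x) from finite "nudges".
f = lambda x: np.exp(np.sin(x))                 # stand-in scalar objective
x = 0.7
true_grad = np.cos(x) * np.exp(np.sin(x))

for beta in (0.5, 0.1, 0.02):
    one_sided = (f(x + beta) - f(x)) / beta                # O(beta) bias
    symmetric = (f(x + beta) - f(x - beta)) / (2 * beta)   # O(beta^2) bias
    print(f"beta={beta:5.2f}  one-sided error {abs(one_sided - true_grad):.5f}"
          f"  symmetric error {abs(symmetric - true_grad):.5f}")
```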
arXiv Detail & Related papers (2020-06-06T09:36:07Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
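To make the estimation problem concrete, the sketch below contrasts, for a single stochastic binary unit, the exact gradient obtained by analytic enumeration with a pure-sampling (score-function) estimate; these are standard baselines, not the paper's path sample-analytic estimator, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# One stochastic binary unit b ~ Bernoulli(sigmoid(theta)) and loss E[f(b)].
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda b: (b - 0.3) ** 2          # arbitrary downstream loss
theta = 0.4
p = sigmoid(theta)

# Analytic gradient by enumerating both binary outcomes (feasible per unit):
#   d/dtheta [p*f(1) + (1-p)*f(0)] = p*(1-p) * (f(1) - f(0)).
analytic = p * (1 - p) * (f(1.0) - f(0.0))

# Pure-sampling (score-function / REINFORCE) estimator: unbiased but noisy.
b = (rng.random(100000) < p).astype(float)
score_estimates = f(b) * (b - p)      # f(b) * d log p(b) / d theta

print("analytic gradient:  ", analytic)
print("sampling mean / std:", score_estimates.mean(), score_estimates.std())
```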
arXiv Detail & Related papers (2020-06-04T21:51:21Z)