Moonwalk: Inverse-Forward Differentiation
- URL: http://arxiv.org/abs/2402.14212v1
- Date: Thu, 22 Feb 2024 01:33:31 GMT
- Title: Moonwalk: Inverse-Forward Differentiation
- Authors: Dmitrii Krylov, Armin Karamzade, Roy Fox
- Abstract summary: Forward-mode gradient computation is explored as an alternative to backpropagation in invertible networks.
Moonwalk is the first forward-based method to compute true gradients in invertible networks in computation time comparable to backpropagation.
- Score: 4.425689868461635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backpropagation, while effective for gradient computation, falls short in
addressing memory consumption, limiting scalability. This work explores
forward-mode gradient computation as an alternative in invertible networks,
showing its potential to reduce the memory footprint without substantial
drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian
product that accelerates the computation of forward gradients while retaining
the advantages of memory reduction and preserving the fidelity of true
gradients. Our method, Moonwalk, has a time complexity linear in the depth of
the network, unlike the quadratic time complexity of naïve forward, and
empirically reduces computation time by several orders of magnitude without
allocating more memory. We further accelerate Moonwalk by combining it with
reverse-mode differentiation to achieve time complexity comparable with
backpropagation while maintaining a much smaller memory footprint. Finally, we
showcase the robustness of our method across several architecture choices.
Moonwalk is the first forward-based method to compute true gradients in
invertible networks in computation time comparable to backpropagation and using
significantly less memory.
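For context on the forward-mode alternative the abstract refers to, the sketch below contrasts reverse-mode differentiation (backpropagation) with the basic JVP-based forward-gradient estimator in JAX. It illustrates only the general forward-mode idea; Moonwalk's vector-inverse-Jacobian product and its complexity guarantees are not reproduced here, and the toy network, shapes, and loss are assumptions made for this example.

```python
import jax
import jax.numpy as jnp

def loss(params, x):
    # Toy two-layer network; architecture, shapes, and loss are illustrative
    # assumptions, not the networks used in the paper.
    w1, w2 = params
    h = jnp.tanh(x @ w1)
    return jnp.sum((h @ w2) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
params = (jax.random.normal(k1, (8, 16)), jax.random.normal(k2, (16, 4)))
x = jax.random.normal(k3, (32, 8))

# Reverse mode (backpropagation): exact gradient, but intermediate
# activations must be kept alive for the backward pass.
g_exact = jax.grad(loss)(params, x)

# Forward mode: pick a random tangent direction v over the parameters and
# obtain the directional derivative (grad . v) from a single jvp call.
v = (jax.random.normal(k4, (8, 16)), jax.random.normal(k5, (16, 4)))
_, dloss_dv = jax.jvp(lambda p: loss(p, x), (params,), (v,))

# Classic forward-gradient estimator: E[(grad . v) v] = grad when v has zero
# mean and identity covariance, computed without storing an activation stack.
g_forward = jax.tree_util.tree_map(lambda vi: dloss_dv * vi, v)
```

A single jvp costs roughly one extra forward pass and keeps no activation stack, which is the memory advantage the abstract highlights; the price is that one tangent direction gives only a noisy estimate, which is where exact-gradient forward methods such as Moonwalk depart from this baseline.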
Related papers
- Accelerated Training through Iterative Gradient Propagation Along the Residual Path [46.577761606415805]
Highway backpropagation is a parallelizable iterative algorithm that approximates backpropagation.
It is adaptable to a diverse set of common architectures, ranging from ResNets and Transformers to recurrent neural networks.
arXiv Detail & Related papers (2025-01-28T17:14:42Z)
- Inverted Activations: Reducing Memory Footprint in Neural Network Training [5.070981175240306]
A significant challenge in neural network training is the memory footprint associated with activation tensors.
We propose a modification to the handling of activation tensors in pointwise nonlinearity layers.
We show that our method significantly reduces memory usage without affecting training accuracy or computational performance.
arXiv Detail & Related papers (2024-07-22T11:11:17Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a rough sketch of this activation-perturbation idea appears after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Nonsmooth automatic differentiation: a cheap gradient principle and other complexity results [0.0]
We provide a model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs.
Prominent examples are the famous relu and convolutional neural networks together with their standard loss functions.
arXiv Detail & Related papers (2022-06-01T08:43:35Z)
- Efficient Neural Network Training via Forward and Backward Propagation Sparsification [26.301103403328312]
We propose an efficient sparse training method with completely sparse forward and backward passes.
Our algorithm is much more effective in accelerating the training process, up to an order of magnitude faster.
arXiv Detail & Related papers (2021-11-10T13:49:47Z)
- Low-memory stochastic backpropagation with multi-channel randomized trace estimation [6.985273194899884]
We propose to approximate the gradient of convolutional layers in neural networks with a multi-channel randomized trace estimation technique.
Compared to other methods, this approach is simple, amenable to analyses, and leads to a greatly reduced memory footprint.
We discuss the performance of networks trained with backpropagation and how the error can be controlled while maximizing memory usage and minimizing computational overhead.
arXiv Detail & Related papers (2021-06-13T13:54:02Z)
- Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization [79.42778415729475]
We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
arXiv Detail & Related papers (2020-11-05T14:57:16Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)
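The "Scaling Forward Gradient With Local Losses" entry above mentions perturbing activations rather than weights to reduce the variance of forward gradients. The following is a rough sketch of that activation-perturbation idea under assumed toy shapes (layer, head_loss, and all names here are invented for illustration, not that paper's code): estimate dL/dh with a single jvp at the activation h, then pull the estimate back onto the weights through the local layer with vjp.

```python
import jax
import jax.numpy as jnp

def layer(w, x):
    return jnp.tanh(x @ w)      # produces the intermediate activation h

def head_loss(h):
    return jnp.sum(h ** 2)      # stand-in for the rest of the network / loss

key = jax.random.PRNGKey(0)
kw, kx, ku = jax.random.split(key, 3)
w = jax.random.normal(kw, (8, 16))
x = jax.random.normal(kx, (32, 8))

h = layer(w, x)
u = jax.random.normal(ku, h.shape)            # random tangent on the activation

# One forward-mode pass gives the directional derivative of the loss along u.
_, dloss_du = jax.jvp(head_loss, (h,), (u,))
dL_dh_hat = dloss_du * u                      # unbiased estimate of dL/dh

# Chain the activation-gradient estimate back to the weights through the
# local layer only (a cheap, local reverse-mode step).
_, pullback = jax.vjp(lambda w_: layer(w_, x), w)
(dL_dw_hat,) = pullback(dL_dh_hat)
```

Because the random perturbation lives in activation space, which is typically much lower-dimensional than the full weight space, this estimator tends to have lower variance than weight-space perturbation, which is the effect that entry refers to.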
This list is automatically generated from the titles and abstracts of the papers on this site; the site does not guarantee its accuracy and is not responsible for any consequences of its use.