Moonwalk: Inverse-Forward Differentiation
- URL: http://arxiv.org/abs/2402.14212v1
- Date: Thu, 22 Feb 2024 01:33:31 GMT
- Title: Moonwalk: Inverse-Forward Differentiation
- Authors: Dmitrii Krylov, Armin Karamzade, Roy Fox
- Abstract summary: Forward-mode gradient computation is explored as an alternative to backpropagation in invertible networks.
Moonwalk is the first forward-based method to compute true gradients in invertible networks in computation time comparable to backpropagation.
- Score: 4.425689868461635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backpropagation, while effective for gradient computation, falls short in
addressing memory consumption, limiting scalability. This work explores
forward-mode gradient computation as an alternative in invertible networks,
showing its potential to reduce the memory footprint without substantial
drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian
product that accelerates the computation of forward gradients while retaining
the advantages of memory reduction and preserving the fidelity of true
gradients. Our method, Moonwalk, has a time complexity linear in the depth of
the network, unlike the quadratic time complexity of naïve forward, and
empirically reduces computation time by several orders of magnitude without
allocating more memory. We further accelerate Moonwalk by combining it with
reverse-mode differentiation to achieve time complexity comparable with
backpropagation while maintaining a much smaller memory footprint. Finally, we
showcase the robustness of our method across several architecture choices.
Moonwalk is the first forward-based method to compute true gradients in
invertible networks in computation time comparable to backpropagation and using
significantly less memory.
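For context on the forward-mode alternative the abstract refers to, the sketch below contrasts reverse-mode differentiation (backpropagation) with the basic JVP-based forward-gradient estimator in JAX. It illustrates only the general forward-mode idea; Moonwalk's vector-inverse-Jacobian product and its complexity guarantees are not reproduced here, and the toy network, shapes, and loss are assumptions made for this example.

```python
import jax
import jax.numpy as jnp

def loss(params, x):
    # Toy two-layer network; architecture, shapes, and loss are illustrative
    # assumptions, not the networks used in the paper.
    w1, w2 = params
    h = jnp.tanh(x @ w1)
    return jnp.sum((h @ w2) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
params = (jax.random.normal(k1, (8, 16)), jax.random.normal(k2, (16, 4)))
x = jax.random.normal(k3, (32, 8))

# Reverse mode (backpropagation): exact gradient, but intermediate
# activations must be kept alive for the backward pass.
g_exact = jax.grad(loss)(params, x)

# Forward mode: pick a random tangent direction v over the parameters and
# obtain the directional derivative (grad . v) from a single jvp call.
v = (jax.random.normal(k4, (8, 16)), jax.random.normal(k5, (16, 4)))
_, dloss_dv = jax.jvp(lambda p: loss(p, x), (params,), (v,))

# Classic forward-gradient estimator: E[(grad . v) v] = grad when v has zero
# mean and identity covariance, computed without storing an activation stack.
g_forward = jax.tree_util.tree_map(lambda vi: dloss_dv * vi, v)
```

A single jvp costs roughly one extra forward pass and keeps no activation stack, which is the memory advantage the abstract highlights; the price is that one tangent direction gives only a noisy estimate, which is where exact-gradient forward methods such as Moonwalk depart from this baseline.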
Related papers
- Accelerated Training through Iterative Gradient Propagation Along the Residual Path [46.577761606415805]
Highway backpropagation is a parallelizable iterative algorithm that approximates backpropagation.
It is adaptable to a diverse set of common architectures, ranging from ResNets and Transformers to recurrent neural networks.
arXiv Detail & Related papers (2025-01-28T17:14:42Z)
- Inverted Activations: Reducing Memory Footprint in Neural Network Training [5.070981175240306]
A significant challenge in neural network training is the memory footprint associated with activation tensors.
We propose a modification to the handling of activation tensors in pointwise nonlinearity layers.
We show that our method significantly reduces memory usage without affecting training accuracy or computational performance.
arXiv Detail & Related papers (2024-07-22T11:11:17Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a rough sketch of this activation-perturbation idea appears after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Nonsmooth automatic differentiation: a cheap gradient principle and other complexity results [0.0]
We provide a model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs.
Prominent examples are the famous relu and convolutional neural networks together with their standard loss functions.
arXiv Detail & Related papers (2022-06-01T08:43:35Z)
- Efficient Neural Network Training via Forward and Backward Propagation Sparsification [26.301103403328312]
We propose an efficient sparse training method with completely sparse forward and backward passes.
Our algorithm is much more effective in accelerating the training process, up to an order of magnitude faster.
arXiv Detail & Related papers (2021-11-10T13:49:47Z)
- Low-memory stochastic backpropagation with multi-channel randomized trace estimation [6.985273194899884]
We propose to approximate the gradient of convolutional layers in neural networks with a multi-channel randomized trace estimation technique.
Compared to other methods, this approach is simple, amenable to analyses, and leads to a greatly reduced memory footprint.
We discuss the performance of networks trained with backpropagation and how the error can be controlled while maximizing memory usage and minimizing computational overhead.
arXiv Detail & Related papers (2021-06-13T13:54:02Z)
- Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization [79.42778415729475]
We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
arXiv Detail & Related papers (2020-11-05T14:57:16Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)
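The "Scaling Forward Gradient With Local Losses" entry above mentions perturbing activations rather than weights to reduce the variance of forward gradients. The following is a rough sketch of that activation-perturbation idea under assumed toy shapes (layer, head_loss, and all names here are invented for illustration, not that paper's code): estimate dL/dh with a single jvp at the activation h, then pull the estimate back onto the weights through the local layer with vjp.

```python
import jax
import jax.numpy as jnp

def layer(w, x):
    return jnp.tanh(x @ w)      # produces the intermediate activation h

def head_loss(h):
    return jnp.sum(h ** 2)      # stand-in for the rest of the network / loss

key = jax.random.PRNGKey(0)
kw, kx, ku = jax.random.split(key, 3)
w = jax.random.normal(kw, (8, 16))
x = jax.random.normal(kx, (32, 8))

h = layer(w, x)
u = jax.random.normal(ku, h.shape)            # random tangent on the activation

# One forward-mode pass gives the directional derivative of the loss along u.
_, dloss_du = jax.jvp(head_loss, (h,), (u,))
dL_dh_hat = dloss_du * u                      # unbiased estimate of dL/dh

# Chain the activation-gradient estimate back to the weights through the
# local layer only (a cheap, local reverse-mode step).
_, pullback = jax.vjp(lambda w_: layer(w_, x), w)
(dL_dw_hat,) = pullback(dL_dh_hat)
```

Because the random perturbation lives in activation space, which is typically much lower-dimensional than the full weight space, this estimator tends to have lower variance than weight-space perturbation, which is the effect that entry refers to.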
This list is automatically generated from the titles and abstracts of the papers on this site; the site does not guarantee its accuracy and is not responsible for any consequences of its use.