Low-memory stochastic backpropagation with multi-channel randomized
trace estimation
- URL: http://arxiv.org/abs/2106.06998v2
- Date: Wed, 16 Jun 2021 16:02:56 GMT
- Title: Low-memory stochastic backpropagation with multi-channel randomized
trace estimation
- Authors: Mathias Louboutin, Ali Siahkoohi, Rongrong Wang, Felix J. Herrmann
- Abstract summary: We propose to approximate the gradient of convolutional layers in neural networks with a multi-channel randomized trace estimation technique.
Compared to other methods, this approach is simple, amenable to analyses, and leads to a greatly reduced memory footprint.
- We discuss the performance of networks trained with stochastic backpropagation and how the error can be controlled while maximizing memory usage and minimizing computational overhead.
- Score: 6.985273194899884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thanks to the combination of state-of-the-art accelerators and highly
optimized open software frameworks, there has been tremendous progress in the
performance of deep neural networks. While these developments have been
responsible for many breakthroughs, progress towards solving large-scale
problems, such as video encoding and semantic segmentation in 3D, is hampered
because access to on-premise memory is often limited. Instead of relying on
(optimal) checkpointing or invertibility of the network layers -- to recover
the activations during backpropagation -- we propose to approximate the
gradient of convolutional layers in neural networks with a multi-channel
randomized trace estimation technique. Compared to other methods, this approach
is simple, amenable to analyses, and leads to a greatly reduced memory
footprint. Even though the randomized trace estimation introduces stochasticity
during training, we argue that this is of little consequence as long as the
induced errors are of the same order as errors in the gradient due to the use
of stochastic gradient descent. We discuss the performance of networks trained
with stochastic backpropagation and how the error can be controlled while
maximizing memory usage and minimizing computational overhead.
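As a concrete illustration, below is a minimal NumPy sketch of the underlying estimator for a single dense layer; it is a simplified stand-in, not the authors' implementation, which probes convolutional layers channel by channel. The function names, shapes, toy data, and Rademacher probe construction are illustrative assumptions. The idea: the weight gradient dL/dW = dY X^T touches the stored activations X only through products with a random probing matrix Z satisfying E[Z Z^T] = I, so keeping the r probed columns X Z in place of the full activations yields the unbiased, low-memory estimate (dY Z)(X Z)^T.

```python
import numpy as np

def forward_probed(x, w, z):
    """Forward pass of a dense layer that keeps only the probed activations x @ z.

    x : (n_in, batch)  input activations
    w : (n_out, n_in)  layer weights
    z : (batch, r)     random probing matrix with E[z @ z.T] = I
    """
    y = w @ x          # usual layer output, (n_out, batch)
    xz = x @ z         # compressed activations kept for the backward pass, (n_in, r)
    return y, xz

def grad_estimate(dy, xz, z):
    """Low-memory estimate of the weight gradient dL/dW = dy @ x.T.

    Since E[z @ z.T] = I, we have dy @ x.T = E[(dy @ z) @ (x @ z).T], so the
    outer product of the probed quantities is an unbiased estimator.
    """
    return (dy @ z) @ xz.T          # (n_out, n_in)

rng = np.random.default_rng(0)
n_in, n_out, batch = 64, 32, 1024

# Toy activations and output gradients with a shared component across the batch,
# so the exact gradient has structure to recover (with purely independent random
# data the exact gradient itself would average out to noise).
x = rng.standard_normal((n_in, 1)) + 0.5 * rng.standard_normal((n_in, batch))
dy = rng.standard_normal((n_out, 1)) + 0.5 * rng.standard_normal((n_out, batch))

w = rng.standard_normal((n_out, n_in))
g_true = dy @ x.T                   # exact gradient; needs the full activations x

for r in (8, 32, 128):
    # Rademacher probes scaled so that E[z @ z.T] = I
    z = rng.choice([-1.0, 1.0], size=(batch, r)) / np.sqrt(r)
    _, xz = forward_probed(x, w, z)
    g = grad_estimate(dy, xz, z)
    rel_err = np.linalg.norm(g - g_true) / np.linalg.norm(g_true)
    print(f"r={r:4d}  stored fraction={r / batch:.3f}  relative error={rel_err:.3f}")
```

The probe count r sets the memory/accuracy trade-off discussed in the abstract: only an (n_in, r) array is stored instead of the full (n_in, batch) activations, and r needs to be just large enough for the induced error to stay on the order of the noise stochastic gradient descent already introduces.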
Related papers
- Correlations Are Ruining Your Gradient Descent [1.2432046687586285]
Natural gradient descent illuminates how gradient vectors, pointing at directions of steepest descent, can be improved by considering the local curvature of loss landscapes.
We show that correlations in the data entering any linear transformation, including the node responses at every layer of a neural network, cause a non-orthonormal relationship between the model's parameters.
We describe a range of methods which have been proposed for decorrelation and whitening of node output, and expand on these to provide a novel method specifically useful for distributed computing and computational neuroscience.
arXiv Detail & Related papers (2024-07-15T14:59:43Z) - Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training [30.452060061499523]
We introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation.
Experiments demonstrate the effectiveness of the approximation technique in neural network training.
arXiv Detail & Related papers (2024-03-18T23:23:50Z) - A Bootstrap Algorithm for Fast Supervised Learning [0.0]
Training a neural network (NN) typically relies on some type of curve-following method, such as gradient descent (GD), stochastic gradient descent (SGD), ADADELTA, ADAM or limited-memory algorithms.
Convergence for these algorithms usually relies on having access to a large quantity of observations in order to achieve a high level of accuracy and, with certain classes of functions, these algorithms could take multiple epochs of data points to catch on.
Herein, a different technique with the potential of achieving dramatically better speeds of convergence is explored: it does not curve-follow but rather relies on 'decoupling' hidden layers and on
arXiv Detail & Related papers (2023-05-04T18:28:18Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Semantic Strengthening of Neuro-Symbolic Learning [85.6195120593625]
Neuro-symbolic approaches typically resort to fuzzy approximations of a probabilistic objective.
We show how to compute this efficiently for tractable circuits.
We test our approach on three tasks: predicting a minimum-cost path in Warcraft, predicting a minimum-cost perfect matching, and solving Sudoku puzzles.
arXiv Detail & Related papers (2023-02-28T00:04:22Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a minimal sketch of this activity-perturbation idea appears after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Convergence rates for gradient descent in the training of
overparameterized artificial neural networks with biases [3.198144010381572]
In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits.
It is still unclear why randomly initialized gradient descent algorithms succeed in training such networks.
arXiv Detail & Related papers (2021-02-23T18:17:47Z) - Short-Term Memory Optimization in Recurrent Neural Networks by
Autoencoder-based Initialization [79.42778415729475]
We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
arXiv Detail & Related papers (2020-11-05T14:57:16Z) - Variance Reduction for Deep Q-Learning using Stochastic Recursive
Gradient [51.880464915253924]
Deep Q-learning algorithms often suffer from poor gradient estimations with an excessive variance.
This paper introduces a framework for updating the gradient estimates in deep Q-learning with stochastic recursive gradients, yielding a novel algorithm called SRG-DQN.
arXiv Detail & Related papers (2020-07-25T00:54:20Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Semi-Implicit Back Propagation [1.5533842336139065]
We propose a semi-implicit back propagation method for neural network training.
The differences on the neurons are propagated in a backward fashion and the parameters are updated with a proximal mapping.
Experiments on both MNIST and CIFAR-10 demonstrate that the proposed algorithm leads to better performance in terms of both loss decreasing and training/validation accuracy.
arXiv Detail & Related papers (2020-02-10T03:26:09Z)
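For the "Scaling Forward Gradient With Local Losses" entry above, the following is a minimal NumPy sketch of an activity-perturbed forward gradient for one hidden layer of a toy regression network. The architecture, Gaussian perturbations, and sample counts are illustrative assumptions; the paper's full recipe additionally relies on local losses, which are not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, k = 16, 32, 4                       # input, hidden, output sizes
x = rng.standard_normal(d)
t = rng.standard_normal(k)                # regression target
W1 = rng.standard_normal((h, d)) / np.sqrt(d)
W2 = rng.standard_normal((k, h)) / np.sqrt(h)

# Forward pass
z = W1 @ x                                # hidden pre-activations
a = np.maximum(z, 0.0)                    # ReLU
y = W2 @ a
err = y - t                               # dL/dy for L = 0.5 * ||y - t||^2

# Exact backprop gradients, kept only as a reference for the comparison below
dz_true = (z > 0) * (W2.T @ err)
dW1_true = np.outer(dz_true, x)

def forward_gradient_W1(n_samples):
    """Activity-perturbed forward gradient: perturb the hidden pre-activations z
    with Gaussian noise u, push the tangent forward to get the directional
    derivative dL = <dL/dz, u>, and use dL * u as an unbiased estimate of dL/dz."""
    est = np.zeros_like(dz_true)
    for _ in range(n_samples):
        u = rng.standard_normal(h)        # perturb the activations, not the weights
        da = (z > 0) * u                  # tangent through the ReLU
        dy = W2 @ da                      # tangent through the output layer
        dL = err @ dy                     # directional derivative of the loss along u
        est += dL * u
    dz_est = est / n_samples
    return np.outer(dz_est, x)            # chain rule: dL/dW1 = (dL/dz) x^T

for n in (1, 100, 10000):
    g = forward_gradient_W1(n)
    print(n, np.linalg.norm(g - dW1_true) / np.linalg.norm(dW1_true))
```

Averaging over more perturbation samples shrinks the estimator's variance; perturbing the lower-dimensional activations rather than the weights is what keeps that variance manageable in the first place.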