Few-Bit Backward: Quantized Gradients of Activation Functions for Memory
Footprint Reduction
- URL: http://arxiv.org/abs/2202.00441v2
- Date: Wed, 2 Feb 2022 21:21:36 GMT
- Title: Few-Bit Backward: Quantized Gradients of Activation Functions for Memory
Footprint Reduction
- Authors: Georgii Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis
Dimitrov, and Ivan Oseledets
- Abstract summary: Memory footprint is one of the main limiting factors for large neural network training.
We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions.
We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function.
- Score: 4.243810214656324
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Memory footprint is one of the main limiting factors for large neural network
training. In backpropagation, one needs to store the input to each operation in
the computational graph. Every modern neural network model has quite a few
pointwise nonlinearities in its architecture, and these operations induce
additional memory costs which -- as we show -- can be significantly reduced by
quantization of the gradients. We propose a systematic approach to compute an
optimal quantization of the retained gradients of the pointwise nonlinear
functions with only a few bits per element. We show that such an
approximation can be achieved by computing an optimal piecewise-constant
approximation of the derivative of the activation function, which can be done
by dynamic programming. The drop-in replacements are implemented for all
popular nonlinearities and can be used in any existing pipeline. We confirm the
memory reduction and unchanged convergence on several open benchmarks.
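To make the mechanism concrete, below is a minimal sketch (in PyTorch) of a drop-in GELU whose backward pass multiplies the incoming gradient by a piecewise-constant approximation of GELU's derivative, so only a small integer code per element is retained instead of the full input. The bin boundaries and per-bin values here are hand-picked for illustration; the paper obtains the optimal ones with dynamic programming, and a real implementation would bit-pack the codes.

```python
import torch
import torch.nn.functional as F

class FewBitGELU(torch.autograd.Function):
    # Illustrative 2-bit (4-level) piecewise-constant approximation of d/dx GELU(x).
    # Cut points and per-bin values are hand-picked here; the paper finds optimal ones
    # with dynamic programming.
    boundaries = torch.tensor([-1.0, 0.0, 1.0])        # 3 cut points -> 4 bins
    values = torch.tensor([-0.05, 0.15, 0.85, 1.05])   # constant derivative per bin

    @staticmethod
    def forward(ctx, x):
        # Save only the bin index of each element instead of x itself.
        codes = torch.bucketize(x, FewBitGELU.boundaries.to(x.device))
        ctx.save_for_backward(codes.to(torch.uint8))    # a real implementation would bit-pack
        return F.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (codes,) = ctx.saved_tensors
        slope = FewBitGELU.values.to(grad_output.device)[codes.long()]
        return grad_output * slope

# Usage: a drop-in replacement for the usual GELU call in the forward pass.
x = torch.randn(8, requires_grad=True)
y = FewBitGELU.apply(x)
y.sum().backward()   # x.grad now holds the few-bit approximate gradient
```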
Related papers
- Inverted Activations: Reducing Memory Footprint in Neural Network Training [5.070981175240306]
A significant challenge in neural network training is the memory footprint associated with activation tensors.
We propose a modification to the handling of activation tensors in pointwise nonlinearity layers.
We show that our method significantly reduces memory usage without affecting training accuracy or computational performance.
arXiv Detail & Related papers (2024-07-22T11:11:17Z)
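The summary above does not spell out the mechanism, but one simple instance of the general idea, computing the backward of a pointwise nonlinearity from its saved output rather than its saved input, can be sketched for softplus, whose derivative is recoverable from the output in closed form. This is an illustrative sketch, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

class OutputSavingSoftplus(torch.autograd.Function):
    """Save the layer's output instead of its input; for softplus the derivative
    can be recovered from the output, since d/dx softplus(x) = sigmoid(x) = 1 - exp(-softplus(x))."""

    @staticmethod
    def forward(ctx, x):
        y = F.softplus(x)
        ctx.save_for_backward(y)    # the output is kept; the input can be discarded
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * (1 - torch.exp(-y))
```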
- Nonlinear functional regression by functional deep neural network with kernel embedding [20.306390874610635]
We propose a functional deep neural network with an efficient and fully data-dependent dimension reduction method.
The architecture of our functional net consists of a kernel embedding step, a projection step, and a deep ReLU neural network for the prediction.
The utilization of smooth kernel embedding enables our functional net to be discretization invariant, efficient, and robust to noisy observations.
- Pruning Convolutional Filters via Reinforcement Learning with Entropy Minimization [0.0]
We introduce a novel information-theoretic reward function which minimizes the spatial entropy of convolutional activations.
Our method shows that accuracy can be preserved without directly optimizing for it in the agent's reward function.
arXiv Detail & Related papers (2023-12-08T09:34:57Z)
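As a hedged illustration of what "spatial entropy of convolutional activations" can mean, the sketch below computes a per-channel Shannon entropy over the spatial locations of a feature map; the function name, normalization, and averaging are assumptions, and the paper's exact reward definition is not reproduced here.

```python
import torch

def spatial_entropy(feature_map, eps=1e-8):
    """Shannon entropy of the spatial distribution of activation magnitudes,
    computed per channel and averaged. feature_map has shape (C, H, W)."""
    c, h, w = feature_map.shape
    p = feature_map.abs().reshape(c, h * w)
    p = p / (p.sum(dim=1, keepdim=True) + eps)      # normalize to a distribution over locations
    entropy = -(p * torch.log(p + eps)).sum(dim=1)  # entropy per channel
    return entropy.mean()
```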
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
- Low-memory stochastic backpropagation with multi-channel randomized trace estimation [6.985273194899884]
We propose to approximate the gradient of convolutional layers in neural networks with a multi-channel randomized trace estimation technique.
Compared to other methods, this approach is simple, amenable to analyses, and leads to a greatly reduced memory footprint.
We discuss the performance of networks trained with backpropagation and how the error can be controlled while minimizing memory usage and computational overhead.
arXiv Detail & Related papers (2021-06-13T13:54:02Z)
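For reference, the generic randomized trace estimation primitive this entry builds on can be sketched as a Hutchinson-style estimator; the function name and probe count are illustrative, and this is not the authors' multi-channel scheme for convolutional gradients.

```python
import torch

def hutchinson_trace(matvec, dim, num_probes=32):
    """Hutchinson estimator: tr(A) ~= mean_i z_i^T A z_i with Rademacher probes z_i.
    `matvec` only needs to supply products A @ z, so A is never formed explicitly."""
    total = 0.0
    for _ in range(num_probes):
        z = torch.randint(0, 2, (dim,)).float() * 2 - 1   # Rademacher +-1 probe
        total = total + torch.dot(z, matvec(z))
    return total / num_probes

# Example: estimate the trace of a symmetric matrix from matrix-vector products only.
A = torch.randn(100, 100)
A = (A + A.T) / 2
estimate = hutchinson_trace(lambda z: A @ z, dim=100)
exact = torch.trace(A)
```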
- Learning Frequency Domain Approximation for Binary Neural Networks [68.79904499480025]
We propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs.
The experiments on several benchmark datasets and neural architectures illustrate that the binary network learned using our method achieves state-of-the-art accuracy.
arXiv Detail & Related papers (2021-03-01T08:25:26Z)
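The general construction, using the derivative of a truncated sine (Fourier) series as a surrogate gradient for the sign function, can be sketched with a custom autograd function; the number of terms and the clipping to the main period are illustrative assumptions rather than the paper's exact estimator.

```python
import math
import torch

class FourierSign(torch.autograd.Function):
    """Binarize in forward; in backward, use the derivative of a truncated
    sine (Fourier) series of the square wave as a surrogate gradient."""

    num_terms = 4  # number of sine terms kept (illustrative)

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # d/dx of (4/pi) * sum_k sin((2k+1) pi x) / (2k+1)  =  4 * sum_k cos((2k+1) pi x)
        surrogate = torch.zeros_like(x)
        for k in range(FourierSign.num_terms):
            surrogate = surrogate + torch.cos((2 * k + 1) * math.pi * x)
        surrogate = 4.0 * surrogate
        # Restrict the surrogate to the main period (assumption, as in common BNN estimators).
        surrogate = surrogate * (x.abs() <= 1).float()
        return grad_output * surrogate

# Usage: y = FourierSign.apply(x)
```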
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
- Activation Relaxation: A Local Dynamical Approximation to Backpropagation in the Brain [62.997667081978825]
Activation Relaxation (AR) is motivated by constructing the backpropagation gradient as the equilibrium point of a dynamical system.
Our algorithm converges rapidly and robustly to the correct backpropagation gradients, requires only a single type of computational unit, and can operate on arbitrary computation graphs.
arXiv Detail & Related papers (2020-09-11T11:56:34Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)
- Randomized Automatic Differentiation [22.95414996614006]
We develop a general framework and approach for randomized automatic differentiation (RAD).
RAD allows unbiased gradient estimates to be computed with reduced memory in return for increased variance.
We show that RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks.
arXiv Detail & Related papers (2020-07-20T19:03:44Z)
- Efficient Learning of Generative Models via Finite-Difference Score Matching [111.55998083406134]
We present a generic strategy to efficiently approximate any-order directional derivative with finite difference.
Our approximation only involves function evaluations, which can be executed in parallel, and no gradient computations.
arXiv Detail & Related papers (2020-07-07T10:05:01Z)
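The first-order case of the finite-difference directional derivative mentioned above takes only two function evaluations; the sketch below is a generic central-difference estimate and does not reproduce the paper's any-order construction or its use inside score matching.

```python
import torch

def directional_derivative(f, x, v, eps=1e-3):
    """Central-difference estimate of the directional derivative of f at x along v,
    using only two function evaluations (no backpropagation)."""
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

# Example: for f(x) = sum(x**2), the exact directional derivative is 2 * <x, v>.
x = torch.randn(5)
v = torch.randn(5)
approx = directional_derivative(lambda t: (t ** 2).sum(), x, v)
exact = 2 * torch.dot(x, v)
```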