Few-Bit Backward: Quantized Gradients of Activation Functions for Memory
Footprint Reduction
- URL: http://arxiv.org/abs/2202.00441v2
- Date: Wed, 2 Feb 2022 21:21:36 GMT
- Title: Few-Bit Backward: Quantized Gradients of Activation Functions for Memory
Footprint Reduction
- Authors: Georgii Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis
Dimitrov, and Ivan Oseledets
- Abstract summary: Memory footprint is one of the main limiting factors for large neural network training.
We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions.
We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function.
- Score: 4.243810214656324
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Memory footprint is one of the main limiting factors for large neural network
training. In backpropagation, one needs to store the input to each operation in
the computational graph. Every modern neural network model has quite a few
pointwise nonlinearities in its architecture, and these operations induce
additional memory costs which -- as we show -- can be significantly reduced by
quantization of the gradients. We propose a systematic approach to compute an
optimal quantization of the retained gradients of the pointwise nonlinear
functions with only a few bits per element. We show that such an
approximation can be achieved by computing an optimal piecewise-constant
approximation of the derivative of the activation function, which can be done
by dynamic programming. The drop-in replacements are implemented for all
popular nonlinearities and can be used in any existing pipeline. We confirm the
memory reduction and unchanged convergence on several open benchmarks.
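To make the mechanism concrete, below is a minimal sketch (in PyTorch) of a drop-in GELU whose backward pass multiplies the incoming gradient by a piecewise-constant approximation of GELU's derivative, so only a small integer code per element is retained instead of the full input. The bin boundaries and per-bin values here are hand-picked for illustration; the paper obtains the optimal ones with dynamic programming, and a real implementation would bit-pack the codes.

```python
import torch
import torch.nn.functional as F

class FewBitGELU(torch.autograd.Function):
    # Illustrative 2-bit (4-level) piecewise-constant approximation of d/dx GELU(x).
    # Cut points and per-bin values are hand-picked here; the paper finds optimal ones
    # with dynamic programming.
    boundaries = torch.tensor([-1.0, 0.0, 1.0])        # 3 cut points -> 4 bins
    values = torch.tensor([-0.05, 0.15, 0.85, 1.05])   # constant derivative per bin

    @staticmethod
    def forward(ctx, x):
        # Save only the bin index of each element instead of x itself.
        codes = torch.bucketize(x, FewBitGELU.boundaries.to(x.device))
        ctx.save_for_backward(codes.to(torch.uint8))    # a real implementation would bit-pack
        return F.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (codes,) = ctx.saved_tensors
        slope = FewBitGELU.values.to(grad_output.device)[codes.long()]
        return grad_output * slope

# Usage: a drop-in replacement for the usual GELU call in the forward pass.
x = torch.randn(8, requires_grad=True)
y = FewBitGELU.apply(x)
y.sum().backward()   # x.grad now holds the few-bit approximate gradient
```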
Related papers
- Inverted Activations: Reducing Memory Footprint in Neural Network Training [5.070981175240306]
A significant challenge in neural network training is the memory footprint associated with activation tensors.
We propose a modification to the handling of activation tensors in pointwise nonlinearity layers.
We show that our method significantly reduces memory usage without affecting training accuracy or computational performance.
arXiv Detail & Related papers (2024-07-22T11:11:17Z)
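The summary above does not spell out the mechanism, but one simple instance of the general idea, computing the backward of a pointwise nonlinearity from its saved output rather than its saved input, can be sketched for softplus, whose derivative is recoverable from the output in closed form. This is an illustrative sketch, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

class OutputSavingSoftplus(torch.autograd.Function):
    """Save the layer's output instead of its input; for softplus the derivative
    can be recovered from the output, since d/dx softplus(x) = sigmoid(x) = 1 - exp(-softplus(x))."""

    @staticmethod
    def forward(ctx, x):
        y = F.softplus(x)
        ctx.save_for_backward(y)    # the output is kept; the input can be discarded
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * (1 - torch.exp(-y))
```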
- Nonlinear functional regression by functional deep neural network with kernel embedding [20.306390874610635]
We propose a functional deep neural network with an efficient and fully data-dependent dimension reduction method.
The architecture of our functional net consists of a kernel embedding step, a projection step, and a deep ReLU neural network for the prediction.
The utilization of smooth kernel embedding enables our functional net to be discretization invariant, efficient, and robust to noisy observations.
- Pruning Convolutional Filters via Reinforcement Learning with Entropy Minimization [0.0]
We introduce a novel information-theoretic reward function which minimizes the spatial entropy of convolutional activations.
Our method shows that accuracy can be preserved without directly optimizing for it in the agent's reward function.
arXiv Detail & Related papers (2023-12-08T09:34:57Z)
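As a hedged illustration of what "spatial entropy of convolutional activations" can mean, the sketch below computes a per-channel Shannon entropy over the spatial locations of a feature map; the function name, normalization, and averaging are assumptions, and the paper's exact reward definition is not reproduced here.

```python
import torch

def spatial_entropy(feature_map, eps=1e-8):
    """Shannon entropy of the spatial distribution of activation magnitudes,
    computed per channel and averaged. feature_map has shape (C, H, W)."""
    c, h, w = feature_map.shape
    p = feature_map.abs().reshape(c, h * w)
    p = p / (p.sum(dim=1, keepdim=True) + eps)      # normalize to a distribution over locations
    entropy = -(p * torch.log(p + eps)).sum(dim=1)  # entropy per channel
    return entropy.mean()
```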
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
- Low-memory stochastic backpropagation with multi-channel randomized trace estimation [6.985273194899884]
We propose to approximate the gradient of convolutional layers in neural networks with a multi-channel randomized trace estimation technique.
Compared to other methods, this approach is simple, amenable to analyses, and leads to a greatly reduced memory footprint.
We discuss the performance of networks trained with backpropagation and how the error can be controlled while minimizing memory usage and computational overhead.
arXiv Detail & Related papers (2021-06-13T13:54:02Z)
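For reference, the generic randomized trace estimation primitive this entry builds on can be sketched as a Hutchinson-style estimator; the function name and probe count are illustrative, and this is not the authors' multi-channel scheme for convolutional gradients.

```python
import torch

def hutchinson_trace(matvec, dim, num_probes=32):
    """Hutchinson estimator: tr(A) ~= mean_i z_i^T A z_i with Rademacher probes z_i.
    `matvec` only needs to supply products A @ z, so A is never formed explicitly."""
    total = 0.0
    for _ in range(num_probes):
        z = torch.randint(0, 2, (dim,)).float() * 2 - 1   # Rademacher +-1 probe
        total = total + torch.dot(z, matvec(z))
    return total / num_probes

# Example: estimate the trace of a symmetric matrix from matrix-vector products only.
A = torch.randn(100, 100)
A = (A + A.T) / 2
estimate = hutchinson_trace(lambda z: A @ z, dim=100)
exact = torch.trace(A)
```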
- Learning Frequency Domain Approximation for Binary Neural Networks [68.79904499480025]
We propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs.
The experiments on several benchmark datasets and neural architectures illustrate that the binary network learned using our method achieves state-of-the-art accuracy.
arXiv Detail & Related papers (2021-03-01T08:25:26Z)
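The general construction, using the derivative of a truncated sine (Fourier) series as a surrogate gradient for the sign function, can be sketched with a custom autograd function; the number of terms and the clipping to the main period are illustrative assumptions rather than the paper's exact estimator.

```python
import math
import torch

class FourierSign(torch.autograd.Function):
    """Binarize in forward; in backward, use the derivative of a truncated
    sine (Fourier) series of the square wave as a surrogate gradient."""

    num_terms = 4  # number of sine terms kept (illustrative)

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # d/dx of (4/pi) * sum_k sin((2k+1) pi x) / (2k+1)  =  4 * sum_k cos((2k+1) pi x)
        surrogate = torch.zeros_like(x)
        for k in range(FourierSign.num_terms):
            surrogate = surrogate + torch.cos((2 * k + 1) * math.pi * x)
        surrogate = 4.0 * surrogate
        # Restrict the surrogate to the main period (assumption, as in common BNN estimators).
        surrogate = surrogate * (x.abs() <= 1).float()
        return grad_output * surrogate

# Usage: y = FourierSign.apply(x)
```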
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
- Activation Relaxation: A Local Dynamical Approximation to Backpropagation in the Brain [62.997667081978825]
Activation Relaxation (AR) is motivated by constructing the backpropagation gradient as the equilibrium point of a dynamical system.
Our algorithm converges rapidly and robustly to the correct backpropagation gradients, requires only a single type of computational unit, and can operate on arbitrary computation graphs.
arXiv Detail & Related papers (2020-09-11T11:56:34Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)
- Randomized Automatic Differentiation [22.95414996614006]
We develop a general framework and approach for randomized automatic differentiation (RAD).
RAD allows unbiased gradient estimates to be computed with reduced memory in return for increased variance.
We show that RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks.
arXiv Detail & Related papers (2020-07-20T19:03:44Z)
- Efficient Learning of Generative Models via Finite-Difference Score Matching [111.55998083406134]
We present a generic strategy to efficiently approximate any-order directional derivative with finite difference.
Our approximation only involves function evaluations, which can be executed in parallel, and no gradient computations.
arXiv Detail & Related papers (2020-07-07T10:05:01Z)
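The first-order case of the finite-difference directional derivative mentioned above takes only two function evaluations; the sketch below is a generic central-difference estimate and does not reproduce the paper's any-order construction or its use inside score matching.

```python
import torch

def directional_derivative(f, x, v, eps=1e-3):
    """Central-difference estimate of the directional derivative of f at x along v,
    using only two function evaluations (no backpropagation)."""
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

# Example: for f(x) = sum(x**2), the exact directional derivative is 2 * <x, v>.
x = torch.randn(5)
v = torch.randn(5)
approx = directional_derivative(lambda t: (t ** 2).sum(), x, v)
exact = 2 * torch.dot(x, v)
```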