Memory-Efficient Backpropagation through Large Linear Layers
- URL: http://arxiv.org/abs/2201.13195v3
- Date: Wed, 2 Feb 2022 21:24:49 GMT
- Title: Memory-Efficient Backpropagation through Large Linear Layers
- Authors: Daniel Bershatsky, Aleksandr Mikhalev, Alexandr Katrutsa, Julia Gusak,
Daniil Merkulov and Ivan Oseledets
- Abstract summary: In modern neural networks like Transformers, linear layers require significant memory to store activations during the backward pass.
This study proposes a memory reduction approach to perform backpropagation through linear layers.
- Score: 107.20037639738433
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In modern neural networks like Transformers, linear layers require
significant memory to store activations during the backward pass. This study
proposes a memory reduction approach to perform backpropagation through linear
layers. Since the gradients of linear layers are computed by matrix
multiplications, we consider methods for randomized matrix multiplication and
demonstrate that they require less memory with a moderate decrease in test
accuracy. We also investigate the variance of the gradient estimate induced by
the randomized matrix multiplication and compare it with the variance that
comes from estimating the gradient over a batch of samples. We demonstrate the
benefits of the proposed method by fine-tuning the pre-trained RoBERTa model on
GLUE tasks.
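To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of randomized matrix multiplication inside the backward pass of a linear layer. It assumes a Gaussian sketch of the batch dimension that is re-drawn from a stored seed, so only the sketched input (k x n instead of B x n) is kept for the weight-gradient computation. The class name RandMatMulLinear, the parameter k, and the choice of sketch are illustrative assumptions, not the paper's exact estimator.

```python
import torch

class RandMatMulLinear(torch.autograd.Function):
    """Linear map y = x @ W^T that keeps only a sketched input for backward.

    Hypothetical sketch: the weight gradient dW = dY^T X is approximated by
    (S dY)^T (S X), where S is a k x B Gaussian matrix with E[S^T S] = I,
    so the estimate is unbiased. Only S X (k x n) is saved instead of X (B x n).
    """

    @staticmethod
    def forward(ctx, x, weight, k):
        batch = x.shape[0]
        # Remember a seed so the same sketch S can be re-drawn in backward.
        seed = int(torch.randint(0, 2**31 - 1, (1,)).item())
        gen = torch.Generator(device=x.device).manual_seed(seed)
        s = torch.randn(k, batch, generator=gen,
                        device=x.device, dtype=x.dtype) / k ** 0.5
        ctx.save_for_backward(s @ x, weight)  # store the k x n sketch, not x itself
        ctx.seed, ctx.k, ctx.batch = seed, k, batch
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        sx, weight = ctx.saved_tensors
        gen = torch.Generator(device=grad_out.device).manual_seed(ctx.seed)
        s = torch.randn(ctx.k, ctx.batch, generator=gen,
                        device=grad_out.device, dtype=grad_out.dtype) / ctx.k ** 0.5
        grad_w = (s @ grad_out).t() @ sx  # unbiased estimate of grad_out^T @ x
        grad_x = grad_out @ weight        # input gradient is computed exactly
        return grad_x, grad_w, None

# Example: k = 64 sketched rows stand in for a batch of 512 activations.
x = torch.randn(512, 1024, requires_grad=True)
w = torch.randn(768, 1024, requires_grad=True)
y = RandMatMulLinear.apply(x, w, 64)
y.sum().backward()  # w.grad is a randomized, unbiased estimate of the exact gradient
```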
Related papers
- BALI: Learning Neural Networks via Bayesian Layerwise Inference [6.7819070167076045]
We introduce a new method for learning Bayesian neural networks, treating them as a stack of multivariate Bayesian linear regression models.
The main idea is to infer the layerwise posterior exactly if we know the target outputs of each layer.
We define these pseudo-targets as the layer outputs from the forward pass, updated by the backpropagated gradient of the objective function.
arXiv Detail & Related papers (2024-11-18T22:18:34Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute the model's covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Hebbian learning inspired estimation of the linear regression parameters from queries [18.374824005225186]
We study a variation of the Hebbian learning rule to recover the regression vector in the linear regression model.
We prove that this Hebbian learning rule can achieve considerably faster rates than any non-adaptive method that selects the queries independently of the data.
arXiv Detail & Related papers (2023-09-26T19:00:32Z)
- Low-rank extended Kalman filtering for online learning of neural networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream.
The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior matrix.
In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
arXiv Detail & Related papers (2023-05-31T03:48:49Z)
- Neural incomplete factorization: learning preconditioners for the conjugate gradient method [2.899792823251184]
We develop a data-driven approach to accelerate the generation of effective preconditioners.
We replace the typically hand-engineered preconditioners with the output of graph neural networks.
Our method generates an incomplete factorization of the matrix and is therefore referred to as neural incomplete factorization (NeuralIF).
arXiv Detail & Related papers (2023-05-25T11:45:46Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Graph Polynomial Convolution Models for Node Classification of Non-Homophilous Graphs [52.52570805621925]
We investigate efficient learning from higher-order graph convolutions and learning directly from the adjacency matrix for node classification.
We show that the resulting model leads to new graphs and a residual scaling parameter.
We demonstrate that the proposed methods obtain improved accuracy for node classification of non-homophilous graphs.
arXiv Detail & Related papers (2022-09-12T04:46:55Z)
- High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix.
Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm.
On several simulations, our method scales better than existing approaches in computation time and memory.
arXiv Detail & Related papers (2022-02-25T16:35:26Z)
- Explainable nonlinear modelling of multiple time series with invertible neural networks [7.605814048051735]
A method for nonlinear topology identification is proposed, based on the assumption that a collection of time series is generated in two steps.
The mappings in the second step are assumed to be invertible and are modelled as shallow neural networks, so that their inverses can be numerically evaluated.
This paper explains the steps needed to calculate the gradients by applying implicit differentiation.
arXiv Detail & Related papers (2021-07-01T12:07:09Z)
- Meta-learning for Matrix Factorization without Shared Rows or Columns [39.56814839510978]
The proposed method uses a neural network that takes a matrix as input and generates prior distributions for the factorized matrices of the given matrix.
The neural network is meta-learned such that the expected imputation error is minimized.
In our experiments with three user-item rating datasets, we demonstrate that our proposed method can impute the missing values from a limited number of observations in unseen matrices.
arXiv Detail & Related papers (2021-06-29T07:40:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.