Lowering PyTorch's Memory Consumption for Selective Differentiation
- URL: http://arxiv.org/abs/2404.12406v2
- Date: Wed, 21 Aug 2024 06:21:52 GMT
- Title: Lowering PyTorch's Memory Consumption for Selective Differentiation
- Authors: Samarth Bhatia, Felix Dangel
- Abstract summary: PyTorch's current AD implementation neglects information about parameter differentiability when storing the graph.
We provide a drop-in, differentiability-agnostic implementation of such layers and demonstrate its ability to reduce memory without affecting run time.
- Score: 2.424775261485421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memory is a limiting resource for many deep learning tasks. Besides the neural network weights, one main memory consumer is the computation graph built up by automatic differentiation (AD) for backpropagation. We observe that PyTorch's current AD implementation neglects information about parameter differentiability when storing the computation graph. This information is useful, though, for reducing memory whenever gradients are requested only for a parameter subset, as is the case in many modern fine-tuning tasks. Specifically, inputs to layers that act linearly in their parameters (dense, convolution, or normalization layers) can be discarded whenever the parameters are marked as non-differentiable. We provide a drop-in, differentiability-agnostic implementation of such layers and demonstrate its ability to reduce memory without affecting run time.
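To make the idea concrete, here is a minimal sketch of a differentiability-agnostic dense layer: a custom autograd function that stores the layer input only when a weight gradient will be needed, and the weight only when an input gradient will be needed. This is not the authors' implementation; the names MemSaveLinear and _MemSaveLinearFn are made up for illustration, only 2-D inputs are handled, and the paper applies the same reasoning to convolution and normalization layers as well.

```python
import torch
from torch import nn


class _MemSaveLinearFn(torch.autograd.Function):
    """Computes y = x @ W^T + b, storing only what its backward pass will use."""

    @staticmethod
    def forward(ctx, x, weight, bias, need_grad_x, need_grad_w, need_grad_b):
        # Keep the input only if a weight gradient is needed, and the weight
        # only if an input gradient is needed (None entries are allowed).
        ctx.save_for_backward(
            x if need_grad_w else None,
            weight if need_grad_x else None,
        )
        ctx.need_grads = (need_grad_x, need_grad_w, need_grad_b)
        out = x @ weight.T
        return out if bias is None else out + bias

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        need_grad_x, need_grad_w, need_grad_b = ctx.need_grads
        grad_x = grad_out @ weight if need_grad_x else None
        grad_w = grad_out.T @ x if need_grad_w else None
        grad_b = grad_out.sum(dim=0) if need_grad_b else None
        # One gradient per forward argument; the boolean flags get None.
        return grad_x, grad_w, grad_b, None, None, None


class MemSaveLinear(nn.Linear):
    """Hypothetical drop-in replacement for nn.Linear (2-D inputs only)."""

    def forward(self, x):
        return _MemSaveLinearFn.apply(
            x,
            self.weight,
            self.bias,
            x.requires_grad,
            self.weight.requires_grad,
            self.bias is not None and self.bias.requires_grad,
        )
```

In a fine-tuning setup where a layer is frozen, e.g. via layer.weight.requires_grad_(False) and layer.bias.requires_grad_(False), such a layer no longer keeps its input alive in the computation graph, which is where the memory savings described in the abstract come from.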
Related papers
- Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the compute budget, as well as mixture-of-expert models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
- Sparser Training for On-Device Recommendation Systems [50.74019319100728]
We propose SparseRec, a lightweight embedding method based on Dynamic Sparse Training (DST).
It avoids dense gradients during backpropagation by sampling a subset of important vectors.
arXiv Detail & Related papers (2024-11-19T03:48:48Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, relying on a minimal set of late pre-trained layers alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation [29.139579820699495]
This work strives to reduce the memory overhead of fine-tuning from the perspectives of activation functions and layer normalization.
We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions.
In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers.
arXiv Detail & Related papers (2024-06-24T03:09:15Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Nesting Forward Automatic Differentiation for Memory-Efficient Deep Neural Network Training [23.536294640280087]
We propose nested forward automatic differentiation (Forward-AD) for element-wise activation functions to enable memory-efficient training.
Our evaluation shows that nested Forward-AD reduces the memory footprint by up to 1.97x compared to the baseline model.
arXiv Detail & Related papers (2022-09-22T04:48:48Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model under a limited memory budget as new classes arrive.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction [4.243810214656324]
Memory footprint is one of the main limiting factors for large neural network training.
We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions.
We show that such approximation can be achieved by computing optimal piecewise-constant approximation of the derivative of the activation function.
arXiv Detail & Related papers (2022-02-01T14:51:38Z)
- Analysis of memory consumption by neural networks based on hyperparameters [0.0]
We propose a generic analysis of memory consumption while training deep learning models.
Changes in hyperparameters and the number of hidden layers are the variables considered in the proposed approach.
arXiv Detail & Related papers (2021-10-21T18:49:44Z)
- Kanerva++: extending The Kanerva Machine with differentiable, locally block allocated latent memory [75.65949969000596]
Episodic and semantic memory are critical components of the human memory model.
We develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory.
We demonstrate that this allocation scheme improves performance in memory conditional image generation.
arXiv Detail & Related papers (2021-02-20T18:40:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.