Coop: Memory is not a Commodity
- URL: http://arxiv.org/abs/2311.00591v1
- Date: Wed, 1 Nov 2023 15:35:51 GMT
- Title: Coop: Memory is not a Commodity
- Authors: Jianhao Zhang, Shihan Ma, Peihong Liu, Jinhui Yuan
- Abstract summary: Tensor rematerialization allows the training of deep neural networks (DNNs) under limited memory budgets.
We propose to evict tensors within a sliding window to ensure all evictions are contiguous and are immediately used.
We also propose cheap tensor partitioning and recomputable in-place to further reduce the rematerialization cost.
- Score: 0.9667631210393929
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tensor rematerialization allows the training of deep neural networks (DNNs)
under limited memory budgets by checkpointing the models and recomputing the
evicted tensors as needed. However, the existing tensor rematerialization
techniques overlook the memory system in deep learning frameworks and
implicitly assume that free memory blocks at different addresses are identical.
Under this flawed assumption, discontiguous tensors are evicted, among which
some are not used to allocate the new tensor. This leads to severe memory
fragmentation and increases the cost of potential rematerializations. To
address this issue, we propose to evict tensors within a sliding window to
ensure all evictions are contiguous and are immediately used. Furthermore, we
propose cheap tensor partitioning and recomputable in-place to further reduce
the rematerialization cost by optimizing the tensor allocation. We name our
method Coop as it is a co-optimization of tensor allocation and tensor
rematerialization. We evaluated Coop on eight representative DNNs. The
experimental results demonstrate that Coop achieves up to $2\times$ memory
saving and substantially reduces compute overhead, search latency, and memory
fragmentation compared to the state-of-the-art baselines.
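A minimal, self-contained sketch of the sliding-window idea is given below. It is only an illustration under assumed names and a toy cost model, not Coop's actual implementation: the memory pool is modelled as a list of address-ordered blocks (free space or evictable tensors carrying a recomputation cost), and a window of adjacent blocks is slid over the pool to find the cheapest contiguous run whose combined size covers a new allocation, so every eviction is contiguous and immediately reused.

```python
# Minimal sketch of sliding-window contiguous eviction (illustrative only,
# not Coop's actual implementation). Blocks are kept in address order; a
# window of adjacent blocks is chosen so that every evicted tensor is
# contiguous with the others and is immediately reused by the new allocation.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    size: int                 # bytes occupied by this block
    recompute_cost: float     # 0.0 for free blocks, > 0 for evictable tensors
    free: bool = False


def cheapest_window(pool: List[Block], request: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the contiguous run of blocks with total size
    >= request that minimizes the summed recomputation cost, or None."""
    best = None
    best_cost = float("inf")
    for start in range(len(pool)):
        size, cost = 0, 0.0
        for end in range(start, len(pool)):
            size += pool[end].size
            if not pool[end].free:
                cost += pool[end].recompute_cost
            if size >= request:
                if cost < best_cost:
                    best, best_cost = (start, end), cost
                break  # growing the window further only adds cost
    return best


# Example: a 512-byte allocation prefers reusing a free block plus one cheap
# neighbour over evicting expensive or discontiguous tensors.
pool = [Block(256, 5.0), Block(256, 0.0, free=True), Block(256, 1.0), Block(512, 9.0)]
print(cheapest_window(pool, 512))   # -> (1, 2)
```

Scanning all windows is quadratic in the number of blocks; the example exists only to show why contiguous eviction avoids the fragmentation that discontiguous eviction leaves behind.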
Related papers
- Inverted Activations: Reducing Memory Footprint in Neural Network Training [5.070981175240306]
A significant challenge in neural network training is the memory footprint associated with activation tensors.
We propose a modification to the handling of activation tensors in pointwise nonlinearity layers.
We show that our method significantly reduces memory usage without affecting training accuracy or computational performance.
arXiv Detail & Related papers (2024-07-22T11:11:17Z)
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix products with reduced variance (a generic column-row sampling sketch follows the list below).
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation [29.804356645683463]
We propose a novel scheduler named DELTA for tensor swapping and tensor recomputation.
We show that DELTA saves 40%-70% of GPU memory, substantially surpassing the state-of-the-art method.
arXiv Detail & Related papers (2022-03-30T01:40:25Z)
- DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training [29.02792751614279]
A standard hardware bottleneck when training deep neural networks is GPU memory.
We propose a novel method to reduce this footprint by selecting and caching part of intermediate tensors for gradient computation.
Experiments show that we can drop up to 90% of the elements of the intermediate tensors in convolutional and fully-connected layers, saving 20% GPU memory during training.
arXiv Detail & Related papers (2022-02-28T14:12:00Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training (a hedged sketch of this pattern follows the list below).
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Efficient Tensor Completion via Element-wise Weighted Low-rank Tensor Train with Overlapping Ket Augmentation [18.438177637687357]
We propose a novel tensor completion approach via the element-wise weighted technique.
We specifically consider the recovery quality of edge elements from adjacent blocks.
Our experimental results demonstrate that the proposed algorithm TWMac-TT outperforms several other competing tensor completion methods.
arXiv Detail & Related papers (2021-09-13T06:50:37Z)
- MTC: Multiresolution Tensor Completion from Partial and Coarse Observations [49.931849672492305]
Existing completion formulations mostly rely on partial observations from a single tensor.
We propose an efficient Multi-resolution Completion model (MTC) to solve the problem.
arXiv Detail & Related papers (2021-06-14T02:20:03Z)
- Multi-version Tensor Completion for Time-delayed Spatio-temporal Data [50.762087239885936]
Real-world spatio-temporal data is often incomplete or inaccurate due to various data loading delays.
We propose a low-rank tensor model to predict the updates over time.
We obtain up to 27.2% lower root mean-squared-error compared to the best baseline method.
arXiv Detail & Related papers (2021-05-11T19:55:56Z)
- Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
- Efficient Tensor Kernel methods for sparse regression [39.95662930240854]
We introduce suitable tensor kernels to promote sparsity in the solution of the underlying regression problem.
However, storing tensors requires a considerable amount of memory, ultimately limiting the method's applicability.
First, we directly reduce the memory requirement by introducing a new and more efficient layout for storing the data.
Second, we use a Nyström-type subsampling approach, which allows for a training phase with a smaller number of data points, reducing the computational cost.
arXiv Detail & Related papers (2020-03-23T18:26:56Z)
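For the WTA-CRS entry above, the summary does not spell out the estimator, but the title points to column-row sampling for approximating matrix products. The sketch below shows only the classic unbiased column-row sampling estimator that such work builds on, with the standard norm-proportional sampling probabilities; the winner-take-all variant and the transformer-tuning setting of the paper are not reproduced, and the function name is illustrative.

```python
# Hedged sketch of plain column-row sampling for an unbiased estimate of A @ B.
# This illustrates the textbook estimator only, not the WTA-CRS method itself.
import numpy as np


def cr_sample_matmul(A: np.ndarray, B: np.ndarray, c: int, rng=None) -> np.ndarray:
    """Unbiased estimate of A @ B from c sampled column/row pairs.

    Index i is drawn with probability p_i proportional to ||A[:, i]|| * ||B[i, :]||
    (the variance-minimizing choice), and each sampled outer product is
    rescaled by 1 / (c * p_i) so the expectation equals A @ B exactly.
    """
    rng = np.random.default_rng(rng)
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = weights / weights.sum()
    idx = rng.choice(A.shape[1], size=c, p=p)   # sample with replacement
    scale = 1.0 / (c * p[idx])                  # importance weights
    return (A[:, idx] * scale) @ B[idx, :]


A, B = np.random.randn(64, 256), np.random.randn(256, 32)
approx = cr_sample_matmul(A, B, c=64, rng=0)
print(np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B))  # relative error
```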
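For the Mesa entry above, the stated pattern (exact activations in the forward pass, a low-precision copy kept for the backward pass) can be sketched as a custom autograd function. This is a hedged illustration of the general idea only, not Mesa's code: the 8-bit affine quantizer and the choice of a ReLU layer are assumptions made for brevity.

```python
# Hedged sketch of low-precision activation caching in the spirit of Mesa:
# the forward pass uses exact values, but only an int8 copy of the input is
# saved for the backward pass. Not the paper's implementation.
import torch


class LowPrecisionReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = torch.relu(x)                      # exact forward computation
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        ctx.scale = scale
        ctx.save_for_backward((x / scale).round().to(torch.int8))  # int8 copy
        return y

    @staticmethod
    def backward(ctx, grad_y):
        (x_q,) = ctx.saved_tensors
        x_hat = x_q.to(grad_y.dtype) * ctx.scale   # dequantize the saved copy
        return grad_y * (x_hat > 0)                # approximate ReLU gradient


x = torch.randn(4, 8, requires_grad=True)
LowPrecisionReLU.apply(x).sum().backward()
print(x.grad.shape)   # gradients computed from the int8 activation copy
```

Only the int8 copy (plus one scale value) survives until the backward pass, roughly a 4x reduction over keeping a float32 activation.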
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.