Memory Optimization for Deep Networks
- URL: http://arxiv.org/abs/2010.14501v3
- Date: Sat, 3 Apr 2021 00:32:47 GMT
- Title: Memory Optimization for Deep Networks
- Authors: Aashaka Shah, Chao-Yuan Wu, Jayashree Mohan, Vijay Chidambaram, Philipp Krähenbühl
- Abstract summary: We present MONeT, an automatic framework that minimizes both the memory footprint and computational overhead of deep networks.
MONeT reduces the overall memory requirement by 3x for various PyTorch models, with a 9-16% overhead in computation.
For the same computation cost, MONeT requires 1.2-1.8x less memory than current state-of-the-art automated checkpointing frameworks.
- Score: 10.519610439720909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning is slowly, but steadily, hitting a memory bottleneck. While the
tensor computation in top-of-the-line GPUs increased by 32x over the last five
years, the total available memory only grew by 2.5x. This prevents researchers
from exploring larger architectures, as training large networks requires more
memory for storing intermediate outputs. In this paper, we present MONeT, an
automatic framework that minimizes both the memory footprint and computational
overhead of deep networks. MONeT jointly optimizes the checkpointing schedule
and the implementation of various operators. MONeT is able to outperform all
prior hand-tuned operations as well as automated checkpointing. MONeT reduces
the overall memory requirement by 3x for various PyTorch models, with a 9-16%
overhead in computation. For the same computation cost, MONeT requires 1.2-1.8x
less memory than current state-of-the-art automated checkpointing frameworks.
Our code is available at https://github.com/utsaslab/MONeT.
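For reference, the sketch below shows plain activation checkpointing with PyTorch's stock torch.utils.checkpoint utilities, the general recompute-instead-of-store idea whose schedule MONeT optimizes jointly with operator implementations. It is not MONeT's own API; the model and sizes are illustrative.

```python
# Minimal activation-checkpointing sketch using stock PyTorch utilities
# (not MONeT's interface); model and shapes are illustrative only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
)
x = torch.randn(64, 1024, requires_grad=True)

# Split the network into 2 segments: only segment-boundary activations are
# kept; the interior activations are recomputed during the backward pass.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```

Checkpointing trades memory for recomputation; MONeT's contribution is choosing where to recompute and which operator implementations to use so that the overhead stays in the reported 9-16% range.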
Related papers
- Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios.
For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations.
For states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
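For intuition, here is a generic additive-coupling reversible block in PyTorch: inputs can be reconstructed exactly from outputs, so intermediate activations need not be stored. This is a textbook RevNet-style block, not the specific reversible architectures or the 8-bit state quantization proposed in the paper.

```python
# Generic reversible (additive-coupling) block: inputs are recomputed from
# outputs in inverse(), so forward activations need not be kept for backward.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1, y2):
        # Reconstruct the inputs exactly instead of storing them.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

block = ReversibleBlock(nn.Linear(256, 256), nn.Linear(256, 256))
x1, x2 = torch.randn(8, 256), torch.randn(8, 256)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)
```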
arXiv Detail & Related papers (2024-12-02T06:57:46Z)
- Cut Your Losses in Large-Vocabulary Language Models [102.6981011879656]
We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory.
CCE reduces the memory footprint of the loss from 24 GB to 1 MB, and the total training-time memory consumption of the head from 28 GB to 1 GB.
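A rough way to see the memory issue is the sequence-chunked loss below, which keeps only one chunk of logits alive at a time instead of the full (tokens x vocab) matrix; CCE goes further by never materializing the logits in global memory and fusing the gradient computation. The function and shapes are illustrative, not the CCE kernel.

```python
# Sequence-chunked cross-entropy: only a (chunk, vocab) slice of logits is
# materialized at a time. Illustrative only; not the CCE kernel.
import torch
import torch.nn.functional as F

def chunked_ce(hidden, classifier_weight, targets, chunk=1024):
    """hidden: (N, d), classifier_weight: (V, d), targets: (N,)."""
    total, count = hidden.new_zeros(()), 0
    for i in range(0, hidden.size(0), chunk):
        h, t = hidden[i:i + chunk], targets[i:i + chunk]
        logits = h @ classifier_weight.T      # only (chunk, V) lives at once
        total = total + F.cross_entropy(logits, t, reduction="sum")
        count += t.numel()
    return total / count

hidden = torch.randn(4096, 512)
W = torch.randn(32000, 512)              # vocabulary of 32k tokens
targets = torch.randint(0, 32000, (4096,))
loss = chunked_ce(hidden, W, targets)
```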
arXiv Detail & Related papers (2024-11-13T20:30:15Z)
- Less Memory Means smaller GPUs: Backpropagation with Compressed Activations [1.7065506903618906]
The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements.
Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators.
With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule.
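The sketch below shows the general save-compressed / restore-on-backward pattern using PyTorch's saved_tensors_hooks, with a simple float16 cast standing in for the compressor; the paper's actual compression is more aggressive, and this only illustrates the mechanism.

```python
# Store activations for backward in a compressed form (here: float16 cast)
# and restore them on use. Illustrates the mechanism, not the paper's scheme.
import torch
import torch.nn as nn

def pack(t: torch.Tensor):
    # Remember the original dtype so unpack can cast back.
    return t.dtype, t.to(torch.float16)

def unpack(packed):
    dtype, t = packed
    return t.to(dtype)

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(32, 512, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = model(x)
y.sum().backward()
```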
arXiv Detail & Related papers (2024-09-18T11:57:05Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose the Block Tensor-Train (BTT), a novel matrix family containing Monarch matrices, which we show performs better than dense layers for the same compute on multiple tasks.
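As a simple illustration of trading dense layers for structured ones, the hypothetical layer below uses a block-diagonal weight, one of the cheapest structured families; it is not the matrix family proposed in the paper, only a sketch of the dense-replacement idea.

```python
# Hypothetical block-diagonal replacement for a dense layer: num_blocks
# independent (block x block) weights instead of one (dim x dim) matrix.
import torch
import torch.nn as nn

class BlockDiagonalLinear(nn.Module):
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        assert dim % num_blocks == 0
        self.num_blocks, self.block = num_blocks, dim // num_blocks
        self.weight = nn.Parameter(
            torch.randn(num_blocks, self.block, self.block) * self.block ** -0.5
        )

    def forward(self, x):  # x: (batch, dim)
        xb = x.view(x.size(0), self.num_blocks, self.block)
        yb = torch.einsum("bnd,nde->bne", xb, self.weight)
        return yb.reshape(x.size(0), -1)

layer = BlockDiagonalLinear(dim=1024, num_blocks=8)  # 8x fewer weight entries
y = layer(torch.randn(4, 1024))
```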
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
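The sketch below illustrates the basic patch-by-patch idea for an early CNN stage: spatial patches are run through the first layers one at a time so only a single patch's activations are resident, then the outputs are stitched together. It ignores the receptive-field overlap handling and the architecture co-design that MCUNetV2 adds; the stage and shapes are made up for illustration.

```python
# Patch-by-patch inference for an early CNN stage: peak activation memory is
# roughly that of one patch. Simplified; ignores receptive-field overlap.
import torch
import torch.nn as nn

stage = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())

def patch_inference(x, patches_per_side=2):
    _, _, H, W = x.shape
    ph, pw = H // patches_per_side, W // patches_per_side
    rows = []
    for i in range(patches_per_side):
        cols = []
        for j in range(patches_per_side):
            patch = x[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            cols.append(stage(patch))          # only one patch resident
        rows.append(torch.cat(cols, dim=3))    # stitch along width
    return torch.cat(rows, dim=2)              # stitch along height

with torch.no_grad():
    out = patch_inference(torch.randn(1, 3, 224, 224))
```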
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
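For context, this is what a 1D dilated convolution looks like with stock PyTorch; the paper contributes an optimized x86 (AVX-512 / BF16) implementation of this operator, not the high-level API shown here. Channel counts and sequence length are illustrative.

```python
# 1D dilated convolution with stock PyTorch (not the paper's optimized kernel).
import torch
import torch.nn as nn

# dilation=8 widens the receptive field (effective kernel 65 taps) without
# extra parameters, useful for long genomics sequences.
conv = nn.Conv1d(in_channels=4, out_channels=32, kernel_size=9,
                 dilation=8, padding=32)
x = torch.randn(16, 4, 10000)   # (batch, channels, sequence length)
y = conv(x)                     # (16, 32, 10000) with this padding
```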
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling [4.903820815918411]
We propose a novel out-of-core algorithm that enables faster training of extremely large-scale neural networks with sizes larger than allotted GPU memory.
We apply virtual addressing technique, commonly performed in OS, to training of neural networks with out-of-core execution.
We successfully train ResNet-50 with a batch size of 1440, 7.5x larger than the upper bound set by physical memory, while keeping 55% of the original training speed.
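A minimal way to approximate the swap-out/swap-in behavior in PyTorch is to offload the tensors saved for backward to host memory via saved_tensors_hooks, as sketched below; the paper's system adds virtual addressing and window-based scheduling to overlap these transfers, which this sketch does not attempt.

```python
# Offload tensors saved for backward to CPU and reload them on demand.
# Only illustrates swap-out/swap-in; no virtual addressing or prefetching.
import torch
import torch.nn as nn

def offload(t: torch.Tensor):
    return t.device, t.to("cpu", non_blocking=True)

def reload(packed):
    device, t = packed
    return t.to(device, non_blocking=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                      nn.Linear(4096, 1024)).to(device)
x = torch.randn(256, 1024, device=device)

with torch.autograd.graph.saved_tensors_hooks(offload, reload):
    loss = model(x).sum()
loss.backward()
```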
arXiv Detail & Related papers (2020-10-27T07:40:04Z)
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
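A stripped-down version of the idea is sketched below: instead of masking random units as in dropout, a contiguous slice of hidden units is selected and the layer computes with the correspondingly sliced (smaller, contiguous) weights. The layer, sizes, and the omitted activation rescaling are simplifications, not the paper's exact scheme.

```python
# SliceOut-style training sketch: drop a contiguous slice of hidden units so
# the layer computes with smaller contiguous weights. Simplified illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceOutLinear(nn.Module):
    def __init__(self, d_in, d_hidden, d_out, keep: float = 0.75):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(d_hidden, d_in) * d_in ** -0.5)
        self.w2 = nn.Parameter(torch.randn(d_out, d_hidden) * d_hidden ** -0.5)
        self.keep = keep

    def forward(self, x):
        h = self.w1.size(0)
        if self.training:
            k = int(h * self.keep)
            start = torch.randint(0, h - k + 1, (1,)).item()
            w1 = self.w1[start:start + k]        # contiguous row slice
            w2 = self.w2[:, start:start + k]     # matching column slice
        else:
            w1, w2 = self.w1, self.w2            # full width at test time
        return F.linear(F.relu(F.linear(x, w1)), w2)

layer = SliceOutLinear(256, 1024, 256)
out = layer(torch.randn(8, 256))
```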
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
- Efficient Neural Network Deployment for Microcontroller [0.0]
This paper explores and generalizes convolutional neural network deployment for microcontrollers.
The memory savings and performance will be compared with CMSIS-NN framework developed for ARM Cortex-M CPUs.
The final goal is a tool that consumes a PyTorch model with trained network weights and turns it into an optimized C/C++ inference engine for microcontrollers with low memory (kilobyte level) and limited compute.
arXiv Detail & Related papers (2020-07-02T19:21:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.