Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
- URL: http://arxiv.org/abs/2211.16270v1
- Date: Tue, 29 Nov 2022 14:57:23 GMT
- Title: Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
- Authors: Stefan Braun, Erik McDermott, Roger Hsiao
- Abstract summary: We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample.
We show that our sample-wise method significantly reduces memory usage, and performs at competitive speed when compared to the default batched computation.
As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024, and audio length of 40 seconds, using only 6 GB of memory.
- Score: 5.355990925686149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The neural transducer is an end-to-end model for automatic speech recognition
(ASR). While the model is well-suited for streaming ASR, the training process
remains challenging. During training, the memory requirements may quickly
exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence
lengths. In this work, we analyze the time and space complexity of a typical
transducer training setup. We propose a memory-efficient training method that
computes the transducer loss and gradients sample by sample. We present
optimizations to increase the efficiency and parallelism of the sample-wise
method. In a set of thorough benchmarks, we show that our sample-wise method
significantly reduces memory usage, and performs at competitive speed when
compared to the default batched computation. As a highlight, we manage to
compute the transducer loss and gradients for a batch size of 1024, and audio
length of 40 seconds, using only 6 GB of memory.
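To make the idea concrete, here is a minimal sketch of sample-wise loss and gradient computation, assuming PyTorch, torchaudio's rnnt_loss, and a hypothetical joint network `joiner`. In a typical setup the memory bottleneck is the batched joint tensor of shape (B, T_max, U_max + 1, V); the sketch avoids materializing it by slicing each sample to its true lengths and running the loss and backward pass one sample at a time. It illustrates the general technique, not the authors' implementation.

```python
# Hedged sketch of sample-wise transducer loss/gradient computation.
# Assumes PyTorch + torchaudio; `joiner` is a hypothetical joint network.
import torch
import torchaudio.functional as F


def samplewise_rnnt_loss(joiner, enc_out, enc_lens, pred_out, target_lens, targets):
    """Compute the transducer loss sample by sample instead of over the padded batch.

    enc_out:  (B, T_max, H) encoder outputs
    pred_out: (B, U_max + 1, H) prediction-network outputs
    targets:  (B, U_max) integer label sequences
    """
    total_loss = enc_out.new_zeros(())
    batch_size = enc_out.size(0)
    for b in range(batch_size):
        t, u = int(enc_lens[b]), int(target_lens[b])
        # Slice away padding so the joint tensor is only (1, t, u + 1, V),
        # never (B, T_max, U_max + 1, V).
        joint = joiner(enc_out[b : b + 1, :t], pred_out[b : b + 1, : u + 1])
        loss = F.rnnt_loss(
            joint,
            targets[b : b + 1, :u].int(),
            enc_lens[b : b + 1].int(),
            target_lens[b : b + 1].int(),
            blank=0,              # assumed blank index
            reduction="sum",
        )
        # Backward per sample frees the large per-sample joint tensor right away
        # and accumulates gradients in the model parameters. retain_graph=True is
        # needed because the encoder/prediction networks were run batched.
        (loss / batch_size).backward(retain_graph=True)
        total_loss += loss.detach()
    return total_loss / batch_size
```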
Related papers
- EMP: Enhance Memory in Data Pruning [18.535687216213628]
Recently, large language and vision models have shown strong performance, but due to high pre-training and fine-tuning costs, research has shifted towards faster training via dataset pruning.
Previous methods used sample loss as an evaluation criterion, aiming to select the most "difficult" samples for training.
We propose Enhance Memory Pruning (EMP), which addresses the issue of insufficient memory under high pruning rates by enhancing the model's memory of data, thereby improving its performance.
arXiv Detail & Related papers (2024-08-28T10:29:52Z)
- Efficient NeRF Optimization -- Not All Samples Remain Equally Hard [9.404889815088161]
We propose an application of online hard sample mining for efficient training of Neural Radiance Fields (NeRF).
NeRF models produce state-of-the-art quality for many 3D reconstruction and rendering tasks but require substantial computational resources.
arXiv Detail & Related papers (2024-08-06T13:49:01Z)
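Online hard sample mining itself is simple to express; the sketch below is a generic PyTorch version (not the paper's NeRF-specific pipeline) that backpropagates only through the hardest fraction of each batch, assuming the criterion returns one loss value per sample.

```python
# Generic online hard sample mining sketch: keep only the top-k highest-loss
# samples in each batch for the backward pass.
import torch


def hard_mining_step(model, criterion, optimizer, inputs, targets, keep_ratio=0.25):
    optimizer.zero_grad()
    preds = model(inputs)
    per_sample_loss = criterion(preds, targets)    # assumes reduction="none", one value per sample
    k = max(1, int(keep_ratio * per_sample_loss.numel()))
    hard_loss, _ = torch.topk(per_sample_loss, k)  # hardest k samples only
    loss = hard_loss.mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```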
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
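TBPTT is a standard trick; the minimal generic sketch below (not the UIO-LLMs training loop) splits the sequence into fixed-length segments and detaches the recurrent state between them, so the autograd graph never spans the full context. The model here is a hypothetical stateful module returning (logits, hidden).

```python
# Minimal TBPTT sketch: backpropagate within fixed-length segments only,
# detaching the recurrent state so memory does not grow with sequence length.
import torch


def tbptt_train(model, criterion, optimizer, inputs, targets, segment_len=64):
    """inputs: (seq_len, batch, feat); targets: (seq_len, batch) class ids."""
    hidden = None
    for start in range(0, inputs.size(0), segment_len):
        seg_x = inputs[start : start + segment_len]
        seg_y = targets[start : start + segment_len]
        optimizer.zero_grad()
        logits, hidden = model(seg_x, hidden)          # hypothetical stateful model
        loss = criterion(logits.flatten(0, 1), seg_y.flatten())
        loss.backward()                                # gradient stays inside this segment
        optimizer.step()
        # Detach so the next segment does not backpropagate into this one.
        hidden = tuple(h.detach() for h in hidden) if isinstance(hidden, tuple) else hidden.detach()
```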
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
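The general pattern of skipping backpropagation through a frozen backbone can be sketched as follows; the backbone and head below are toy placeholders, and the paper's actual parallel adapter design is not reproduced.

```python
# Sketch: train a small head on frozen backbone features, never backpropagating
# through the backbone itself.
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # toy stand-in for a pretrained backbone
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten()
)
head = nn.Linear(64, 10)                       # lightweight trainable module
for p in backbone.parameters():
    p.requires_grad_(False)                    # freeze backbone weights

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()


def train_step(images, labels):
    with torch.no_grad():                      # no activations stored for the backbone
        feats = backbone(images)
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()                            # gradients only for the head
    optimizer.step()
    return loss.item()
```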
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
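The plain column-row sampling (CRS) estimator underlying this family is easy to state; the sketch below shows an unbiased CRS approximation of a matrix product and omits the paper's winner-take-all refinement.

```python
# Plain column-row sampling (CRS): an unbiased estimator of A @ B that keeps
# only `s` sampled column/row pairs of the inner dimension.
import torch


def crs_matmul(A: torch.Tensor, B: torch.Tensor, s: int) -> torch.Tensor:
    """Unbiased estimate of A @ B using s sampled inner-dimension indices."""
    # Sample index k with probability proportional to ||A[:, k]|| * ||B[k, :]||.
    probs = A.norm(dim=0) * B.norm(dim=1)
    probs = probs / probs.sum()
    idx = torch.multinomial(probs, s, replacement=True)
    # Rescale by 1 / (s * p_k) so the expectation equals the exact product.
    scale = 1.0 / (s * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]


# Usage: the estimate approaches the exact product as s grows.
A, B = torch.randn(32, 512), torch.randn(512, 16)
approx = crs_matmul(A, B, s=256)
```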
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Memory-Efficient Training of RNN-Transducer with Sampled Softmax [30.55020578002442]
We propose to apply sampled softmax to RNN-Transducer, which requires only a small subset of the vocabulary during training and thus saves memory.
We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption and still maintains the accuracy of the baseline model.
arXiv Detail & Related papers (2022-03-31T07:51:43Z)
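For orientation, a simplified sketch of the general sampled-softmax idea is shown below: logits are computed only over the true classes plus uniformly sampled negatives. It omits the sampling-bias correction and is not the paper's exact RNN-T formulation.

```python
# Simplified sampled-softmax sketch: restrict the output projection to the
# target classes plus a set of uniformly sampled negative classes.
import torch
import torch.nn.functional as Fnn


def sampled_softmax_loss(hidden, proj_weight, targets, num_sampled=1024):
    """hidden: (N, H) final hidden states; proj_weight: (V, H); targets: (N,) long."""
    vocab_size = proj_weight.size(0)
    negatives = torch.randint(0, vocab_size, (num_sampled,), device=hidden.device)
    # Reduced class set: true targets plus sampled negatives.
    classes = torch.cat([targets, negatives]).unique(sorted=True)
    sub_weight = proj_weight[classes]             # (C, H) with C << V
    logits = hidden @ sub_weight.t()              # (N, C) instead of (N, V)
    # Remap each target id to its position inside the reduced class set.
    remapped = torch.searchsorted(classes, targets)
    return Fnn.cross_entropy(logits, remapped)
```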
- Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models [0.0]
We analyse the shortest possible training time for different configurations of distributed training.
We introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time by half.
arXiv Detail & Related papers (2021-06-04T19:21:49Z)
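As a reference point, plain gradient accumulation (the baseline such methods build on; the layered scheduling itself is not reproduced here) looks like this:

```python
# Plain gradient accumulation: split a large logical batch into micro-batches
# and step the optimizer once the gradients of all micro-batches are summed.
import torch


def accumulate_step(model, criterion, optimizer, micro_batches):
    optimizer.zero_grad()
    num_micro = len(micro_batches)
    for inputs, targets in micro_batches:
        loss = criterion(model(inputs), targets) / num_micro
        loss.backward()            # gradients sum across micro-batches
    optimizer.step()               # one update for the whole logical batch
```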
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.