Memory-Efficient Training of RNN-Transducer with Sampled Softmax
- URL: http://arxiv.org/abs/2203.16868v1
- Date: Thu, 31 Mar 2022 07:51:43 GMT
- Title: Memory-Efficient Training of RNN-Transducer with Sampled Softmax
- Authors: Jaesong Lee, Lukas Lee, Shinji Watanabe
- Abstract summary: We propose to apply sampled softmax to RNN-Transducer, which requires only a small subset of the vocabulary during training and thus saves memory.
We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption and still maintains the accuracy of the baseline model.
- Score: 30.55020578002442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RNN-Transducer has been one of the promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages, including its strong accuracy and streaming-friendly property, its high memory consumption during training has been a critical problem for development. In this work, we propose to apply sampled softmax to RNN-Transducer, which requires only a small subset of the vocabulary during training and thus saves memory. We further extend sampled softmax to optimize memory consumption for a minibatch, and employ the distributions of auxiliary CTC losses for sampling the vocabulary to improve model accuracy. We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption and still maintains the accuracy of the baseline model.
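The memory bottleneck targeted here is the joint network's output tensor of shape (batch, time, label length, vocabulary); sampling the vocabulary shrinks the last dimension from the full vocabulary size to a small subset containing the blank symbol, the labels that occur in the minibatch, and a few sampled classes. Below is a minimal PyTorch-style sketch of that idea, not the authors' implementation: names such as `sampled_joint_log_probs`, `joint_proj`, and `num_sampled` are hypothetical, and the paper's minibatch-level optimization and CTC-guided sampling are simplified to plain uniform sampling.

```python
# Minimal sketch of sampled softmax for an RNN-T joint network (PyTorch).
# NOT the authors' code: uniform sampling stands in for the paper's
# CTC-guided sampling, and all names below are placeholders.
import torch
import torch.nn.functional as F


def sampled_joint_log_probs(enc_out, pred_out, joint_proj, targets,
                            vocab_size, blank_id=0, num_sampled=256):
    """enc_out: (B, T, H) encoder states; pred_out: (B, U, H) prediction states;
    joint_proj: nn.Linear(H, vocab_size) output projection; targets: (B, U-1) label ids."""
    device = enc_out.device

    # Vocabulary subset: blank + every label in the minibatch + sampled extras.
    needed = torch.cat([targets.reshape(-1),
                        torch.tensor([blank_id], device=device)])
    extra = torch.randint(0, vocab_size, (num_sampled,), device=device)
    subset = torch.unique(torch.cat([needed, extra]))      # (K,), K << vocab_size

    # Joint network: broadcast-add encoder and prediction states -> (B, T, U, H).
    joint = torch.tanh(enc_out.unsqueeze(2) + pred_out.unsqueeze(1))

    # Project onto the sampled vocabulary only: (B, T, U, K) instead of (B, T, U, V).
    # The smaller last dimension is where the memory saving comes from.
    w = joint_proj.weight[subset]                           # (K, H)
    b = joint_proj.bias[subset]                             # (K,)
    logits = joint @ w.t() + b

    # Normalize over the subset; an approximation of the full-vocabulary softmax.
    return F.log_softmax(logits, dim=-1), subset
```

Before feeding these log-probabilities to a transducer loss, the target label ids (and the blank id) must be remapped to their positions inside `subset`, since the class axis now indexes the sampled vocabulary rather than the full one.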
Related papers
- Efficient NeRF Optimization -- Not All Samples Remain Equally Hard [9.404889815088161]
We propose an application of online hard sample mining for efficient training of Neural Radiance Fields (NeRF).
NeRF models produce state-of-the-art quality for many 3D reconstruction and rendering tasks but require substantial computational resources.
arXiv Detail & Related papers (2024-08-06T13:49:01Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation [5.355990925686149]
We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample.
We show that our sample-wise method significantly reduces memory usage and performs at competitive speed compared to the default batched computation (a sketch of the sample-wise idea appears after this list).
As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024, and audio length of 40 seconds, using only 6 GB of memory.
arXiv Detail & Related papers (2022-11-29T14:57:23Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Variable Bitrate Neural Fields [75.24672452527795]
We present a dictionary method for compressing feature grids, reducing their memory consumption by up to 100x.
We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available.
arXiv Detail & Related papers (2022-06-15T17:58:34Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce the memory footprint during training by half.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Not Far Away, Not So Close: Sample Efficient Nearest Neighbour Data Augmentation via MiniMax [7.680863481076596]
MiniMax-kNN is a sample-efficient data augmentation strategy.
We exploit a semi-supervised approach based on knowledge distillation to train a model on augmented data.
arXiv Detail & Related papers (2021-05-28T06:32:32Z)
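The related entry "Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation" takes a complementary route to the sampled softmax above: instead of shrinking the vocabulary axis, it computes the transducer loss and its gradients one utterance at a time. The sketch below illustrates that sample-wise pattern; it is not that paper's implementation, `joint_fn` and the surrounding names are hypothetical, and torchaudio's rnnt_loss merely stands in for whichever transducer loss is actually used.

```python
# Minimal sketch of sample-wise transducer loss computation (PyTorch).
# NOT the paper's code: per-sample joint tensors are built and freed one at a
# time, and output gradients are accumulated for a single encoder backward.
import torch
from torchaudio.functional import rnnt_loss


def sample_wise_transducer_backward(joint_fn, enc_out, pred_out, targets,
                                    enc_lens, tgt_lens, blank_id=0):
    """joint_fn maps (enc, pred) -> joint logits of shape (1, T, U+1, V)."""
    batch_size = enc_out.size(0)
    enc_grad = torch.zeros_like(enc_out)
    pred_grad = torch.zeros_like(pred_out)
    total = 0.0

    for i in range(batch_size):
        t, u = int(enc_lens[i]), int(tgt_lens[i])
        # Detach so the per-sample backward stops at the encoder outputs.
        enc_i = enc_out[i:i + 1, :t].detach().requires_grad_(True)
        pred_i = pred_out[i:i + 1, :u + 1].detach().requires_grad_(True)

        logits = joint_fn(enc_i, pred_i)                 # (1, t, u+1, V)
        loss = rnnt_loss(logits,
                         targets[i:i + 1, :u].int(),
                         enc_lens[i:i + 1].int(),
                         tgt_lens[i:i + 1].int(),
                         blank=blank_id)

        # Per-sample backward: accumulates joint-network parameter gradients
        # and frees this sample's large joint tensor immediately.
        (loss / batch_size).backward()
        enc_grad[i, :t] = enc_i.grad[0]
        pred_grad[i, :u + 1] = pred_i.grad[0]
        total += loss.item()

    # One backward pass through the encoder and prediction networks,
    # driven by the accumulated output gradients.
    torch.autograd.backward([enc_out, pred_out], [enc_grad, pred_grad])
    return total / batch_size
```

Because each per-sample joint tensor is released right after its backward pass, peak memory scales with the largest single utterance rather than with the padded batch, which is the effect that summary describes.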