Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
- URL: http://arxiv.org/abs/2211.16270v1
- Date: Tue, 29 Nov 2022 14:57:23 GMT
- Title: Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
- Authors: Stefan Braun, Erik McDermott, Roger Hsiao
- Abstract summary: We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample.
We show that our sample-wise method significantly reduces memory usage, and performs at competitive speed when compared to the default batched computation.
As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024, and audio length of 40 seconds, using only 6 GB of memory.
- Score: 5.355990925686149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The neural transducer is an end-to-end model for automatic speech recognition
(ASR). While the model is well-suited for streaming ASR, the training process
remains challenging. During training, the memory requirements may quickly
exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence
lengths. In this work, we analyze the time and space complexity of a typical
transducer training setup. We propose a memory-efficient training method that
computes the transducer loss and gradients sample by sample. We present
optimizations to increase the efficiency and parallelism of the sample-wise
method. In a set of thorough benchmarks, we show that our sample-wise method
significantly reduces memory usage, and performs at competitive speed when
compared to the default batched computation. As a highlight, we manage to
compute the transducer loss and gradients for a batch size of 1024, and audio
length of 40 seconds, using only 6 GB of memory.
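To make the idea concrete, here is a minimal sketch of sample-wise loss and gradient computation, assuming PyTorch, torchaudio's rnnt_loss, and a hypothetical joint network `joiner`. In a typical setup the memory bottleneck is the batched joint tensor of shape (B, T_max, U_max + 1, V); the sketch avoids materializing it by slicing each sample to its true lengths and running the loss and backward pass one sample at a time. It illustrates the general technique, not the authors' implementation.

```python
# Hedged sketch of sample-wise transducer loss/gradient computation.
# Assumes PyTorch + torchaudio; `joiner` is a hypothetical joint network.
import torch
import torchaudio.functional as F


def samplewise_rnnt_loss(joiner, enc_out, enc_lens, pred_out, target_lens, targets):
    """Compute the transducer loss sample by sample instead of over the padded batch.

    enc_out:  (B, T_max, H) encoder outputs
    pred_out: (B, U_max + 1, H) prediction-network outputs
    targets:  (B, U_max) integer label sequences
    """
    total_loss = enc_out.new_zeros(())
    batch_size = enc_out.size(0)
    for b in range(batch_size):
        t, u = int(enc_lens[b]), int(target_lens[b])
        # Slice away padding so the joint tensor is only (1, t, u + 1, V),
        # never (B, T_max, U_max + 1, V).
        joint = joiner(enc_out[b : b + 1, :t], pred_out[b : b + 1, : u + 1])
        loss = F.rnnt_loss(
            joint,
            targets[b : b + 1, :u].int(),
            enc_lens[b : b + 1].int(),
            target_lens[b : b + 1].int(),
            blank=0,              # assumed blank index
            reduction="sum",
        )
        # Backward per sample frees the large per-sample joint tensor right away
        # and accumulates gradients in the model parameters. retain_graph=True is
        # needed because the encoder/prediction networks were run batched.
        (loss / batch_size).backward(retain_graph=True)
        total_loss += loss.detach()
    return total_loss / batch_size
```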
Related papers
- EMP: Enhance Memory in Data Pruning [18.535687216213628]
Recently, large language and vision models have shown strong performance, but due to high pre-training and fine-tuning costs, research has shifted towards faster training via dataset pruning.
Previous methods used sample loss as an evaluation criterion, aiming to select the most "difficult" samples for training.
We propose Enhance Memory Pruning (EMP), which addresses the issue of insufficient memory under high pruning rates by enhancing the model's memory of data, thereby improving its performance.
arXiv Detail & Related papers (2024-08-28T10:29:52Z)
- Efficient NeRF Optimization -- Not All Samples Remain Equally Hard [9.404889815088161]
We propose an application of online hard sample mining for efficient training of Neural Radiance Fields (NeRF).
NeRF models produce state-of-the-art quality for many 3D reconstruction and rendering tasks but require substantial computational resources.
arXiv Detail & Related papers (2024-08-06T13:49:01Z)
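Online hard sample mining itself is simple to express; the sketch below is a generic PyTorch version (not the paper's NeRF-specific pipeline) that backpropagates only through the hardest fraction of each batch, assuming the criterion returns one loss value per sample.

```python
# Generic online hard sample mining sketch: keep only the top-k highest-loss
# samples in each batch for the backward pass.
import torch


def hard_mining_step(model, criterion, optimizer, inputs, targets, keep_ratio=0.25):
    optimizer.zero_grad()
    preds = model(inputs)
    per_sample_loss = criterion(preds, targets)    # assumes reduction="none", one value per sample
    k = max(1, int(keep_ratio * per_sample_loss.numel()))
    hard_loss, _ = torch.topk(per_sample_loss, k)  # hardest k samples only
    loss = hard_loss.mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```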
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
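TBPTT is a standard trick; the minimal generic sketch below (not the UIO-LLMs training loop) splits the sequence into fixed-length segments and detaches the recurrent state between them, so the autograd graph never spans the full context. The model here is a hypothetical stateful module returning (logits, hidden).

```python
# Minimal TBPTT sketch: backpropagate within fixed-length segments only,
# detaching the recurrent state so memory does not grow with sequence length.
import torch


def tbptt_train(model, criterion, optimizer, inputs, targets, segment_len=64):
    """inputs: (seq_len, batch, feat); targets: (seq_len, batch) class ids."""
    hidden = None
    for start in range(0, inputs.size(0), segment_len):
        seg_x = inputs[start : start + segment_len]
        seg_y = targets[start : start + segment_len]
        optimizer.zero_grad()
        logits, hidden = model(seg_x, hidden)          # hypothetical stateful model
        loss = criterion(logits.flatten(0, 1), seg_y.flatten())
        loss.backward()                                # gradient stays inside this segment
        optimizer.step()
        # Detach so the next segment does not backpropagate into this one.
        hidden = tuple(h.detach() for h in hidden) if isinstance(hidden, tuple) else hidden.detach()
```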
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
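The general pattern of skipping backpropagation through a frozen backbone can be sketched as follows; the backbone and head below are toy placeholders, and the paper's actual parallel adapter design is not reproduced.

```python
# Sketch: train a small head on frozen backbone features, never backpropagating
# through the backbone itself.
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # toy stand-in for a pretrained backbone
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten()
)
head = nn.Linear(64, 10)                       # lightweight trainable module
for p in backbone.parameters():
    p.requires_grad_(False)                    # freeze backbone weights

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()


def train_step(images, labels):
    with torch.no_grad():                      # no activations stored for the backbone
        feats = backbone(images)
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()                            # gradients only for the head
    optimizer.step()
    return loss.item()
```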
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
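The plain column-row sampling (CRS) estimator underlying this family is easy to state; the sketch below shows an unbiased CRS approximation of a matrix product and omits the paper's winner-take-all refinement.

```python
# Plain column-row sampling (CRS): an unbiased estimator of A @ B that keeps
# only `s` sampled column/row pairs of the inner dimension.
import torch


def crs_matmul(A: torch.Tensor, B: torch.Tensor, s: int) -> torch.Tensor:
    """Unbiased estimate of A @ B using s sampled inner-dimension indices."""
    # Sample index k with probability proportional to ||A[:, k]|| * ||B[k, :]||.
    probs = A.norm(dim=0) * B.norm(dim=1)
    probs = probs / probs.sum()
    idx = torch.multinomial(probs, s, replacement=True)
    # Rescale by 1 / (s * p_k) so the expectation equals the exact product.
    scale = 1.0 / (s * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]


# Usage: the estimate approaches the exact product as s grows.
A, B = torch.randn(32, 512), torch.randn(512, 16)
approx = crs_matmul(A, B, s=256)
```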
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Memory-Efficient Training of RNN-Transducer with Sampled Softmax [30.55020578002442]
We propose to apply sampled softmax to RNN-Transducer, which requires only a small subset of the vocabulary during training and thus saves memory.
We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption and still maintains the accuracy of the baseline model.
arXiv Detail & Related papers (2022-03-31T07:51:43Z)
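For orientation, a simplified sketch of the general sampled-softmax idea is shown below: logits are computed only over the true classes plus uniformly sampled negatives. It omits the sampling-bias correction and is not the paper's exact RNN-T formulation.

```python
# Simplified sampled-softmax sketch: restrict the output projection to the
# target classes plus a set of uniformly sampled negative classes.
import torch
import torch.nn.functional as Fnn


def sampled_softmax_loss(hidden, proj_weight, targets, num_sampled=1024):
    """hidden: (N, H) final hidden states; proj_weight: (V, H); targets: (N,) long."""
    vocab_size = proj_weight.size(0)
    negatives = torch.randint(0, vocab_size, (num_sampled,), device=hidden.device)
    # Reduced class set: true targets plus sampled negatives.
    classes = torch.cat([targets, negatives]).unique(sorted=True)
    sub_weight = proj_weight[classes]             # (C, H) with C << V
    logits = hidden @ sub_weight.t()              # (N, C) instead of (N, V)
    # Remap each target id to its position inside the reduced class set.
    remapped = torch.searchsorted(classes, targets)
    return Fnn.cross_entropy(logits, remapped)
```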
- Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models [0.0]
We analyse the shortest possible training time for different configurations of distributed training.
We introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time by half.
arXiv Detail & Related papers (2021-06-04T19:21:49Z)
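As a reference point, plain gradient accumulation (the baseline such methods build on; the layered scheduling itself is not reproduced here) looks like this:

```python
# Plain gradient accumulation: split a large logical batch into micro-batches
# and step the optimizer once the gradients of all micro-batches are summed.
import torch


def accumulate_step(model, criterion, optimizer, micro_batches):
    optimizer.zero_grad()
    num_micro = len(micro_batches)
    for inputs, targets in micro_batches:
        loss = criterion(model(inputs), targets) / num_micro
        loss.backward()            # gradients sum across micro-batches
    optimizer.step()               # one update for the whole logical batch
```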
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.