Minimum Latency Training of Sequence Transducers for Streaming
End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2211.02333v1
- Date: Fri, 4 Nov 2022 09:19:59 GMT
- Title: Minimum Latency Training of Sequence Transducers for Streaming
End-to-End Speech Recognition
- Authors: Yusuke Shinohara and Shinji Watanabe
- Abstract summary: We propose a new training method to explicitly model and reduce the latency of sequence transducer models.
Experimental results show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%.
- Score: 38.28868751443619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence transducers, such as the RNN-T and the Conformer-T, are among the
most promising models for end-to-end speech recognition, especially in streaming
scenarios where both latency and accuracy are important. Although various
methods, such as alignment-restricted training and FastEmit, have been studied
to reduce the latency, latency reduction is often accompanied by a
significant degradation in accuracy. We argue that this suboptimal performance
arises because none of the prior methods explicitly models and reduces
the latency. In this paper, we propose a new training method to explicitly
model and reduce the latency of sequence transducer models. First, we define
the expected latency at each diagonal line on the lattice, and show that its
gradient can be computed efficiently within the forward-backward algorithm.
Then we augment the transducer loss with this expected latency, so that an
optimal trade-off between latency and accuracy is achieved. Experimental
results on the WSJ dataset show that the proposed minimum latency training
reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER
degradation of 0.7%, and outperforms conventional alignment-restricted training
(110 ms) and FastEmit (67 ms) methods.
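The core idea of augmenting the transducer loss with an expected latency can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: `gamma`, `lam`, and the per-label expected-emission-frame computation are assumptions for exposition (the paper defines the expected latency per diagonal of the lattice and derives its gradient inside the forward-backward algorithm).

```python
import numpy as np

def expected_emission_frames(gamma):
    """gamma[t, u]: posterior probability (e.g. from forward-backward) that
    label u is emitted at frame t; each column sums to 1.
    Returns the expected emission frame for every label."""
    T = gamma.shape[0]
    return np.arange(T) @ gamma  # sum_t t * gamma[t, u] for each u

def latency_augmented_loss(transducer_loss, gamma, lam=0.05):
    """Augment the transducer loss with the mean expected emission frame,
    so that gradients push emissions earlier. `lam` controls the
    latency/accuracy trade-off."""
    latency = expected_emission_frames(gamma).mean()
    return transducer_loss + lam * latency
```

Because the latency term is differentiable in the same posteriors the transducer loss already produces, minimizing the combined objective trades a small amount of accuracy for earlier emissions, which matches the trade-off reported in the abstract.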
Related papers
- OFDM-Standard Compatible SC-NOFS Waveforms for Low-Latency and Jitter-Tolerance Industrial IoT Communications [53.398544571833135]
This work proposes a spectrally efficient irregular Sinc (irSinc) shaping technique, revisiting the traditional Sinc back to 1924.
irSinc yields a signal with increased spectral efficiency without sacrificing error performance.
Our signal achieves faster data transmission within the same spectral bandwidth through 5G standard signal configuration.
arXiv Detail & Related papers (2024-06-07T09:20:30Z)
- CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers [21.91815582658188]
Large scale language models are delivering unprecedented performance on almost all natural language processing tasks.
The overwhelming complexity incurs a high inference latency that negatively affects user experience.
We propose to identify quasi-independent layers, which can be concurrently computed to significantly decrease inference latency.
arXiv Detail & Related papers (2024-04-10T03:30:01Z)
- Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner [84.97253871387028]
A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed.
We propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost.
Experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods.
arXiv Detail & Related papers (2023-10-14T02:19:07Z)
- TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty [14.71509986713044]
We present TrimTail, a simple but effective emission regularization method to improve the latency of streaming ASR models.
We achieve a 100-200 ms latency reduction with equal or even better accuracy on both Aishell-1 and Librispeech.
arXiv Detail & Related papers (2022-11-01T15:12:34Z)
- Delay-penalized transducer for low-latency streaming ASR [26.39851372961386]
We propose a simple way to penalize symbol delay in transducer models, so that we can balance the trade-off between symbol delay and accuracy for streaming models without external alignments.
Our method achieves similar delay-accuracy trade-off to the previously published FastEmit, but we believe our method is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay.
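One simple rendering of that average-delay penalty, consistent with the summary but not taken from the paper itself (the function name, array shapes, and `lam` are illustrative assumptions), is to bias the non-blank log-probabilities toward earlier frames before computing the transducer loss:

```python
import numpy as np

def delay_penalized_emit_logprobs(emit_logprobs, lam=0.01):
    """emit_logprobs[t, u]: log-probability of emitting label u at frame t.
    Add a per-frame offset that is positive for early frames and negative
    for late ones, so emitting early is rewarded; averaged over a sequence,
    this amounts to penalizing the mean symbol delay."""
    T = emit_logprobs.shape[0]
    offsets = lam * ((T - 1) / 2.0 - np.arange(T))  # zero-mean across frames
    return emit_logprobs + offsets[:, None]
```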
arXiv Detail & Related papers (2022-10-31T07:03:50Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR [44.229256049718316]
Streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity.
In these models, the decisions to generate tokens are delayed compared to the actual acoustic boundaries since their unidirectional encoders lack future information.
We propose several strategies during training by leveraging external hard alignments extracted from the hybrid model.
Experiments on the Cortana voice search task demonstrate that our proposed methods can significantly reduce the latency, and even improve the recognition accuracy in certain cases on the decoder side.
arXiv Detail & Related papers (2020-04-10T12:24:49Z)