FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
- URL: http://arxiv.org/abs/2010.11148v2
- Date: Wed, 3 Feb 2021 20:59:05 GMT
- Title: FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
- Authors: Jiahui Yu, Chung-Cheng Chiu, Bo Li, Shuo-yiin Chang, Tara N. Sainath,
Yanzhang He, Arun Narayanan, Wei Han, Anmol Gulati, Yonghui Wu, Ruoming Pang
- Abstract summary: Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
- Score: 78.46088089185156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized
word as quickly and accurately as possible. However, emitting fast without
degrading quality, as measured by word error rate (WER), is highly challenging.
Existing approaches including Early and Late Penalties and Constrained
Alignments penalize emission delay by manipulating per-token or per-frame
probability prediction in sequence transducer models. While being successful in
reducing delay, these approaches suffer from significant accuracy regression
and also require additional word alignment information from an existing model.
In this work, we propose a sequence-level emission regularization method, named
FastEmit, that applies latency regularization directly on per-sequence
probability in training transducer models, and does not require any alignment.
We demonstrate that FastEmit is more suitable to the sequence-level
optimization of transducer models for streaming ASR by applying it on various
end-to-end streaming ASR networks including RNN-Transducer,
Transformer-Transducer, ConvNet-Transducer and Conformer-Transducer. We achieve
150-300 ms latency reduction with significantly better accuracy over previous
techniques on a Voice Search test set. FastEmit also improves streaming ASR
accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th percentile
latency from 210 ms to only 30 ms on LibriSpeech.
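The abstract describes the regularizer only at a high level. Below is a minimal NumPy sketch, assuming the reading of the paper in which FastEmit reduces to reweighting the standard transducer forward-backward gradients: at every lattice node the gradient for emitting the next label token is scaled by (1 + lambda), while the blank gradient is left unchanged. The function and argument names (`rnnt_fastemit_grads`, `blank_id`, `lam`) are chosen here for illustration and are not from the paper.

```python
# Minimal sketch of FastEmit folded into RNN-T forward-backward gradients.
# log_probs[t, u, k] are per-node log-probabilities over the vocabulary
# (k == blank_id is blank) for one utterance with T frames and target y of
# length U. Illustration only, not a reference implementation.
import numpy as np

def logsumexp2(a, b):
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def rnnt_fastemit_grads(log_probs, y, blank_id, lam=0.01):
    T, U1, _ = log_probs.shape          # U1 = len(y) + 1
    U = U1 - 1
    NEG_INF = -1e30

    # Forward variables alpha(t, u): log-prob of reaching node (t, u).
    alpha = np.full((T, U1), NEG_INF)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:
                alpha[t, u] = logsumexp2(alpha[t, u],
                                         alpha[t - 1, u] + log_probs[t - 1, u, blank_id])
            if u > 0:
                alpha[t, u] = logsumexp2(alpha[t, u],
                                         alpha[t, u - 1] + log_probs[t, u - 1, y[u - 1]])

    # Backward variables beta(t, u): log-prob of finishing the path from (t, u).
    beta = np.full((T, U1), NEG_INF)
    beta[T - 1, U] = log_probs[T - 1, U, blank_id]
    for t in range(T - 1, -1, -1):
        for u in range(U, -1, -1):
            if t == T - 1 and u == U:
                continue
            if t < T - 1:
                beta[t, u] = logsumexp2(beta[t, u],
                                        beta[t + 1, u] + log_probs[t, u, blank_id])
            if u < U:
                beta[t, u] = logsumexp2(beta[t, u],
                                        beta[t, u + 1] + log_probs[t, u, y[u]])

    log_p = beta[0, 0]                   # log P(y | x)

    # Gradients of -log P(y|x) w.r.t. the per-node log-probabilities.
    grads = np.zeros_like(log_probs)
    for t in range(T):
        for u in range(U1):
            # Blank transition (t, u) -> (t+1, u): gradient left unchanged.
            if t < T - 1 or u == U:
                blank_beta = beta[t + 1, u] if t < T - 1 else 0.0
                grads[t, u, blank_id] = -np.exp(
                    alpha[t, u] + log_probs[t, u, blank_id] + blank_beta - log_p)
            # Label transition (t, u) -> (t, u+1): FastEmit scales this by (1 + lam).
            if u < U:
                grads[t, u, y[u]] = -(1.0 + lam) * np.exp(
                    alpha[t, u] + log_probs[t, u, y[u]] + beta[t, u + 1] - log_p)

    return -log_p, grads
```

With lam = 0, the function reduces to the ordinary transducer gradient; a larger lam pushes probability mass toward earlier label emission at some cost in accuracy, which is the latency/WER trade-off the abstract quantifies.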
Related papers
- Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner [84.97253871387028]
Diffusion models, which produce an image through thousands of denoising steps, usually suffer from slow inference.
We propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost.
Experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods.
arXiv Detail & Related papers (2023-10-14T02:19:07Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition [38.28868751443619]
We propose a new training method to explicitly model and reduce the latency of sequence transducer models.
Experimental results show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%.
arXiv Detail & Related papers (2022-11-04T09:19:59Z)
- TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty [14.71509986713044]
We present TrimTail, a simple but effective emission regularization method to improve the latency of streaming ASR models.
We achieve 100-200 ms latency reduction with equal or even better accuracy on both Aishell-1 and LibriSpeech.
arXiv Detail & Related papers (2022-11-01T15:12:34Z)
- Delay-penalized transducer for low-latency streaming ASR [26.39851372961386]
We propose a simple way to penalize symbol delay in transducer models, so that we can balance the trade-off between symbol delay and accuracy for streaming models without external alignments.
Our method achieves a similar delay-accuracy trade-off to the previously published FastEmit, but we believe it is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay (an illustrative sketch of this idea follows this entry).
arXiv Detail & Related papers (2022-10-31T07:03:50Z)
- Reducing Streaming ASR Model Delay with Self Alignment [20.61461084287351]
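The delay-penalized transducer entry does not spell out its formulation, so the sketch below only illustrates the general idea: shift the non-blank scores in the transducer lattice by an amount that decreases with the frame index, so that emitting a symbol earlier scores higher than emitting it later. The centering at (T - 1)/2 and the `penalty` scale are assumptions made here for illustration, not values taken from the paper.

```python
# Illustrative (not the paper's exact recipe) per-frame delay offset folded
# into the non-blank log-probabilities of a transducer lattice.
import numpy as np

def add_delay_penalty(log_probs, blank_id, penalty=0.01):
    """log_probs: (T, U+1, V) per-node log-probabilities for one utterance.
    Returns a copy in which non-blank scores at frame t are shifted by
    penalty * ((T - 1) / 2 - t): a bonus at early frames, a penalty at late ones."""
    T = log_probs.shape[0]
    offset = penalty * ((T - 1) / 2.0 - np.arange(T))   # (T,)
    out = log_probs + offset[:, None, None]             # shift every token score...
    out[:, :, blank_id] = log_probs[:, :, blank_id]     # ...then restore blank scores
    return out
```

Summed along any alignment path that emits its U symbols at frames t_1, ..., t_U, the shift contributes penalty * (U(T - 1)/2 - sum_u t_u), so up to a path-independent constant it penalizes the total, and hence the average, symbol emission delay.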
Constrained alignment is a well-known existing approach that penalizes predicted word boundaries using external low-latency acoustic models.
FastEmit is a sequence-level delay regularization scheme encouraging vocabulary tokens over blanks without any reference alignments.
In this paper, we propose a novel delay constraining method, named self alignment.
arXiv Detail & Related papers (2021-05-06T18:00:11Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of a CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding speed than a strong AR baseline, with only 0.0-0.3 absolute CER degradation on the Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the LibriSpeech "clean" and "other" test sets (a small masking sketch for time-restricted attention follows this entry).
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
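Time-restricted self-attention means each frame attends only to a bounded window of past and future frames, so the encoder's look-ahead, and hence its latency, stays fixed. Below is a minimal single-head sketch of that masking; the window sizes and function names are illustrative choices, not the configuration used in the cited paper.

```python
# Hedged sketch of time-restricted self-attention for a streaming encoder.
import numpy as np

def time_restricted_mask(T, left, right):
    """Boolean (T, T) mask: True where query frame t may attend to key frame s."""
    t = np.arange(T)[:, None]
    s = np.arange(T)[None, :]
    return (s >= t - left) & (s <= t + right)

def masked_self_attention(x, wq, wk, wv, left=20, right=4):
    """Single-head scaled dot-product attention with a time-restricted mask.
    x: (T, d) input frames; wq/wk/wv: (d, d) projection matrices."""
    T, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                              # (T, T)
    scores = np.where(time_restricted_mask(T, left, right), scores, -1e30)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ v
```

Because the mask caps the per-layer right context at `right` frames, the total algorithmic look-ahead of a stacked encoder grows linearly with depth, which is the main latency knob in this family of streaming encoders.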