Reducing Streaming ASR Model Delay with Self Alignment
- URL: http://arxiv.org/abs/2105.05005v1
- Date: Thu, 6 May 2021 18:00:11 GMT
- Title: Reducing Streaming ASR Model Delay with Self Alignment
- Authors: Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang and Hasim Sak
- Abstract summary: Constrained alignment is a well-known existing approach that penalizes predicted word boundaries using external low-latency acoustic models.
FastEmit is a sequence-level delay regularization scheme encouraging vocabulary tokens over blanks without any reference alignments.
In this paper, we propose a novel delay constraining method, named self alignment.
- Score: 20.61461084287351
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reducing prediction delay for streaming end-to-end ASR models with minimal
performance regression is a challenging problem. Constrained alignment is a
well-known existing approach that penalizes predicted word boundaries using
external low-latency acoustic models. In contrast, the recently proposed
FastEmit is a sequence-level delay regularization scheme that encourages vocabulary
tokens over blanks without any reference alignments. Although all these schemes
succeed in reducing delay, ASR word error rate (WER) often degrades severely
after they are applied. In this paper, we
propose a novel delay constraining method, named self alignment. Self alignment
does not require external alignment models. Instead, it utilizes Viterbi
forced-alignments from the trained model to find the lower latency alignment
direction. In LibriSpeech evaluation, self alignment outperformed existing
schemes: 25% and 56% less delay than FastEmit and constrained alignment,
respectively, at a similar word error rate. In Voice Search evaluation, 12% and 25%
delay reductions were achieved over FastEmit and constrained alignment, with
more than 2% WER improvement.
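The abstract gives only the idea behind self alignment; as a rough illustration, the PyTorch sketch below implements one plausible reading: take the token sequence on the model's current Viterbi forced-alignment path and reward the same sequence emitted one frame earlier. All names (`self_alignment_loss`, `lambda_reg`) and the flat (T, V) simplification of the real (T, U, V) transducer lattice are assumptions, not the paper's code.

```python
import torch

def self_alignment_loss(log_probs, viterbi_tokens, viterbi_frames, lambda_reg=0.05):
    """Hypothetical self-alignment regularizer (a sketch, not the authors' code).

    log_probs:      (T, V) per-frame token log-probabilities from the model
    viterbi_tokens: (U,)   token ids on the current Viterbi forced-alignment path
    viterbi_frames: (U,)   frame index at which each token is emitted on that path

    The regularizer rewards the same token sequence emitted one frame earlier,
    nudging the model toward a lower-latency alignment direction.
    """
    left_frames = (viterbi_frames - 1).clamp(min=0)   # shift the path one frame left
    left_path_ll = log_probs[left_frames, viterbi_tokens].sum()
    return -lambda_reg * left_path_ll                 # added to the base ASR loss

# toy usage: total_loss = asr_loss + self_alignment_loss(lp, toks, frames)
T, V, U = 100, 32, 7
lp = torch.randn(T, V).log_softmax(-1)
toks = torch.randint(0, V, (U,))
frames = torch.sort(torch.randperm(T)[:U]).values
print(self_alignment_loss(lp, toks, frames))
```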
Related papers
- Evaluation of real-time transcriptions using end-to-end ASR models [41.94295877935867]
In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems.
In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both the quality of the transcription and the end-to-end delay.
arXiv Detail & Related papers (2024-09-09T14:41:57Z)
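The three splitting algorithms compared in that paper are not described in the summary; purely as an illustration of the fragmentation problem, here is a minimal fixed-window splitter with overlap in NumPy. The window and hop sizes are arbitrary assumptions.

```python
import numpy as np

def split_audio(samples: np.ndarray, sr: int, win_s: float = 5.0, hop_s: float = 4.0):
    """Fixed-length splitter with overlap (illustrative only; not one of the
    paper's three algorithms). The 1 s overlap between consecutive chunks lets
    the ASR model recover words that would otherwise be cut at a boundary."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return [samples[s:s + win] for s in range(0, len(samples), hop)]

audio = np.zeros(16000 * 12)                       # 12 s of silence at 16 kHz
print([len(c) for c in split_audio(audio, 16000)]) # [80000, 80000, 64000]; tail chunk is shorter
```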
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
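That paper is an analysis rather than an algorithm, but the object it studies, an iterate moved by a gradient computed several steps earlier, is easy to state. A minimal sketch in my own notation, not the paper's:

```python
from collections import deque
import numpy as np

def delayed_sgd(grad_fn, x0, lr=0.1, delay=3, steps=50):
    """Stochastic approximation with delayed updates: at step t the iterate is
    moved using the gradient evaluated at the iterate from t - delay.
    (Illustrative loop; the paper analyzes such schemes under Markovian
    sampling rather than prescribing this exact recipe.)"""
    x = x0
    history = deque([x0] * delay, maxlen=delay)   # buffer of stale iterates
    for _ in range(steps):
        stale = history[0]                        # iterate from `delay` steps ago
        history.append(x)
        x = x - lr * grad_fn(stale)
    return x

print(delayed_sgd(lambda x: 2 * x, x0=np.array([1.0])))  # minimizes x^2; converges toward 0
```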
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
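Only the high-level idea survives in that summary: decode each block non-autoregressively while conditioning on the labels already emitted for earlier blocks. The skeleton below guesses at that control flow; `encode_block` and `nar_decode` are hypothetical stand-ins, not the paper's interfaces.

```python
import torch

def blockwise_semi_autoregressive_decode(blocks, encode_block, nar_decode):
    """Sketch of semi-autoregressive streaming decoding: each block is decoded
    non-autoregressively, but conditioned on the labels emitted for earlier
    blocks (hypothetical interfaces, not the paper's code)."""
    context: list[int] = []            # labels emitted so far
    for block in blocks:               # blocks arrive in streaming order
        enc = encode_block(block)
        labels = nar_decode(enc, torch.tensor(context, dtype=torch.long))
        context.extend(labels)         # becomes the label context for the next block
    return context

# toy stand-ins so the sketch runs end to end
blocks = [torch.randn(10, 8) for _ in range(3)]
enc_fn = lambda b: b.mean(0)
dec_fn = lambda enc, ctx: [int(enc.abs().argmax()) + len(ctx)]
print(blockwise_semi_autoregressive_decode(blocks, enc_fn, dec_fn))
```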
- Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition [38.28868751443619]
We propose a new training method to explicitly model and reduce the latency of sequence transducer models.
Experimental results show that the proposed minimum latency training reduces the latency of causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%.
arXiv Detail & Related papers (2022-11-04T09:19:59Z)
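How the latency is "explicitly modeled" is not spelled out in the summary; a common differentiable construction, the expected emission frame of each token under soft alignment posteriors, is sketched below as an assumption, not as the paper's exact loss.

```python
import torch

def expected_latency_penalty(align_posterior, lambda_lat=0.01):
    """align_posterior: (U, T) tensor where row u is a distribution over the
    frame at which token u is emitted (e.g., derived from transducer lattice
    occupancies). The penalty is the mean expected emission frame; because it
    is differentiable, gradient descent pulls emissions earlier.
    (One standard construction, not necessarily the paper's formulation.)"""
    U, T = align_posterior.shape
    frames = torch.arange(T, dtype=align_posterior.dtype)
    expected_frame = (align_posterior * frames).sum(dim=1)   # (U,) expected emission times
    return lambda_lat * expected_frame.mean()

post = torch.softmax(torch.randn(5, 80), dim=1)   # toy per-token alignment posteriors
print(expected_latency_penalty(post))
```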
- Delay-penalized transducer for low-latency streaming ASR [26.39851372961386]
We propose a simple way to penalize symbol delay in the transducer model, so that the trade-off between symbol delay and accuracy can be balanced for streaming models without external alignments.
Our method achieves a delay-accuracy trade-off similar to the previously published FastEmit, but we believe it is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay.
arXiv Detail & Related papers (2022-10-31T07:03:50Z)
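One way to penalize average symbol delay, consistent with that abstract but with the constants and sign chosen by me, is to add a zero-mean, frame-dependent offset to the non-blank transducer logits so that earlier emission is rewarded:

```python
import torch

def delay_penalized_logits(logits, blank_id=0, lam=0.01):
    """logits: (T, U, V) transducer joiner outputs. Adds a frame-dependent
    bonus to non-blank scores so that emitting a symbol earlier lowers the
    loss; the offset is zero-mean across frames, so it penalizes the
    *average* symbol delay. (A sketch consistent with the abstract; the
    paper's exact constants may differ.)"""
    T, U, V = logits.shape
    t = torch.arange(T, dtype=logits.dtype).view(T, 1, 1)
    offset = lam * ((T - 1) / 2 - t)                  # positive early, negative late
    penalized = logits + offset                       # apply to every symbol...
    penalized[..., blank_id] = logits[..., blank_id]  # ...except blank, left unchanged
    return penalized

x = torch.randn(50, 10, 32)
print(delay_penalized_logits(x).shape)   # torch.Size([50, 10, 32])
```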
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition [90.34177266618143]
We propose FastCorrect, a novel NAR error correction model based on edit alignment.
FastCorrect speeds up inference by 6-9 times while maintaining accuracy (8-14% WER reduction) compared with the autoregressive correction model.
It outperforms the accuracy of popular NAR models adopted in neural machine translation by a large margin.
arXiv Detail & Related papers (2021-05-09T05:35:36Z)
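FastCorrect's edit alignment maps each hypothesis token to a number of corrected tokens (its "duration"), which then drives non-autoregressive generation. The sketch below computes such durations from a plain Levenshtein backtrace; the tie-breaking rules are a simplification of the paper's path selection.

```python
def edit_alignment_durations(src, tgt):
    """Levenshtein-style alignment between an ASR hypothesis `src` and the
    corrected text `tgt`, returning for each source token the number of target
    tokens aligned to it (0 = deletion, 2+ = insertion region). Durations like
    these drive FastCorrect's non-autoregressive generation; the tie-breaking
    here is a simplification of the paper's rules."""
    n, m = len(src), len(tgt)
    # dp[i][j] = edit distance between src[:i] and tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j
            else:
                dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1,
                               dp[i-1][j-1] + (src[i-1] != tgt[j-1]))
    # backtrace, attributing each target token to one source token
    dur = [0] * n
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (src[i-1] != tgt[j-1]):
            dur[i-1] += 1; i -= 1; j -= 1      # match or substitution
        elif j > 0 and dp[i][j] == dp[i][j-1] + 1:
            dur[max(i-1, 0)] += 1; j -= 1      # insertion: credit the current source token
        else:
            i -= 1                             # deletion: duration stays 0
    return dur

# e.g. [1, 2, 1, 1]: 'b' absorbs the inserted token, 'x' is substituted
print(edit_alignment_durations("a b x d".split(), "a b c c d".split()))
```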
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
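In published descriptions, FastEmit reduces in practice to scaling the transducer-loss gradients that flow through label (non-blank) arcs by (1 + λ) while leaving blank arcs untouched. The sketch below expresses that scaling as a custom autograd function; how the label log-probabilities are split out of the lattice is left abstract and is an assumption on my part.

```python
import torch

class FastEmitScale(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by (1 + lam) in the
    backward pass. Applying it to the label (non-blank) log-probabilities
    entering a transducer loss reproduces FastEmit's gradient modification
    (my reading of the method, not the reference implementation)."""
    @staticmethod
    def forward(ctx, label_log_probs, lam):
        ctx.lam = lam
        return label_log_probs

    @staticmethod
    def backward(ctx, grad_out):
        return (1.0 + ctx.lam) * grad_out, None   # no gradient for lam

label_lp = torch.randn(4, 3, requires_grad=True)
out = FastEmitScale.apply(label_lp, 0.01)
out.sum().backward()
print(label_lp.grad[0, 0])   # 1.01 instead of 1.0: label arcs pushed harder
```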
- Adaptive Braking for Mitigating Gradient Delay [0.8602553195689513]
We introduce Adaptive Braking, a modification to momentum-based optimizers that mitigates the effects of gradient delay.
We show that applying AB on top of SGD with momentum enables training ResNets on CIFAR-10 and ImageNet-1k under gradient delays with minimal drop in final test accuracy.
arXiv Detail & Related papers (2020-07-02T21:26:27Z)
- Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition [66.47000813920619]
We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of the non-autoregressive property, LASO predicts each textual token in the sequence without depending on the other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
arXiv Detail & Related papers (2020-05-11T04:45:02Z)
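The one mechanism that summary exposes is that each token is predicted without depending on the others. A toy parallel decoder head with that property is sketched below; the dimensions and module layout are arbitrary assumptions, not LASO's architecture.

```python
import torch
import torch.nn as nn

class ParallelDecoderHead(nn.Module):
    """Predict all output tokens in one pass from encoder states: each
    position's distribution depends only on the (attended) audio features,
    not on other output tokens, so decoding needs no left-to-right loop.
    (Toy stand-in for the non-autoregressive idea, not LASO itself.)"""
    def __init__(self, d_model=256, vocab=4000, max_len=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_len, d_model))  # one query per output slot
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, enc):                      # enc: (B, T, d_model) audio features
        q = self.queries.expand(enc.size(0), -1, -1)
        ctx, _ = self.attn(q, enc, enc)          # each slot attends to the audio
        return self.out(ctx).argmax(-1)          # (B, max_len) token ids, all in parallel

head = ParallelDecoderHead()
print(head(torch.randn(2, 120, 256)).shape)      # torch.Size([2, 64])
```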
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.