Semi-Autoregressive Streaming ASR With Label Context
- URL: http://arxiv.org/abs/2309.10926v2
- Date: Tue, 20 Feb 2024 13:06:16 GMT
- Title: Semi-Autoregressive Streaming ASR With Label Context
- Authors: Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury
- Abstract summary: We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard (SWB)/Callhome (CH) test sets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive (NAR) modeling has gained significant interest in speech
processing since these models achieve dramatically lower inference time than
autoregressive (AR) models while maintaining good transcription accuracy.
Since NAR automatic speech recognition (ASR) models must wait for the
completion of the entire utterance before processing, some works explore
streaming NAR models based on blockwise attention for low-latency applications.
However, streaming NAR models significantly lag in accuracy compared to
streaming AR and non-streaming NAR models. To address this, we propose a
streaming "semi-autoregressive" ASR model that incorporates the labels emitted
in previous blocks as additional context using a Language Model (LM)
subnetwork. We also introduce a novel greedy decoding algorithm that addresses
insertion and deletion errors near block boundaries while not significantly
increasing the inference time. Experiments show that our method outperforms the
existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on
Librispeech-100 clean/other test sets, and 19%/8% on the
Switchboard (SWB)/Callhome (CH) test sets. It also reduces the accuracy gap with
streaming AR and non-streaming NAR models while achieving 2.5x lower latency.
We also demonstrate that our approach can effectively utilize external text
data to pre-train the LM subnetwork to further improve streaming ASR accuracy.
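As a rough illustration of the two mechanisms described in this abstract (conditioning each block's predictions on the labels emitted in previous blocks, and repairing insertion/deletion errors near block boundaries), here is a minimal Python sketch. The callables `encode_block`, `ctc_greedy`, and `lm_embed` are hypothetical stand-ins, and the one-token boundary trim is a simplification, not the paper's actual decoding algorithm.

```python
# Minimal sketch of blockwise semi-autoregressive greedy decoding.
# Assumptions (not the paper's API): `encode_block` maps one audio block
# to encoder states, `lm_embed` summarizes previously emitted labels via
# an LM subnetwork, and `ctc_greedy` decodes one block conditioned on
# that label context.

def semi_ar_stream_decode(audio_blocks, encode_block, ctc_greedy, lm_embed):
    history = []  # labels emitted in all previous blocks
    for block in audio_blocks:
        states = encode_block(block)
        # Semi-autoregressive step: condition on the running label history.
        context = lm_embed(history)
        tokens = ctc_greedy(states, context)
        # Crude boundary repair: drop a token duplicated across the seam.
        # (The paper's greedy algorithm handles insertions and deletions
        # near block boundaries more carefully.)
        if history and tokens and tokens[0] == history[-1]:
            tokens = tokens[1:]
        history.extend(tokens)
        yield tokens  # stream partial hypotheses block by block

# Toy usage with stub components:
blocks = [[0.0] * 160, [0.0] * 160]
hyps = list(semi_ar_stream_decode(
    blocks,
    encode_block=lambda b: b,
    ctc_greedy=lambda s, c: ["hello"] if not c else ["hello", "world"],
    lm_embed=lambda h: list(h),
))
print(hyps)  # [['hello'], ['world']]
```

Because the LM subnetwork consumes only committed labels from earlier blocks, decoding within each block stays parallel (non-autoregressive), which is what keeps latency low relative to a fully AR decoder.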
Related papers
- Non-Autoregressive Machine Translation: It's Not as Fast as it Seems
We point out flaws in the evaluation methodology present in the literature on NAR models.
We compare NAR models with other widely used methods for improving efficiency.
We call for more realistic and extensive evaluation of NAR models in future work.
(arXiv, 2022-05-04)
- A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation
Non-autoregressive (NAR) models generate multiple outputs in a sequence simultaneously, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive baselines.
We conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR).
The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances.
(arXiv, 2021-10-11)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
(arXiv, 2021-07-20)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640 ms latency, which is, to the best of our knowledge, state-of-the-art performance for online ASR.
(arXiv, 2021-04-08)
- TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition
Non-autoregressive (NAR) models can remove the temporal dependency between output tokens and predict the entire output sequence in as few as one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
(arXiv, 2021-04-04)
- Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data
Streaming end-to-end automatic speech recognition models are widely used on smart speakers and on-device applications.
We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher.
We scale the training of streaming models to up to 3 million hours of YouTube audio.
(arXiv, 2020-10-22)
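The teacher-student recipe in the last entry is simple enough to sketch: a full-context (non-streaming) teacher transcribes unlabeled audio, and the streaming student trains on the resulting pseudo-labels. `teacher_transcribe` and `student_loss` below are hypothetical stubs, not that paper's API.

```python
# Hedged sketch of distilling a non-streaming teacher into a streaming
# student on unlabeled audio. In practice the teacher is a full-context
# ASR model and `student_loss` is the student's usual training loss
# (e.g., an RNN-T or CTC loss).

def distillation_loss(student, unlabeled_audio, teacher_transcribe, student_loss):
    # Teacher sees each whole utterance, so its pseudo-transcripts are
    # typically better than what the streaming student could produce.
    pseudo = [teacher_transcribe(audio) for audio in unlabeled_audio]
    losses = [student_loss(student, a, t) for a, t in zip(unlabeled_audio, pseudo)]
    return sum(losses) / len(losses)
```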