Streaming Align-Refine for Non-autoregressive Deliberation
- URL: http://arxiv.org/abs/2204.07556v1
- Date: Fri, 15 Apr 2022 17:24:39 GMT
- Title: Streaming Align-Refine for Non-autoregressive Deliberation
- Authors: Weiran Wang, Ke Hu, Tara N. Sainath
- Abstract summary: We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model.
Our algorithm facilitates a simple greedy decoding procedure, and at the same time is capable of producing the decoding result at each frame with limited right context.
Experiments on voice search datasets and Librispeech show that with reasonable right context, our streaming model performs as well as the offline counterpart.
- Score: 42.748839817396046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a streaming non-autoregressive (non-AR) decoding algorithm to
deliberate the hypothesis alignment of a streaming RNN-T model. Our algorithm
facilitates a simple greedy decoding procedure, and at the same time is capable
of producing the decoding result at each frame with limited right context, thus
enjoying both high efficiency and low latency. These advantages are achieved by
converting the offline Align-Refine algorithm to be streaming-compatible, with
a novel transformer decoder architecture that performs local self-attentions
for both text and audio, and a time-aligned cross-attention at each layer.
Furthermore, we perform discriminative training of our model with the minimum
word error rate (MWER) criterion, which has not been done in the non-AR
decoding literature. Experiments on voice search datasets and Librispeech show
that with reasonable right context, our streaming model performs as well as the
offline counterpart, and discriminative training leads to further WER gain when
the first-pass model has small capacity.
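To make the architecture description above concrete, here is a minimal PyTorch sketch (not the authors' code) of the two attention patterns the abstract names: a local self-attention that sees only a bounded window around each position, and a time-aligned cross-attention in which the alignment slot at frame t may look at audio only up to t plus a limited right context. The function names, window sizes, and the `right` parameter are illustrative assumptions, not values from the paper.

```python
import torch

def local_self_attention_mask(length: int, left: int, right: int) -> torch.Tensor:
    """Boolean mask (True = may attend): position i sees [i - left, i + right]."""
    idx = torch.arange(length)
    rel = idx[None, :] - idx[:, None]   # rel[i, j] = j - i
    return (rel >= -left) & (rel <= right)

def time_aligned_cross_attention_mask(num_frames: int, right: int) -> torch.Tensor:
    """Alignment slot t (one label slot per audio frame, as in Align-Refine)
    may attend to audio frames <= t + right, so each frame's output is ready
    once `right` future frames have arrived."""
    t = torch.arange(num_frames)[:, None]   # query: alignment slot
    f = torch.arange(num_frames)[None, :]   # key: audio frame
    return f <= t + right

if __name__ == "__main__":
    sa = local_self_attention_mask(8, left=3, right=2)
    ca = time_aligned_cross_attention_mask(8, right=2)
    # Additive form usable with torch.nn.functional.scaled_dot_product_attention:
    additive = torch.zeros(ca.shape).masked_fill(~ca, float("-inf"))
    print(sa.int())
    print(additive)
```

Note that right context accumulates across stacked layers of this kind, which is why the total lookahead (and hence latency) is a tunable quantity in such streaming models.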
Related papers
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation.
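As an aside, the shared-encoder design this entry describes can be sketched schematically. The module below is a toy stand-in (LSTM encoder, simplified concatenation joint), not the FastConformer implementation, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridCTCRNNT(nn.Module):
    """Schematic: one shared encoder feeds both a CTC head and an RNN-T joint,
    so the cheaper CTC branch can decode when latency matters and the RNN-T
    branch can be used when accuracy matters."""
    def __init__(self, feat_dim: int = 80, hidden: int = 256, vocab: int = 1024):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab + 1)      # +1 for the blank label
        self.predictor = nn.Embedding(vocab + 1, hidden)  # simplified RNN-T label encoder
        self.joint = nn.Linear(hidden * 2, vocab + 1)

    def forward(self, audio: torch.Tensor, labels: torch.Tensor):
        enc, _ = self.encoder(audio)                      # (B, T, H)
        ctc_logits = self.ctc_head(enc)                   # (B, T, V+1), CTC branch
        pred = self.predictor(labels)                     # (B, U, H)
        joint = self.joint(torch.cat([                    # (B, T, U, 2H) -> (B, T, U, V+1)
            enc[:, :, None].expand(-1, -1, pred.size(1), -1),
            pred[:, None].expand(-1, enc.size(1), -1, -1)], dim=-1))
        return ctc_logits, joint

model = HybridCTCRNNT()
ctc_logits, joint = model(torch.randn(2, 50, 80), torch.randint(0, 1024, (2, 12)))
```

Here `ctc_logits` would feed a CTC loss and `joint` an RNN-T loss during training; sharing the encoder is what saves computation relative to two separate models.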
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Deliberation of Streaming RNN-Transducer by Non-autoregressive Decoding [21.978994865937786]
The method performs a few refinement steps, all sharing a transformer decoder that attends to both text features and audio features.
We show that, conditioned on hypothesis alignments of a streaming RNN-T model, our method obtains significantly more accurate recognition results than the first-pass RNN-T.
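The refinement loop this entry describes can be written down compactly. Below is a toy sketch: `ToyDecoder` is a placeholder for the shared transformer decoder (it only embeds the alignment and mixes in audio features), and two refinement steps stand in for the paper's "few"; none of these names or sizes come from the paper.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Placeholder for the shared decoder: scores every frame in parallel
    given the current hypothesis alignment and the audio features."""
    def __init__(self, vocab: int = 32, feat: int = 16, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.audio_proj = nn.Linear(feat, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, alignment: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        return self.out(self.emb(alignment) + self.audio_proj(audio))  # (B, T, V)

@torch.no_grad()
def align_refine(decoder: nn.Module, audio: torch.Tensor,
                 alignment: torch.Tensor, steps: int = 2) -> torch.Tensor:
    """Greedy non-AR deliberation: each step rescores all frames at once and
    replaces the alignment with the argmax labels; the same decoder weights
    are shared across steps."""
    for _ in range(steps):
        alignment = decoder(alignment, audio).argmax(dim=-1)
    return alignment

first_pass = torch.zeros(1, 20, dtype=torch.long)  # stand-in for an RNN-T first-pass alignment
refined = align_refine(ToyDecoder(), torch.randn(1, 20, 16), first_pass)
```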
arXiv Detail & Related papers (2021-12-01T01:34:28Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR in low-latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640 ms latency, which is, to the best of our knowledge, state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition [19.971343876930767]
We present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified.
Experiments on the open 170-hour AISHELL-1 dataset show that the proposed method can unify the streaming and non-streaming models simply and efficiently.
arXiv Detail & Related papers (2020-12-10T06:54:54Z)
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [8.046120977786702]
The Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with its encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on LibriSpeech dataset (3.6% WER on test-clean) without external language models.
arXiv Detail & Related papers (2020-08-13T08:20:02Z)
- Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [14.755108017449295]
We present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention makes decoding computationally tractable for streaming.
arXiv Detail & Related papers (2020-02-07T00:04:04Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism; a toy sketch of the triggered-attention mask follows this entry.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
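As promised in the entry above, here is a toy construction of the mask that triggered attention induces: decoding step u may only attend to encoder frames up to the frame where a first-pass CTC model emitted the u-th label, plus a small lookahead. The trigger frames and lookahead below are invented for illustration; real systems derive the triggers from CTC spike positions.

```python
import torch

def triggered_attention_mask(trigger_frames: torch.Tensor,
                             num_frames: int,
                             lookahead: int = 2) -> torch.Tensor:
    """Boolean (U, T) mask: decoding step u may attend to encoder frames
    up to trigger_frames[u] + lookahead, where trigger_frames[u] is the
    frame at which a first-pass CTC model emitted the u-th label."""
    frames = torch.arange(num_frames)[None, :]      # (1, T)
    limits = (trigger_frames + lookahead)[:, None]  # (U, 1)
    return frames <= limits

# Toy example: CTC fired its three labels at frames 4, 9, and 15.
mask = triggered_attention_mask(torch.tensor([4, 9, 15]), num_frames=20)
print(mask.int())
```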
This list is automatically generated from the titles and abstracts of the papers on this site.