Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech
Recognition
- URL: http://arxiv.org/abs/2012.05481v1
- Date: Thu, 10 Dec 2020 06:54:54 GMT
- Title: Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech
Recognition
- Authors: Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang,
Liyong Guo, Yaguang Hu, Lei Xie, Xin Lei
- Abstract summary: We present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified.
Experiments on the open 170-hour AISHELL-1 dataset show that the proposed method unifies streaming and non-streaming models simply and efficiently.
- Score: 19.971343876930767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a novel two-pass approach to unify streaming and
non-streaming end-to-end (E2E) speech recognition in a single model. Our model
adopts the hybrid CTC/attention architecture, in which the conformer layers in
the encoder are modified. We propose a dynamic chunk-based attention strategy
to allow arbitrary right context length. At inference time, the CTC decoder
generates n-best hypotheses in a streaming fashion, and the inference latency can be
controlled simply by changing the chunk size. The CTC hypotheses are then
rescored by the attention decoder to produce the final result. This efficient
rescoring process adds very little sentence-level latency. Our experiments on
the open 170-hour AISHELL-1 dataset show that the proposed method unifies
the streaming and non-streaming models simply and efficiently. On the AISHELL-1
test set, our unified model achieves 5.60% relative character error rate (CER)
reduction in non-streaming ASR compared to a standard non-streaming
transformer. The same model achieves 5.42% CER with 640ms latency in a
streaming ASR system.
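To make the dynamic chunk-based attention concrete, here is a minimal sketch of a chunked self-attention mask together with a training-time chunk-size sampler. The mask follows the description above; the sampler's name, probabilities, and distribution are illustrative assumptions, not the paper's exact recipe.

```python
import random

import torch


def chunk_attention_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Boolean self-attention mask: frame i attends to every frame in its
    own chunk and in all earlier chunks, but never to future chunks.

    chunk_size <= 0 means full context, i.e. the non-streaming mode.
    """
    if chunk_size <= 0:
        return torch.ones(num_frames, num_frames, dtype=torch.bool)
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for i in range(num_frames):
        # Right edge (exclusive) of the chunk that frame i belongs to.
        end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask[i, :end] = True
    return mask


def sample_chunk_size(max_frames: int, full_context_prob: float = 0.5) -> int:
    """Hypothetical dynamic-chunk sampler for training: sometimes use full
    context, otherwise a random chunk size, so a single model can later be
    run in either streaming or non-streaming mode."""
    if random.random() < full_context_prob:
        return 0  # full context
    return random.randint(1, max_frames)
```

With a mask like this, chunk_size <= 0 recovers the full-context non-streaming model, while a small chunk gives a streaming recognizer; the 640ms latency quoted above would correspond to, for instance, a 16-frame chunk at 40ms per downsampled encoder frame (the frame rate here is an assumption).

The second pass can be sketched just as compactly. This hypothetical attention_rescore function assumes an attention_log_prob callable that teacher-forces a full hypothesis through the attention decoder over the shared encoder output; the interpolation weight is illustrative.

```python
from typing import Callable, List, Sequence, Tuple


def attention_rescore(
    nbest: List[Tuple[Sequence[int], float]],
    attention_log_prob: Callable[[Sequence[int]], float],
    ctc_weight: float = 0.5,
) -> Sequence[int]:
    """Second pass: return the CTC hypothesis with the best weighted sum
    of its first-pass CTC score and the attention decoder's score.

    nbest holds (token_ids, ctc_log_prob) pairs from the streaming CTC
    first pass. Each hypothesis needs only one teacher-forced forward
    pass through the attention decoder, which is why rescoring adds
    little sentence-level latency.
    """
    def combined(item: Tuple[Sequence[int], float]) -> float:
        tokens, ctc_score = item
        return ctc_weight * ctc_score + (1.0 - ctc_weight) * attention_log_prob(tokens)

    return max(nbest, key=combined)[0]
```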
Related papers
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference (a simplified caching sketch appears after this list).
A hybrid CTC/RNNT architecture uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
- On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode [1.7704011486040847]
Non-streaming models provide better performance as they look at the entire audio context.
We show that the Transformer model offers acceptable WER with the lowest latency requirements.
We highlight the importance of a CNN front-end with the Transformer architecture to achieve comparable word error rates (WER).
arXiv Detail & Related papers (2022-06-26T09:12:27Z)
- Streaming Align-Refine for Non-autoregressive Deliberation [42.748839817396046]
We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model.
Our algorithm facilitates a simple greedy decoding procedure and can produce the decoding result at each frame with limited right context.
Experiments on voice search datasets and Librispeech show that with reasonable right context, our streaming model performs as well as the offline counterpart.
arXiv Detail & Related papers (2022-04-15T17:24:39Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, which is, to the best of our knowledge, state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Cascaded encoders for unifying streaming and non-streaming ASR [68.62941009369125]
This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously.
A single decoder then learns to decode either using the output of the streaming or the non-streaming encoder.
Results show that this model achieves word error rates (WER) similar to a standalone streaming model when operating in streaming mode, and obtains a 10%-27% relative improvement when operating in non-streaming mode.
arXiv Detail & Related papers (2020-10-27T20:59:50Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition [16.082949461807335]
We present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model.
We show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes.
This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy.
arXiv Detail & Related papers (2020-10-07T05:58:28Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
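The Stateful Conformer entry above caches activations so that a non-autoregressive encoder can run autoregressively, and the unified two-pass model likewise consumes audio chunk by chunk. The sketch below illustrates the general idea under two simplifying assumptions: the cache holds raw left-context frames rather than per-layer activations and convolution states, and the generic encoder callable maps T input frames to T output frames.

```python
from typing import Callable, Optional

import torch


def encode_streaming(
    encoder: Callable[[torch.Tensor], torch.Tensor],
    frames: torch.Tensor,  # (num_frames, feat_dim) acoustic features
    chunk_size: int,
    left_context: int,
) -> torch.Tensor:
    """Encode audio chunk by chunk, keeping a cache of recent frames so
    every chunk sees a bounded left context and no right context.

    Simplification: real streaming encoders cache per-layer activations
    and convolution states instead of re-encoding raw left-context frames.
    """
    cache: Optional[torch.Tensor] = None
    outputs = []
    for start in range(0, frames.size(0), chunk_size):
        chunk = frames[start:start + chunk_size]
        inp = chunk if cache is None else torch.cat([cache, chunk], dim=0)
        out = encoder(inp)
        # Keep only the outputs for the new chunk, not the cached context.
        outputs.append(out[-chunk.size(0):])
        # Remember the most recent left_context input frames.
        cache = inp[-left_context:]
    return torch.cat(outputs, dim=0)
```

As a quick sanity check, encode_streaming(lambda x: x, torch.randn(100, 80), chunk_size=16, left_context=64) walks a 100-frame utterance in 16-frame chunks and, with the identity encoder, reproduces the input exactly, which makes the chunking and caching logic easy to verify.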
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.