WNARS: WFST based Non-autoregressive Streaming End-to-End Speech
Recognition
- URL: http://arxiv.org/abs/2104.03587v1
- Date: Thu, 8 Apr 2021 07:56:03 GMT
- Title: WNARS: WFST based Non-autoregressive Streaming End-to-End Speech
Recognition
- Authors: Zhichao Wang, Wenwen Yang, Pan Zhou, Wei Chen
- Abstract summary: We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, to the best of our knowledge, which is the state-of-the-art performance for online ASR.
- Score: 59.975078145303605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, attention-based encoder-decoder (AED) end-to-end (E2E) models have
drawn more and more attention in the field of automatic speech recognition
(ASR). AED models, however, still have drawbacks when deploying in commercial
applications. Autoregressive beam search decoding makes it inefficient for
high-concurrency applications. It is also inconvenient to integrate external
word-level language models. The most important thing is that AED models are
difficult for streaming recognition due to global attention mechanism. In this
paper, we propose a novel framework, namely WNARS, using hybrid CTC-attention
AED models and weighted finite-state transducers (WFST) to solve these problems
together. We switch from autoregressive beam search to CTC branch decoding,
which performs first-pass decoding with WFST in chunk-wise streaming way. The
decoder branch then performs second-pass rescoring on the generated hypotheses
non-autoregressively. On the AISHELL-1 task, our WNARS achieves a character
error rate of 5.22% with 640ms latency, to the best of our knowledge, which is
the state-of-the-art performance for online ASR. Further experiments on our
10,000-hour Mandarin task show the proposed method achieves more than 20%
improvements with 50% latency compared to a strong TDNN-BLSTM lattice-free MMI
baseline.
Related papers
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z) - Streaming Speech-to-Confusion Network Speech Recognition [19.720334657478475]
We present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency.
We show that 1-best results of our model are on par with a comparable RNN-T system.
We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
arXiv Detail & Related papers (2023-06-02T20:28:14Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - Relaxed Attention: A Simple Method to Boost Performance of End-to-End
Automatic Speech Recognition [27.530537066239116]
We introduce the concept of relaxed attention, which is a gradual injection of a uniform distribution to the encoder-decoder attention weights during training.
We find that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models.
On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art (4.20%) by 13.1% relative.
arXiv Detail & Related papers (2021-07-02T21:01:17Z) - Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech
Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which in contrast to restricted self-attention prevents the overall context to grow beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z) - Raw Waveform Encoder with Multi-Scale Globally Attentive Locally
Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on a benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpus of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z) - Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech
Recognition [19.971343876930767]
We present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified.
Experiments on the open 170-hour AISHELL-1 dataset show that, the proposed method can unify the streaming and non-streaming model simply and efficiently.
arXiv Detail & Related papers (2020-12-10T06:54:54Z) - Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.