Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech
Recognition
- URL: http://arxiv.org/abs/2107.01269v1
- Date: Fri, 2 Jul 2021 20:56:13 GMT
- Title: Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech
Recognition
- Authors: Niko Moritz, Takaaki Hori, Jonathan Le Roux
- Abstract summary: Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
- Score: 58.69803243323346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based end-to-end automatic speech recognition (ASR) systems have
recently demonstrated state-of-the-art results for numerous tasks. However, the
application of self-attention and attention-based encoder-decoder models
remains challenging for streaming ASR, where each word must be recognized
shortly after it was spoken. In this work, we present the dual
causal/non-causal self-attention (DCN) architecture, which, in contrast to
restricted self-attention, prevents the overall context from growing beyond the
look-ahead of a single layer when used in a deep architecture. DCN is compared
to chunk-based and restricted self-attention using streaming transformer and
conformer architectures, showing improved ASR performance over restricted
self-attention and competitive ASR results compared to chunk-based
self-attention, while providing the advantage of frame-synchronous processing.
Combined with triggered attention, the proposed streaming end-to-end ASR
systems obtained state-of-the-art results on the LibriSpeech, HKUST, and
Switchboard ASR tasks.
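The context growth that the abstract attributes to restricted self-attention can be illustrated with a small sketch. It assumes each layer lets a frame attend to `left` past frames and `right` future frames; the function names and toy sizes are hypothetical and not taken from the paper. The sketch only shows how the look-ahead of stacked restricted layers accumulates with depth. It does not implement DCN itself.

```python
def restricted_mask(num_frames, left, right):
    """mask[t][s] is True when query frame t may attend to key frame s."""
    return [[-left <= s - t <= right for s in range(num_frames)]
            for t in range(num_frames)]


def stack_reach(mask, num_layers):
    """reach[t][s] is True when the output at frame t, after num_layers
    stacked layers sharing the same mask, depends on input frame s."""
    n = len(mask)
    # Before any layer, each frame depends only on itself.
    reach = [[t == s for s in range(n)] for t in range(n)]
    for _ in range(num_layers):
        # Frame t sees frame k in this layer, and k already depends on s.
        reach = [[any(mask[t][k] and reach[k][s] for k in range(n))
                  for s in range(n)]
                 for t in range(n)]
    return reach


def max_lookahead(reach):
    """Largest number of future input frames any output frame depends on."""
    return max(s - t
               for t, row in enumerate(reach)
               for s, dep in enumerate(row) if dep)


if __name__ == "__main__":
    mask = restricted_mask(num_frames=32, left=2, right=1)
    for layers in (1, 4, 12):
        print(layers, max_lookahead(stack_reach(mask, layers)))
```

With a per-layer look-ahead of one frame, the effective look-ahead of the stack equals the number of layers (as long as the sequence is long enough), which is the growth that DCN is described as avoiding.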
Related papers
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, SRU++ can surpass Conformer on long-form speech input by a large margin, based on our analysis.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, which is, to the best of our knowledge, the state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Capturing Multi-Resolution Context by Dilated Self-Attention [58.69803243323346]
We propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention.
The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.
ASR results demonstrate substantial improvements compared to restricted self-attention alone, achieving similar results compared to full-sequence based self-attention with a fraction of the computational costs.
arXiv Detail & Related papers (2021-04-07T02:04:18Z)
- Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition [25.93405777713522]
We investigate whether it is possible to employ the original architecture of attention-based ASR for ISR tasks.
We design an alternative student network that, instead of using a thinner or a shallower model, keeps the original architecture of the teacher model but with shorter sequences.
Our experiments show that by delaying the start of the recognition process by about 1.7 sec, we can achieve performance comparable to one that needs to wait until the end.
arXiv Detail & Related papers (2020-11-04T05:06:01Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.