An Investigation of Enhancing CTC Model for Triggered Attention-based
Streaming ASR
- URL: http://arxiv.org/abs/2110.10402v1
- Date: Wed, 20 Oct 2021 06:44:58 GMT
- Authors: Huaibo Zhao, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi
- Abstract summary: An attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system.
The proposed method achieves higher accuracy with lower latency than the conventional triggered attention-based streaming ASR system.
- Score: 19.668440671541546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the present paper, an attempt is made to combine Mask-CTC and the
triggered attention mechanism to construct a streaming end-to-end automatic
speech recognition (ASR) system that provides high performance with low
latency. The triggered attention mechanism, which performs autoregressive
decoding triggered by CTC spikes, has been shown to be effective in streaming
ASR. However, maintaining accurate alignment estimation based on the CTC
outputs, which is the key to its performance, inevitably requires decoding
with some future input frames (i.e., with higher latency). In streaming ASR,
it is desirable to achieve high recognition accuracy while keeping the latency
low. Therefore, the present study aims to achieve highly accurate streaming
ASR with low latency by introducing Mask-CTC, which learns feature
representations that anticipate future information (i.e., that consider
long-term contexts), into the encoder pre-training. Experimental comparisons
conducted using WSJ data demonstrate that the proposed method achieves higher
accuracy with lower latency than the conventional triggered attention-based
streaming ASR system.
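As an illustrative sketch (not the authors' implementation), the "decoding triggered by the CTC spike" idea above can be pictured as follows. The function name, the spike threshold, and the toy posteriors are all assumptions for illustration only:

```python
import numpy as np

def ctc_trigger_frames(log_probs: np.ndarray, blank_id: int = 0,
                       threshold: float = 0.5) -> list[int]:
    """Return frame indices where a non-blank CTC 'spike' occurs.

    log_probs: (T, V) frame-level log posteriors from the CTC branch.
    A spike is a frame whose argmax is non-blank, differs from the
    previous frame's argmax (the CTC collapsing rule), and whose
    probability exceeds the threshold. Each spike would trigger one
    autoregressive decoder step in a triggered-attention system.
    """
    best = log_probs.argmax(axis=-1)
    triggers = []
    prev = blank_id
    for t, k in enumerate(best):
        if k != blank_id and k != prev and np.exp(log_probs[t, k]) > threshold:
            triggers.append(t)
        prev = k
    return triggers

# Toy example: 6 frames, vocab {0: blank, 1: 'a', 2: 'b'}
probs = np.full((6, 3), 0.05)
probs[[0, 2, 5], 0] = 0.9              # blank-dominated frames
probs[1, 1] = 0.9                      # spike for 'a'
probs[3, 2] = 0.9
probs[4, 2] = 0.9                      # 'b' held for two frames -> one spike
log_probs = np.log(probs / probs.sum(axis=-1, keepdims=True))
print(ctc_trigger_frames(log_probs))   # → [1, 3]
```

The paper's point is that the accuracy of these trigger positions depends on how much future context the encoder has seen, which is why Mask-CTC pre-training (learning to anticipate future information) can help at low look-ahead.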
Related papers
- Mamba for Streaming ASR Combined with Unimodal Aggregation [7.6112706449833505]
Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks.
We propose an associated lookahead mechanism for leveraging controllable future information.
Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance.
arXiv Detail & Related papers (2024-09-30T12:11:49Z)
- Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access [65.21117658030235]
This work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods.
The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS) and average training overhead.
Results reveal that the MBDL outperforms by a significant margin the SIC receiver with imperfect CSIR.
arXiv Detail & Related papers (2022-05-02T12:23:55Z)
- Streaming parallel transducer beam search with fast-slow cascaded encoders [23.416682253435837]
Streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.
We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders.
arXiv Detail & Related papers (2022-03-29T17:29:39Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording [46.69852287267763]
We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches.
We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states.
Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one.
arXiv Detail & Related papers (2021-07-15T17:59:10Z)
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
- Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of an RNN-transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
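The time-restricted self-attention mentioned in the last entry amounts to a banded attention mask, where the per-layer right context bounds the streaming latency. A minimal sketch, in which the window sizes and the mask convention (True = may attend) are assumptions for illustration:

```python
import numpy as np

def time_restricted_mask(T: int, left: int, right: int) -> np.ndarray:
    """Boolean (T, T) mask for time-restricted self-attention.

    Position t may attend to frames in [t - left, t + right].
    'right' is the per-layer look-ahead; stacking L such layers
    gives a total look-ahead of L * right frames, which sets the
    algorithmic latency of the streaming encoder.
    """
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]    # rel[t, s] = s - t
    return (rel >= -left) & (rel <= right)

# 5 frames, 2 frames of left context, 1 frame of look-ahead
mask = time_restricted_mask(T=5, left=2, right=1)
print(mask.astype(int))
```

In practice this mask would be applied before the softmax in each encoder self-attention layer, with disallowed positions set to a large negative value.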
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.