Improved Mask-CTC for Non-Autoregressive End-to-End ASR
- URL: http://arxiv.org/abs/2010.13270v2
- Date: Tue, 16 Feb 2021 05:46:18 GMT
- Title: Improved Mask-CTC for Non-Autoregressive End-to-End ASR
- Authors: Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa,
Tetsunori Kobayashi
- Abstract summary: Mask-CTC is a recently proposed end-to-end ASR system that combines mask-predict with connectionist temporal classification (CTC).
We propose to enhance the encoder by adopting the recently proposed Conformer architecture.
Next, we propose new training and decoding methods that introduce an auxiliary objective to predict the length of a partial target sequence.
- Score: 49.192579824582694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For real-world deployment of automatic speech recognition (ASR), a
system should offer fast inference while keeping its computational requirements
modest. The recently proposed end-to-end ASR system based on
mask-predict with connectionist temporal classification (CTC), Mask-CTC,
fulfills this demand by generating tokens in a non-autoregressive fashion.
While Mask-CTC achieves remarkably fast inference speed, its recognition
performance falls behind that of conventional autoregressive (AR) systems. To
boost the performance of Mask-CTC, we first propose to enhance the encoder by
adopting the recently proposed Conformer architecture. Next, we propose new
training and decoding methods that introduce an auxiliary objective to predict
the length of a partial target sequence, which
allows the model to delete or insert tokens during inference. Experimental
results on different ASR tasks show that the proposed approaches improve
Mask-CTC significantly, outperforming a standard CTC model (15.5% $\rightarrow$
9.1% WER on WSJ). Moreover, Mask-CTC now achieves competitive results to AR
models with no degradation of inference speed ($<$ 0.1 RTF using CPU). We also
show a potential application of Mask-CTC to end-to-end speech translation.
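To make the decoding scheme concrete, here is a minimal sketch of the mask-predict refinement loop that Mask-CTC builds on, assuming NumPy inputs and a caller-supplied masked-LM callable `predictor`. The confidence threshold, iteration budget, and fill schedule are illustrative choices, and the paper's length-prediction objective for inserting and deleting tokens is not modeled here.

```python
import math
import numpy as np

BLANK, MASK = 0, 1  # assumed special token ids

def ctc_greedy(log_probs):
    """Collapse a (T, V) CTC log-prob matrix into tokens with confidences."""
    ids = log_probs.argmax(axis=1)
    tokens, confs, prev = [], [], BLANK
    for t, i in enumerate(ids):
        if i != BLANK and i != prev:
            tokens.append(int(i))
            confs.append(float(np.exp(log_probs[t, i])))
        prev = int(i)
    return tokens, confs

def mask_ctc_decode(log_probs, predictor, p_thres=0.99, iters=3):
    """Mask low-confidence CTC tokens, then refill them over `iters` passes."""
    tokens, confs = ctc_greedy(log_probs)
    seq = [tok if c >= p_thres else MASK for tok, c in zip(tokens, confs)]
    for it in range(iters):
        idx = [i for i, tok in enumerate(seq) if tok == MASK]
        if not idx:
            break
        probs = predictor(seq)                  # (L, V) token posteriors
        k = math.ceil(len(idx) / (iters - it))  # finish within the budget
        # fill the most confident masked positions first (mask-predict)
        for i in sorted(idx, key=lambda i: probs[i].max(), reverse=True)[:k]:
            seq[i] = int(probs[i].argmax())
    return seq
```

Since each pass fills many positions in parallel, the number of decoder calls stays a small constant instead of growing with the output length, which is where the NAR speedup comes from.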
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
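As a toy illustration of the entry above, the sketch below scans the collapsed CTC best path for biasing phrases. The actual CTC-WS method matches per-frame log-probabilities against a compact context graph, so the linear scan, the `score_thres` value, and the function name are all assumptions.

```python
import numpy as np

def spot_phrases(log_probs, phrases, blank=0, score_thres=-5.0):
    """Scan a (T, V) CTC matrix for biasing phrases (tuples of token ids)."""
    ids = log_probs.argmax(axis=1)
    path, prev = [], blank            # collapsed best path: (token, frame, logp)
    for t, i in enumerate(ids):
        if i != blank and i != prev:
            path.append((int(i), t, float(log_probs[t, i])))
        prev = int(i)
    hits = []
    for phrase in phrases:
        for s in range(len(path) - len(phrase) + 1):
            window = path[s:s + len(phrase)]
            if tuple(tok for tok, _, _ in window) == tuple(phrase):
                score = sum(lp for _, _, lp in window)
                if score >= score_thres:  # keep sufficiently confident matches
                    hits.append((tuple(phrase), window[0][1], score))
    return hits
```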
- Unimodal Aggregation for CTC-based Speech Recognition [7.6112706449833505]
A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token.
UMA learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity.
arXiv Detail & Related papers (2023-09-15T04:34:40Z)
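A rough sketch of the UMA idea from the entry above, assuming scalar per-frame weights produced by a sigmoid gate: segments are cut at weight valleys and the frames inside each segment are merged by a weighted average. The exact boundary rule in the paper may differ; treat this as illustrative only.

```python
import numpy as np

def unimodal_aggregate(frames, weights):
    """frames: (T, D), weights: (T,) in (0, 1) -> (N, D) token-level features."""
    # a weight valley (lower than both neighbours) starts a new segment
    bounds = [0] + [t for t in range(1, len(weights) - 1)
                    if weights[t] <= weights[t - 1] and weights[t] < weights[t + 1]]
    bounds.append(len(weights))
    out = []
    for s, e in zip(bounds, bounds[1:]):
        w = weights[s:e, None]
        out.append((frames[s:e] * w).sum(axis=0) / (w.sum() + 1e-8))
    return np.stack(out)
```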
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has attracted increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition under low-latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
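The entry above can be illustrated with a toy blockwise decoder: fixed-size feature blocks (plus a little right context) are encoded one at a time, and greedy CTC tokens are emitted per block. The `block_encoder` callable, block sizes, and greedy emission are assumptions, not the paper's exact blockwise-attention design.

```python
import numpy as np

def stream_decode(features, block_encoder, block=40, context=8, blank=0):
    """features: (T, D) -> greedy CTC tokens, emitted block by block."""
    out, prev = [], blank
    for start in range(0, len(features), block):
        chunk = features[start:start + block + context]  # block + lookahead
        log_probs = block_encoder(chunk)[:block]         # drop context frames
        for row in log_probs:
            i = int(row.argmax())
            if i != blank and i != prev:
                out.append(i)                            # standard CTC collapse
            prev = i
    return out
```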
- Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
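A hypothetical sketch of the on-demand pruning from the entry above: once intermediate layers are trained with CTC supervision, inference can simply stop after `depth` layers and decode from the shared head. The names `layers` and `proj` are assumed.

```python
import torch.nn.functional as F

def pruned_forward(x, layers, proj, depth):
    """Run only the first `depth` encoder layers, then the shared CTC head."""
    for layer in layers[:depth]:
        x = layer(x)
    return F.log_softmax(proj(x), dim=-1)  # (T, B, V) CTC log-probs
```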
- Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions [14.376418789524783]
We train a CTC-based ASR model with auxiliary CTC losses in intermediate layers in addition to the original CTC loss in the last layer.
Our method is easy to implement and retains the merits of CTC-based ASR: a simple model architecture and fast decoding speed.
arXiv Detail & Related papers (2021-04-06T18:00:03Z)
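The title of the entry above refers to conditioning upper layers on intermediate predictions (self-conditioned CTC). Below is a minimal PyTorch sketch, assuming the intermediate posteriors are projected back to the model dimension and added to the hidden states; the layer placement and weight sharing are simplified.

```python
import torch.nn as nn

class SelfConditionedBlock(nn.Module):
    def __init__(self, layer, vocab, d_model):
        super().__init__()
        self.layer = layer                         # any (T, B, D) -> (T, B, D) layer
        self.to_vocab = nn.Linear(d_model, vocab)  # shared with the CTC head in practice
        self.to_hidden = nn.Linear(vocab, d_model)

    def forward(self, x):
        h = self.layer(x)
        post = self.to_vocab(h).softmax(dim=-1)    # intermediate CTC posteriors
        return h + self.to_hidden(post)            # let upper layers condition on them
```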
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
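Both intermediate-CTC entries above reduce to the same training recipe: add CTC losses at intermediate encoder layers and interpolate them with the final loss. Here is a minimal PyTorch sketch, assuming an encoder that exposes per-layer outputs and a shared projection `proj`; the layer choice and the 0.3 weight are common settings in this line of work, not guaranteed to match either paper.

```python
import torch.nn.functional as F

def intermediate_ctc_loss(layer_outputs, proj, targets, in_lens, tgt_lens,
                          inter_layers=(5,), w=0.3):
    """CTC on the final layer plus averaged auxiliary CTC losses below it."""
    def ctc(h):  # h: (T, B, D) hidden states from one encoder layer
        log_probs = F.log_softmax(proj(h), dim=-1)
        return F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
    final = ctc(layer_outputs[-1])
    inter = sum(ctc(layer_outputs[i]) for i in inter_layers) / len(inter_layers)
    return (1 - w) * final + w * inter
```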
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding than a strong AR baseline, with only a 0.0-0.3 absolute CER degradation on the AISHELL-1 and AISHELL-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
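A rough sketch of the entry above, assuming a non-autoregressive `decoder` callable that returns per-position posteriors: the decoder input is the collapsed CTC greedy output rather than a sequence of mask tokens, and a single refinement pass produces the final hypothesis.

```python
import numpy as np

def ctc_enhanced_decode(ctc_log_probs, decoder, blank=0):
    """Collapse the CTC greedy path, then let a NAR decoder refine it."""
    draft, prev = [], blank
    for i in ctc_log_probs.argmax(axis=1):
        if i != blank and i != prev:
            draft.append(int(i))          # decoder input = CTC draft tokens
        prev = int(i)
    refined = decoder(draft)              # (L, V) refined posteriors
    return [int(row.argmax()) for row in refined]
```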