Non-autoregressive Error Correction for CTC-based ASR with
Phone-conditioned Masked LM
- URL: http://arxiv.org/abs/2209.04062v1
- Date: Thu, 8 Sep 2022 23:42:37 GMT
- Title: Non-autoregressive Error Correction for CTC-based ASR with
Phone-conditioned Masked LM
- Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke
Sakai, Tatsuya Kawahara
- Abstract summary: We propose an error correction method with phone-conditioned masked LM (PC-MLM).
Since both CTC and PC-MLM are non-autoregressive models, the method enables fast LM integration.
- Score: 39.03817586745041
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Connectionist temporal classification (CTC)-based models are attractive in
automatic speech recognition (ASR) because of their non-autoregressive nature.
To take advantage of text-only data, language model (LM) integration approaches
such as rescoring and shallow fusion have been widely used for CTC. However,
they lose CTC's non-autoregressive nature because of the need for beam search,
which slows down the inference speed. In this study, we propose an error
correction method with phone-conditioned masked LM (PC-MLM). In the proposed
method, less confident word tokens in a greedy decoded output from CTC are
masked. PC-MLM then predicts these masked word tokens given unmasked words and
phones supplementally predicted from CTC. We further extend it to Deletable
PC-MLM in order to address insertion errors. Since both CTC and PC-MLM are
non-autoregressive models, the method enables fast LM integration. Experimental
evaluations on the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM2 in a
domain adaptation setting show that the proposed method outperforms rescoring
and shallow fusion in terms of inference speed, and also in terms of
recognition accuracy on CSJ.
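To make the decode-time flow concrete, below is a minimal Python sketch of
confidence-based masking followed by a single parallel PC-MLM pass. The models
themselves are out of scope here, so `fill_masks` is a hypothetical stand-in
for one PC-MLM forward pass and the threshold is an assumed hyperparameter:

```python
# Minimal sketch of the correction flow at inference time (hypothetical
# interfaces; a trained CTC model supplies words, confidences, and phones).

MASK = "<mask>"

def mask_low_confidence(words, confidences, threshold=0.9):
    """Mask word tokens whose CTC confidence falls below `threshold`."""
    return [w if c >= threshold else MASK for w, c in zip(words, confidences)]

def correct(words, confidences, phones, fill_masks, threshold=0.9):
    """Non-autoregressive correction: mask, then fill every mask in one
    pass, conditioned on the unmasked words and the CTC-predicted phones."""
    masked = mask_low_confidence(words, confidences, threshold)
    return fill_masks(masked, phones)

# Toy usage with a dummy fill function (a real PC-MLM predicts here):
words = ["the", "cap", "sat"]
confs = [0.98, 0.42, 0.95]                      # word-level CTC confidences
phones = "dh ah k ae t s ae t".split()          # supplementary phone output
dummy_fill = lambda toks, ph: ["cat" if t == MASK else t for t in toks]
print(correct(words, confs, phones, dummy_fill))  # ['the', 'cat', 'sat']
```

Because every mask is filled in one parallel pass, the correction step adds a
single forward computation rather than a beam search over the LM.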
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate significant acceleration of context-biasing recognition with simultaneous improvements in F-score and WER.
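As a rough illustration of matching against CTC log-probabilities, the sketch
below scores one candidate phrase with a best-path (Viterbi) pass over the
standard CTC topology. The paper's compact context graph and detection logic
are considerably more involved, so treat this as a per-phrase simplification:

```python
import numpy as np

def best_path_score(log_probs, labels, blank=0):
    """Viterbi alignment score of `labels` against frame-level CTC
    log-probabilities (T x V), using the blank-interleaved CTC topology."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    T, S = len(log_probs), len(ext)
    v = np.full((T, S), -np.inf)
    v[0, 0] = log_probs[0, ext[0]]
    v[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            best = v[t - 1, s]                       # stay
            if s > 0:
                best = max(best, v[t - 1, s - 1])    # advance
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                best = max(best, v[t - 1, s - 2])    # skip a blank
            v[t, s] = best + log_probs[t, ext[s]]
    return max(v[-1, -1], v[-1, -2])                 # end in label or blank

# Toy usage: 4 frames, vocabulary {0: blank, 1: 'a', 2: 'b'}.
logp = np.log([[0.1, 0.8, 0.1],
               [0.6, 0.3, 0.1],
               [0.1, 0.1, 0.8],
               [0.8, 0.1, 0.1]])
print(best_path_score(logp, [1, 2]))  # higher score = stronger "ab" match
```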
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- Self-distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach [14.69981874614434]
We show how to better optimize a text recognition model from the perspective of loss functions.
CTC-based methods, widely used in practice for their good balance between performance and inference speed, still suffer from accuracy degradation.
We propose a self-distillation scheme for CTC-based model to address this issue.
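One generic way to realize such a scheme (not necessarily the paper's exact
formulation) is to add a frame-level KL term that pulls the student's CTC
posteriors toward a teacher derived from the model itself, e.g. an EMA copy
or a sub-network; a minimal sketch under those assumptions:

```python
import torch
import torch.nn.functional as F

def self_distilled_ctc_loss(student_logp, teacher_logp, targets,
                            in_lens, tgt_lens, alpha=0.1, blank=0):
    """CTC loss plus a frame-level KL term toward `teacher_logp` (e.g. an
    EMA copy or sub-network of the same model). Generic sketch only."""
    ctc = F.ctc_loss(student_logp, targets, in_lens, tgt_lens, blank=blank)
    kl = F.kl_div(student_logp, teacher_logp.detach(),
                  log_target=True, reduction="batchmean")
    return ctc + alpha * kl

# Toy shapes: T frames, N utterances, C classes, target length L.
T, N, C, L = 50, 2, 30, 10
student = torch.randn(T, N, C).log_softmax(-1)
teacher = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, L))
lens = torch.full((N,), T, dtype=torch.long)
tlens = torch.full((N,), L, dtype=torch.long)
print(self_distilled_ctc_loss(student, teacher, targets, lens, tlens))
```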
arXiv Detail & Related papers (2023-08-17T06:32:57Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×.
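A schematic of the two-encoder objective, assuming each encoder emits
frame-level log-probabilities and is supervised with CTC against the source
transcript and the target translation respectively (the weighting and the
chained architecture are simplifications of the paper's design):

```python
import torch.nn.functional as F

def two_encoder_ctc_loss(src_logp, tgt_logp, src_text, tgt_text,
                         src_frames, tgt_frames, src_lens, tgt_lens,
                         lam=0.5):
    """Schematic NAST-style objective: one CTC loss guides the first
    encoder toward the source text, another guides the second encoder
    toward the target text."""
    l_src = F.ctc_loss(src_logp, src_text, src_frames, src_lens)
    l_tgt = F.ctc_loss(tgt_logp, tgt_text, tgt_frames, tgt_lens)
    return lam * l_src + (1.0 - lam) * l_tgt
```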
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
- CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
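The joint framework's training objective is commonly written as an
interpolation of the CTC and attention (cross-entropy) losses; a minimal
sketch, with the weight `lam` as an assumed hyperparameter:

```python
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_logp, dec_logits, targets,
                             in_lens, tgt_lens, lam=0.3):
    """Hybrid objective L = lam * L_ctc + (1 - lam) * L_att, where the
    attention branch is an autoregressive decoder trained with CE."""
    l_ctc = F.ctc_loss(ctc_logp, targets, in_lens, tgt_lens)
    # cross_entropy expects (N, C, L) logits against (N, L) targets
    l_att = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return lam * l_ctc + (1.0 - lam) * l_att
```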
arXiv Detail & Related papers (2022-10-11T07:13:50Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
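The idea reduces to attaching an extra CTC loss at an intermediate encoder
layer and interpolating it with the final-layer loss; a minimal sketch, with
the interpolation weight `w` as an assumed hyperparameter:

```python
import torch.nn.functional as F

def intermediate_ctc_loss(final_logp, inter_logp, targets,
                          in_lens, tgt_lens, w=0.5):
    """Interpolate the usual final-layer CTC loss with an auxiliary CTC
    loss computed from an intermediate encoder layer's predictions."""
    l_final = F.ctc_loss(final_logp, targets, in_lens, tgt_lens)
    l_inter = F.ctc_loss(inter_logp, targets, in_lens, tgt_lens)
    return (1.0 - w) * l_final + w * l_inter
```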
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
- Improved Mask-CTC for Non-Autoregressive End-to-End ASR [49.192579824582694]
We build on the recently proposed Mask-CTC, an end-to-end ASR framework based on mask-predict decoding with connectionist temporal classification (CTC).
We propose to enhance the network architecture by employing the recently proposed Conformer.
Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence.
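For context, mask-predict decoding refines a CTC hypothesis by repeatedly
re-masking its least confident tokens and filling them in parallel; a
schematic loop, with `mlm_fill` as a hypothetical stand-in for one parallel
masked-LM pass returning tokens and their confidences:

```python
MASK = "<mask>"

def mask_predict_decode(tokens, confidences, mlm_fill, n_iter=3, ratio=0.5):
    """Schematic mask-predict refinement: re-mask the least confident
    tokens, fill them with one parallel MLM pass, and anneal the ratio."""
    for _ in range(n_iter):
        k = max(1, int(ratio * len(tokens)))
        worst = sorted(range(len(tokens)), key=lambda i: confidences[i])[:k]
        masked = [MASK if i in worst else t for i, t in enumerate(tokens)]
        tokens, confidences = mlm_fill(masked)   # hypothetical MLM pass
        ratio *= 0.5
    return tokens
```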
arXiv Detail & Related papers (2020-10-26T01:22:35Z)
- Reducing Spelling Inconsistencies in Code-Switching ASR using Contextualized CTC Loss [5.707652271634435]
We propose Contextualized Connectionist Temporal Classification (CCTC) loss to encourage spelling consistencies.
CCTC loss does not require frame-level alignments, since the context ground truth is obtained from the model's estimated path.
Compared to the same model trained with the regular CTC loss, our method consistently improves ASR performance on both code-switching (CS) and monolingual corpora.
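Since the context ground truth comes from the model's own best path, target
construction needs no external alignment; the helper below derives
right-context targets from a per-frame argmax path (a schematic reading of
the idea, while the published loss also covers left context and further
details):

```python
def right_context_targets(best_path, blank=0):
    """For each frame, the next non-blank label in the model's estimated
    path; these serve as targets for an auxiliary context-prediction head
    trained with cross-entropy alongside the regular CTC loss."""
    targets, nxt = [blank] * len(best_path), blank
    for t in range(len(best_path) - 1, -1, -1):
        targets[t] = nxt
        if best_path[t] != blank:
            nxt = best_path[t]
    return targets

path = [0, 3, 3, 0, 5, 0, 0, 7]      # per-frame argmax labels, 0 = blank
print(right_context_targets(path))    # [3, 3, 3, 5, 7, 7, 7, 0]
```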
arXiv Detail & Related papers (2020-05-16T09:36:58Z)