Relaxing the Conditional Independence Assumption of CTC-based ASR by
Conditioning on Intermediate Predictions
- URL: http://arxiv.org/abs/2104.02724v1
- Date: Tue, 6 Apr 2021 18:00:03 GMT
- Title: Relaxing the Conditional Independence Assumption of CTC-based ASR by
Conditioning on Intermediate Predictions
- Authors: Jumon Nozaki, Tatsuya Komatsu
- Abstract summary: We train a CTC-based ASR model with auxiliary CTC losses in intermediate layers in addition to the original CTC loss in the last layer.
Our method is easy to implement and retains the merits of CTC-based ASR: a simple model architecture and fast decoding speed.
- Score: 14.376418789524783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a method to relax the conditional independence assumption
of connectionist temporal classification (CTC)-based automatic speech
recognition (ASR) models. We train a CTC-based ASR model with auxiliary CTC
losses in intermediate layers in addition to the original CTC loss in the last
layer. During both training and inference, each generated prediction in the
intermediate layers is summed to the input of the next layer to condition the
prediction of the last layer on those intermediate predictions. Our method is
easy to implement and retains the merits of CTC-based ASR: a simple model
architecture and fast decoding speed. We conduct experiments on three different
ASR corpora. Our proposed method improves a standard CTC model significantly
(e.g., more than 20% relative word error rate reduction on the WSJ corpus)
with only a small computational overhead. Moreover, on the TEDLIUM2 and
AISHELL-1 corpora it achieves performance comparable to a strong
autoregressive model with beam search, while decoding at least 30 times
faster.
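The conditioning mechanism is simple enough to sketch. Below is a minimal PyTorch illustration of the idea, not the authors' implementation: intermediate CTC predictions from selected layers are projected back to the model dimension and added to the hidden states, and the intermediate CTC losses are averaged into the training objective. The layer counts, the linear fusion of the softmax output, and the weight w are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class SelfConditionedCTCEncoder(nn.Module):
    """Sketch of self-conditioned CTC: intermediate CTC predictions
    are fed back into the input of the following layer."""

    def __init__(self, num_layers=12, d_model=256, vocab_size=500,
                 cond_layers=(3, 6, 9)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4,
                                       dim_feedforward=1024,
                                       batch_first=True)
            for _ in range(num_layers))
        self.to_vocab = nn.Linear(d_model, vocab_size)    # shared CTC head
        self.from_vocab = nn.Linear(vocab_size, d_model)  # maps predictions back
        self.norm = nn.LayerNorm(d_model)
        self.cond_layers = set(cond_layers)

    def forward(self, x):
        # x: (batch, time, d_model) subsampled acoustic features
        inter_log_probs = []
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i in self.cond_layers and i < len(self.layers):
                log_p = self.to_vocab(x).log_softmax(dim=-1)
                inter_log_probs.append(log_p)
                # condition the remaining layers on this prediction
                x = self.norm(x + self.from_vocab(log_p.exp()))
        final_log_probs = self.to_vocab(x).log_softmax(dim=-1)
        return final_log_probs, inter_log_probs

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def total_loss(final_lp, inter_lps, targets, in_lens, tgt_lens, w=0.5):
    # nn.CTCLoss expects (time, batch, vocab) log-probabilities
    main = ctc(final_lp.transpose(0, 1), targets, in_lens, tgt_lens)
    inter = sum(ctc(lp.transpose(0, 1), targets, in_lens, tgt_lens)
                for lp in inter_lps) / len(inter_lps)
    return (1 - w) * main + w * inter
```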
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
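A toy sketch of the word-spotting idea above, assuming the biasing phrases are already tokenized: hypotheses walk a trie of context phrases frame-synchronously, accumulating CTC log-probabilities. The actual method matches log-probabilities against a compact context graph and merges detections with the greedy CTC output; the beam size, threshold, and all names below are invented for illustration.

```python
def build_trie(phrases):
    """phrases: list of token-id sequences from the biasing list."""
    trie = {}
    for toks in phrases:
        node = trie
        for t in toks:
            node = node.setdefault(t, {})
        node["end"] = True  # marks a complete phrase
    return trie

def ctc_word_spot(log_probs, trie, blank=0, beam=16, thr=-5.0):
    """Frame-synchronous matcher sketch. log_probs: (T, V) frame-level
    CTC log-probabilities (list of lists or a tensor). A hypothesis is
    (trie node, start frame, score); per frame it may stay on blank or
    advance one trie edge. Returns (start, end, score) hits."""
    hyps, hits = [], []
    for t, frame in enumerate(log_probs):
        new_hyps = [(trie, t, 0.0)]  # a phrase may start at any frame
        for node, s, score in hyps:
            new_hyps.append((node, s, score + float(frame[blank])))
            for tok, child in node.items():
                if tok != "end":
                    new_hyps.append((child, s, score + float(frame[tok])))
        new_hyps.sort(key=lambda h: h[2], reverse=True)
        hyps = new_hyps[:beam]       # prune to the best few hypotheses
        for node, s, score in hyps:
            if "end" in node and score / (t - s + 1) > thr:
                hits.append((s, t, score))
    return hits
```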
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
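A structural sketch of the two-encoder idea in the entry above: one encoder is supervised by CTC against the source transcript, and a second stacked encoder by CTC against the target translation. Dimensions and layer counts are placeholders, not the paper's configuration.

```python
import torch.nn as nn

class TwoEncoderNAST(nn.Module):
    """Sketch of a non-autoregressive speech translation model with
    two encoders, each guided by a CTC objective."""

    def __init__(self, d_model=256, src_vocab=500, tgt_vocab=500):
        super().__init__()
        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=6)
        self.acoustic_enc = make_encoder()  # speech -> source text
        self.semantic_enc = make_encoder()  # source repr. -> target text
        self.src_head = nn.Linear(d_model, src_vocab)
        self.tgt_head = nn.Linear(d_model, tgt_vocab)

    def forward(self, feats):
        h = self.acoustic_enc(feats)
        src_lp = self.src_head(h).log_softmax(-1)  # CTC vs. transcript
        z = self.semantic_enc(h)
        tgt_lp = self.tgt_head(z).log_softmax(-1)  # CTC vs. translation
        return src_lp, tgt_lp
```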
- InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss [43.39035144463951]
Momentum pseudo-labeling (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data.
CTC is well suited for MPL, or PL-based semi-supervised ASR in general, owing to its simple/fast inference algorithm and robustness against generating collapsed labels.
We propose to enhance MPL by introducing intermediate loss, inspired by the recent advances in CTC-based modeling.
arXiv Detail & Related papers (2022-11-02T00:18:25Z)
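The two ingredients of MPL are easy to sketch: an exponential-moving-average teacher and greedy CTC pseudo-labels. The snippet below reuses the output convention of the encoder sketch earlier on this page; the momentum coefficient and decoding details are assumptions, not the paper's settings.

```python
import torch

@torch.no_grad()
def momentum_update(teacher, student, alpha=0.999):
    """EMA update of the offline (teacher) model used to generate
    pseudo-labels in momentum pseudo-labeling."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(alpha).add_(ps, alpha=1.0 - alpha)

@torch.no_grad()
def pseudo_label(teacher, feats, blank=0):
    """Greedy CTC decoding of the teacher's output as pseudo-labels."""
    log_probs, _ = teacher(feats)        # (B, T, V), as in the sketch above
    ids = log_probs.argmax(dim=-1)       # best token per frame
    labels = []
    for seq in ids:
        out, prev = [], blank
        for tok in seq.tolist():
            if tok != blank and tok != prev:  # collapse repeats, drop blanks
                out.append(tok)
            prev = tok
        labels.append(out)
    return labels
```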
- Improving CTC-based ASR Models with Gated Interlayer Collaboration [9.930655347717932]
We present a Gated Interlayer Collaboration mechanism which introduces contextual information into the models.
We train the model with intermediate CTC losses calculated by the interlayer outputs of the model, in which the probability distributions of the intermediate layers naturally serve as soft label sequences.
arXiv Detail & Related papers (2022-05-25T03:21:27Z)
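One plausible reading of the gating idea above, sketched in PyTorch: embed the intermediate CTC distribution as a soft-label context vector and let a learned sigmoid gate interpolate it with the hidden states. The exact gating form in the paper may differ; this is an assumption.

```python
import torch
import torch.nn as nn

class InterlayerGate(nn.Module):
    """Sketch of gated fusion between a layer's hidden states and the
    context embedded from its intermediate CTC distribution."""

    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.embed = nn.Linear(vocab_size, d_model)  # soft-label embedding
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, hidden, ctc_probs):
        # hidden: (B, T, d_model); ctc_probs: (B, T, V) intermediate softmax
        context = self.embed(ctc_probs)
        g = torch.sigmoid(self.gate(torch.cat([hidden, context], dim=-1)))
        return g * hidden + (1.0 - g) * context
```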
- InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR [17.967459632339374]
We propose InterAug: a novel training method for CTC-based ASR using augmented intermediate representations for conditioning.
The proposed method exploits the conditioning framework of self-conditioned CTC to train robust models by conditioning with "noisy" intermediate predictions.
In experiments using augmentations simulating deletion, insertion, and substitution errors, we confirmed that the trained model acquires robustness to each error.
arXiv Detail & Related papers (2022-04-01T02:51:21Z)
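A simplified frame-level imitation of the augmentation described above: corrupt the intermediate softmax distributions before they are used for conditioning, overwriting random frames with blank (simulating deletions) or with random tokens (insertions/substitutions). The paper's augmentations operate on the predictions more carefully; the rates and mechanics here are assumptions.

```python
import torch
import torch.nn.functional as F

def interaug(probs, blank=0, p_del=0.05, p_err=0.10):
    """Corrupt intermediate CTC distributions before conditioning.
    probs: (B, T, V) intermediate softmax distributions."""
    B, T, V = probs.shape
    probs = probs.clone()
    # "deletions": overwrite random frames with a one-hot blank
    del_mask = torch.rand(B, T) < p_del
    probs[del_mask] = F.one_hot(torch.tensor(blank), V).float()
    # "insertions/substitutions": overwrite random frames with random tokens
    err_mask = torch.rand(B, T) < p_err
    rand_tok = torch.randint(1, V, (int(err_mask.sum()),))
    probs[err_mask] = F.one_hot(rand_tok, V).float()
    return probs
```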
- CTC Variations Through New WFST Topologies [79.94035631317395]
This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition.
Three new CTC variants are proposed: (1) the "compact-CTC", in which direct transitions between units are replaced with <epsilon> back-off transitions; (2) the "minimal-CTC", which only adds <blank> self-loops when used in WFST composition; and (3) the "selfless-CTC", which disallows self-loops for non-blank units.
arXiv Detail & Related papers (2021-10-06T23:00:15Z)
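For intuition, here is the standard CTC topology written out as plain WFST arcs, with a flag for the selfless variant; real implementations build these with OpenFst or k2, and the compact variant (routing direct unit-to-unit transitions through <epsilon> back-off arcs) is only noted in a comment. A sketch under those assumptions:

```python
def ctc_topology(num_tokens, blank=0, selfless=False):
    """Standard CTC topology as (src, dst, input_label, output_label)
    WFST arcs; output label 0 stands for <epsilon>. State 0 models
    <blank>; state i models token i."""
    arcs = [(0, 0, blank, 0)]                 # <blank> self-loop
    for i in range(1, num_tokens):
        arcs.append((0, i, i, i))             # enter token i, emit it once
        arcs.append((i, 0, blank, 0))         # leave via <blank>
        if not selfless:
            arcs.append((i, i, i, 0))         # repeated frames, no new output
        for j in range(1, num_tokens):
            if j != i:                        # direct unit-to-unit change;
                arcs.append((i, j, j, j))     # compact-CTC replaces these
    return arcs                               # with <epsilon> back-offs
```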
- Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
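Once intermediate CTC heads exist, depth-on-demand inference takes only a few lines: stop the encoder early and decode from the shared head at that depth. A sketch, reusing the SelfConditionedCTCEncoder illustration from earlier on this page:

```python
import torch

@torch.no_grad()
def early_exit_decode(model, feats, exit_layer):
    """Run only the first `exit_layer` encoder layers and decode
    greedily from the shared CTC head attached there."""
    x = feats
    for layer in model.layers[:exit_layer]:
        x = layer(x)
    log_probs = model.to_vocab(x).log_softmax(dim=-1)
    return log_probs.argmax(dim=-1)  # then collapse repeats, remove blanks
```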
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
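The objective behind this regularizer, which the main paper builds on, is a weighted interpolation of the last-layer CTC loss with a CTC loss attached to an intermediate layer. Assuming a single branch at the middle layer of an L-layer encoder and an interpolation weight w:

    L_total = (1 - w) * L_CTC(y | x_L) + w * L_CTC(y | x_{L/2})

Values of w around 0.3 are often cited in this line of work, though the exact setting is not stated in the summary above.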
- Improved Mask-CTC for Non-Autoregressive End-to-End ASR [49.192579824582694]
We improve the recently proposed non-autoregressive end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC).
First, we propose to enhance the network architecture by employing the recently proposed Conformer architecture.
Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence.
arXiv Detail & Related papers (2020-10-26T01:22:35Z)
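A toy version of the mask-predict refinement loop this line of work builds on: mask low-confidence tokens of the greedy CTC output, then refill them with a conditional masked language model, most confident positions first. `mlm_decoder` and `enc_out` are hypothetical stand-ins, and the length-prediction objective the entry mentions is not modeled here.

```python
import torch

@torch.no_grad()
def mask_ctc_refine(ctc_ids, ctc_conf, mlm_decoder, enc_out,
                    mask_id, threshold=0.9, iters=2):
    """ctc_ids/ctc_conf: (B, S) greedy CTC tokens and confidences.
    mlm_decoder(ys, enc_out) -> (B, S, V) logits (hypothetical)."""
    ys = ctc_ids.clone()
    ys[ctc_conf < threshold] = mask_id           # mask uncertain tokens
    per_iter = max(1, int((ys == mask_id).sum()) // iters)
    for _ in range(iters):
        masked = ys == mask_id
        if not masked.any():
            break
        conf, pred = mlm_decoder(ys, enc_out).softmax(-1).max(-1)
        conf[~masked] = -1.0                     # rank only masked slots
        k = min(per_iter, int(masked.sum()))
        idx = conf.view(-1).topk(k).indices      # most confident masks
        ys.view(-1)[idx] = pred.reshape(-1)[idx] # fill them in
    return ys
```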
This list is automatically generated from the titles and abstracts of the papers on this site.