Efficient CTC Regularization via Coarse Labels for End-to-End Speech
Translation
- URL: http://arxiv.org/abs/2302.10871v1
- Date: Tue, 21 Feb 2023 18:38:41 GMT
- Title: Efficient CTC Regularization via Coarse Labels for End-to-End Speech
Translation
- Authors: Biao Zhang and Barry Haddow and Rico Sennrich
- Abstract summary: We re-examine the need for genuine vocabulary labels when using Connectionist Temporal Classification (CTC) for regularization.
We propose coarse labeling for CTC (CoLaCTC), which merges vocabulary labels via simple rules such as truncation, division or modulo (MOD) operations.
We show that CoLaCTC successfully generalizes to CTC regularization regardless of using transcript or translation for labeling.
- Score: 48.203394370942505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For end-to-end speech translation, regularizing the encoder with the
Connectionist Temporal Classification (CTC) objective using the source
transcript or target translation as labels can greatly improve quality metrics.
However, CTC demands an extra prediction layer over the vocabulary space,
bringing in nonnegligible model parameters and computational overheads,
although this layer is typically not used for inference. In this paper, we
re-examine the need for genuine vocabulary labels for CTC for regularization
and explore strategies to reduce the CTC label space, targeting improved
efficiency without quality degradation. We propose coarse labeling for CTC
(CoLaCTC), which merges vocabulary labels via simple heuristic rules such as
truncation, division or modulo (MOD) operations. Despite its simplicity, our
experiments on 4 source and 8 target languages show that CoLaCTC, particularly
with MOD, can compress the label space aggressively to 256 and even further,
gaining training efficiency (a 1.18x to 1.77x speedup depending on the original
vocabulary size) yet still delivering comparable or better performance than the
CTC baseline. We also show that CoLaCTC successfully generalizes to CTC
regularization regardless of using transcript or translation for labeling.
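The sketch below is a minimal illustration of the idea, not the authors' released implementation. It assumes a PyTorch-style setup in which transcript (or translation) token IDs are mapped into a small coarse label space with the MOD rule (256 labels here, with label 0 assumed reserved for the CTC blank), and a lightweight projection feeds an auxiliary CTC loss that regularizes the encoder during training only. Names such as `coarse_labels_mod` and `coarse_ctc_regularizer` are invented for illustration.

```python
# Minimal sketch (assumption: PyTorch-style pseudocode, not the authors' code) of
# CoLaCTC-style coarse labeling: vocabulary token IDs are folded into a much smaller
# label space (here 256, via modulo), and an auxiliary CTC loss over that coarse space
# regularizes the speech encoder during training; the layer is dropped at inference.
import torch
import torch.nn.functional as F

COARSE_SIZE = 256  # assumed coarse label space; the paper compresses to 256 "and even further"

def coarse_labels_mod(token_ids: torch.Tensor, coarse_size: int = COARSE_SIZE) -> torch.Tensor:
    """Map genuine vocabulary IDs to coarse CTC labels with the MOD rule.

    Label 0 is assumed reserved for the CTC blank, so real tokens land in 1..coarse_size-1.
    """
    return token_ids % (coarse_size - 1) + 1

def coarse_ctc_regularizer(encoder_states, coarse_proj, transcript_ids,
                           input_lengths, target_lengths):
    """Auxiliary CTC loss over the coarse label space.

    encoder_states: (T, B, H) encoder outputs; coarse_proj: a small Linear(H, COARSE_SIZE)
    standing in for the full vocabulary-sized CTC softmax layer.
    """
    log_probs = coarse_proj(encoder_states).log_softmax(dim=-1)   # (T, B, COARSE_SIZE)
    targets = coarse_labels_mod(transcript_ids)                   # (B, U) coarse labels
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

# Hypothetical usage: total loss = translation cross-entropy + lambda * coarse CTC loss.
if __name__ == "__main__":
    T, B, H, U = 50, 2, 8, 12
    enc = torch.randn(T, B, H)
    proj = torch.nn.Linear(H, COARSE_SIZE)
    transcript = torch.randint(1, 32000, (B, U))                  # IDs from a 32k vocabulary
    loss = coarse_ctc_regularizer(enc, proj, transcript,
                                  input_lengths=torch.full((B,), T, dtype=torch.long),
                                  target_lengths=torch.full((B,), U, dtype=torch.long))
    print(loss.item())
```

In this view, the efficiency gain comes from replacing the vocabulary-sized CTC prediction layer (e.g. tens of thousands of labels) with a 256-way projection, while the rest of the speech translation model is left unchanged.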
Related papers
- CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation [36.417792361080615]
We introduce a methodology named Connectionist Temporal Classification guided modality matching (CTC-GMM).
This technique employs CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence.
Our evaluations with FLEURS and CoVoST2 show that the CTC-GMM approach can increase translation accuracy relatively by 13.9% and 6.4% respectively.
arXiv Detail & Related papers (2024-10-07T15:58:03Z) - CR-CTC: Consistency regularization on CTC for improved speech recognition [18.996929774821822]
Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR).
However, it often falls short in recognition performance compared to transducer models or systems combining CTC and attention-based encoder-decoder (CTC/AED).
We propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram.
arXiv Detail & Related papers (2024-10-07T14:56:07Z) - Self-distillation Regularized Connectionist Temporal Classification Loss
for Text Recognition: A Simple Yet Effective Approach [14.69981874614434]
We show how to better optimize a text recognition model from the perspective of loss functions.
CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation.
We propose a self-distillation scheme for CTC-based models to address this issue.
arXiv Detail & Related papers (2023-08-17T06:32:57Z) - CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$.
arXiv Detail & Related papers (2023-05-27T03:54:09Z) - CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z) - CTC Variations Through New WFST Topologies [79.94035631317395]
This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition.
Three new CTC variants are proposed: (1) the "compact-CTC", in which direct transitions between units are replaced with <epsilon> back-off transitions; (2) the "minimal-CTC", which only adds <blank> self-loops when used in WFST composition; and (3) the "selfless-CTC", which disallows self-loops for non-blank units.
arXiv Detail & Related papers (2021-10-06T23:00:15Z) - Investigating the Reordering Capability in CTC-based Non-Autoregressive
End-to-End Speech Translation [62.943925893616196]
We study the possibilities of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC).
CTC's success on translation is counter-intuitive due to its monotonicity assumption, so we analyze its reordering capability.
Our analysis shows that transformer encoders have the ability to change the word order.
arXiv Detail & Related papers (2021-05-11T07:48:45Z) - Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus, respectively.
arXiv Detail & Related papers (2021-02-05T15:01:03Z) - Reducing Spelling Inconsistencies in Code-Switching ASR using
Contextualized CTC Loss [5.707652271634435]
We propose Contextualized Connectionist Temporal Classification (CCTC) loss to encourage spelling consistencies.
CCTC loss does not require frame-level alignments, since the context ground truth is obtained from the model's estimated path.
Compared to the same model trained with regular CTC loss, our method consistently improved the ASR performance on both CS and monolingual corpora.
arXiv Detail & Related papers (2020-05-16T09:36:58Z)