Less Peaky and More Accurate CTC Forced Alignment by Label Priors
- URL: http://arxiv.org/abs/2406.02560v3
- Date: Thu, 18 Jul 2024 18:28:45 GMT
- Title: Less Peaky and More Accurate CTC Forced Alignment by Label Priors
- Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur
- Abstract summary: Connectionist temporal classification (CTC) models are known to have peaky output distributions.
This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation.
Our CTC model produces less peaky posteriors and can more accurately predict token offsets in addition to their onsets.
- Score: 57.48450905027108
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., the phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and can more accurately predict token offsets in addition to their onsets. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit, the Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
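The training change described in the abstract can be illustrated in a few lines of PyTorch: the frame-level log-posteriors are discounted by a scaled log label prior before the CTC loss is computed, so alignment paths with fewer blanks score higher during training. This is a minimal sketch only, not the released TorchAudio recipe; the function names, the `prior_scale` weight, and the EMA-style prior update are illustrative assumptions.

```python
# Minimal sketch (not the released TorchAudio recipe): subtract a scaled log label
# prior from the frame-level log-posteriors before the CTC loss, so alignment paths
# with fewer blanks score higher. `prior_scale` and the EMA prior update are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def ctc_loss_with_label_priors(log_probs, targets, input_lengths, target_lengths,
                               log_priors, prior_scale=0.3, blank=0):
    """log_probs: (T, N, C) log-softmax outputs; log_priors: (C,) log label prior."""
    # Dividing posteriors by the prior (a subtraction in log space) leaves the scores
    # unnormalized; the CTC forward-backward computation still accepts them.
    adjusted = log_probs - prior_scale * log_priors.view(1, 1, -1)
    return F.ctc_loss(adjusted, targets, input_lengths, target_lengths,
                      blank=blank, zero_infinity=True)

@torch.no_grad()
def update_log_priors(log_probs, input_lengths, log_priors, momentum=0.999):
    """Re-estimate the label prior from the current batch's average posteriors."""
    T, N, C = log_probs.shape
    valid = (torch.arange(T, device=log_probs.device).unsqueeze(1)
             < input_lengths.unsqueeze(0))                        # (T, N) frame mask
    batch_prior = (log_probs.exp() * valid.unsqueeze(-1)).sum(dim=(0, 1)) / valid.sum()
    new_prior = momentum * log_priors.exp() + (1.0 - momentum) * batch_prior
    return (new_prior / new_prior.sum()).log()
```

In such a setup the prior would typically be initialized to uniform and re-estimated once per training step, after the loss has been computed on the prior-adjusted scores.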
Related papers
- Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing [87.48628403354351]
Certification for machine learning amounts to proving that no adversarial sample can evade a model within a given perturbation range under certain conditions.
Common certification methods for segmentation use a flat set of fine-grained classes, leading to high abstain rates due to model uncertainty.
We propose a novel, more practical setting, which certifies pixels within a multi-level hierarchy, and adaptively relaxes the certification to a coarser level for unstable components.
arXiv Detail & Related papers (2024-02-13T11:59:43Z)
- Self-distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach [14.69981874614434]
We show how to better optimize a text recognition model from the perspective of loss functions.
CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation.
We propose a self-distillation scheme for CTC-based model to address this issue.
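The summary does not spell out the exact distillation target, so the snippet below is a generic, hypothetical illustration of self-distillation regularization on a CTC model, in which frame posteriors are pulled toward an EMA "teacher" copy of the same network; the paper's actual scheme may differ, and the `alpha` weight and EMA decay are assumptions.

```python
# Generic, hypothetical self-distillation regularizer on top of a CTC objective;
# the paper's exact scheme may differ. The EMA "teacher" and alpha weight are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def self_distilled_ctc_loss(model, teacher, feats, targets, in_lens, tgt_lens, alpha=0.5):
    log_probs = model(feats).log_softmax(-1).transpose(0, 1)        # (T, N, C)
    with torch.no_grad():
        teacher_probs = teacher(feats).softmax(-1).transpose(0, 1)  # frozen soft targets
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0, zero_infinity=True)
    kd = F.kl_div(log_probs, teacher_probs, reduction="batchmean")  # frame-level distillation
    return ctc + alpha * kd

@torch.no_grad()
def ema_update(teacher, model, decay=0.999):
    """Slowly track the student's weights to form the self-distillation teacher."""
    for t, s in zip(teacher.parameters(), model.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
```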
arXiv Detail & Related papers (2023-08-17T06:32:57Z)
- ProTeCt: Prompt Tuning for Taxonomic Open Set Classification [59.59442518849203]
Few-shot adaptation methods do not fare well in the taxonomic open set (TOS) setting.
We propose Prompt Tuning for Hierarchical Consistency (ProTeCt), a prompt tuning technique that calibrates the hierarchical consistency of model predictions across label set granularities.
arXiv Detail & Related papers (2023-06-04T02:55:25Z)
- Pre-training for Speech Translation: CTC Meets Optimal Transport [29.807861658249923]
We show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design.
We propose a novel pre-training method combining CTC and optimal transport to further reduce this gap.
Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space.
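As a rough sketch of how the two encoders' outputs could be pulled together in Wasserstein space, the snippet below computes an entropic optimal-transport (Sinkhorn) cost between one acoustic sequence and one textual sequence; the function name, uniform marginals, and hyperparameters are assumptions rather than the paper's actual implementation. Minimizing this cost alongside CTC drives the two representations toward each other.

```python
# Rough illustration (assumed function name and hyperparameters, not the paper's
# implementation): an entropic optimal-transport (Sinkhorn) cost between one
# speech-encoder output sequence and one text-encoder output sequence, with
# uniform marginals over frames and tokens.
import torch

def sinkhorn_ot_cost(speech_feats, text_feats, eps=0.1, n_iters=50):
    """speech_feats: (T, D) acoustic encoder outputs; text_feats: (U, D) text encoder outputs."""
    cost = torch.cdist(speech_feats, text_feats, p=2)    # (T, U) pairwise distances
    T, U = cost.shape
    mu = torch.full((T,), 1.0 / T, device=cost.device)   # uniform marginal over frames
    nu = torch.full((U,), 1.0 / U, device=cost.device)   # uniform marginal over tokens
    K = torch.exp(-cost / eps)                           # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                             # Sinkhorn fixed-point updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    transport = torch.diag(u) @ K @ torch.diag(v)        # approximate transport plan
    return (transport * cost).sum()                      # entropic Wasserstein cost
```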
arXiv Detail & Related papers (2023-01-27T14:03:09Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows speech to be better recognized in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM [39.03817586745041]
We propose an error correction method using a phone-conditioned masked LM (PC-MLM).
Since both CTC and PC-MLM are non-autoregressive models, the method enables fast LM integration.
arXiv Detail & Related papers (2022-09-08T23:42:37Z)
- Efficient One Pass Self-distillation with Zipf's Label Smoothing [12.626049767353386]
Self-distillation exploits non-uniform soft supervision from itself during training and improves performance without any runtime cost.
This paper proposes Zipf's Label Smoothing (Zipf's LS), which uses the on-the-fly prediction of a network to generate soft supervision that conforms to Zipf distribution.
Our technique achieves a +3.61% accuracy gain over the vanilla baseline, and a further 0.88% gain over previous label smoothing and self-distillation strategies.
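For illustration, the snippet below shows one way such Zipf-shaped soft targets could be built from the network's own on-the-fly class ranking; it is a hypothetical sketch under that reading of the summary, not necessarily the authors' exact formulation.

```python
# Hypothetical sketch of Zipf-style label smoothing: non-target classes receive soft
# probability mass proportional to 1/rank, where the ranking comes from the model's
# own on-the-fly predictions (standard label smoothing would use uniform mass).
import torch
import torch.nn.functional as F

def zipf_label_smoothing_loss(logits, targets, epsilon=0.1):
    N, C = logits.shape
    # Rank of every class under the current prediction: rank 1 = most likely.
    ranks = logits.argsort(dim=1, descending=True).argsort(dim=1) + 1
    zipf = 1.0 / ranks.float()
    zipf.scatter_(1, targets.unsqueeze(1), 0.0)     # no extra mass on the target class
    zipf = zipf / zipf.sum(dim=1, keepdim=True)     # normalize the Zipf tail to 1
    soft = torch.zeros_like(logits).scatter_(1, targets.unsqueeze(1), 1.0 - epsilon)
    soft = soft + epsilon * zipf                    # (1 - eps) on target, eps spread by Zipf
    return -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```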
arXiv Detail & Related papers (2022-07-26T15:40:16Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- CTC-synchronous Training for Monotonic Attention Model [43.0382262234792]
In monotonic attention models, backward probabilities cannot be leveraged in the alignment process during training due to the left-to-right dependency in the decoder.
We propose CTC-synchronous training (CTC-ST), in which MoChA uses CTC alignments to learn optimal monotonic alignments.
The entire model is jointly optimized so that the expected boundaries from MoChA are synchronized with the alignments.
arXiv Detail & Related papers (2020-05-10T16:48:23Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.