Related papers: WST: Weakly Supervised Transducer for Automatic Speech Recognition

WST: Weakly Supervised Transducer for Automatic Speech Recognition

URL: http://arxiv.org/abs/2511.04035v1
Date: Thu, 06 Nov 2025 04:14:07 GMT
Title: WST: Weakly Supervised Transducer for Automatic Speech Recognition
Authors: Dongji Gao, Chenda Liao, Changliang Liu, Matthew Wiesner, Leibny Paola Garcia, Daniel Povey, Sanjeev Khudanpur, Jian Wu,
Abstract summary: WeaklySupervised Transducer (WST) is designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models.<n> Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%.
Score: 26.373816643181843
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.

Related papers

Neural Induction of Finite-State Transducers [13.274838371184432]
We propose a novel method for automatically constructing unweighted Finite-State Transducers (FSTs) following the hidden state geometry learned by a recurrent neural network.<n>We evaluate our methods on real-world datasets for morphological inflection, grapheme-to-phoneme prediction, and historical normalization, showing that the constructed FSTs are highly accurate and robust for many datasets.
arXiv Detail & Related papers (2026-01-16T00:26:57Z)
FAIM: Frequency-Aware Interactive Mamba for Time Series Classification [87.84511960413715]
Time series classification (TSC) is crucial in numerous real-world applications, such as environmental monitoring, medical diagnosis, and posture recognition.<n>We propose FAIM, a lightweight Frequency-Aware Interactive Mamba model.<n>We show that FAIM consistently outperforms existing state-of-the-art (SOTA) methods, achieving a superior trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2025-11-26T08:36:33Z)
HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation [19.997594859651233]
HENT-SRT is a novel framework that factorizes ASR and translation tasks to better handle reordering.<n>We improve computational efficiency by incorporating best practices from ASR transducers.<n>Our approach is evaluated on three conversational datasets Arabic, Spanish, and Mandarin.
arXiv Detail & Related papers (2025-06-02T18:37:50Z)
Efficient Test-Time Prompt Tuning for Vision-Language Models [41.90997623029582]
Self-TPT is a framework leveraging Self-supervised learning for efficient Test-time Prompt Tuning. We show that Self-TPT not only significantly reduces inference costs but also achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-11T13:55:58Z)
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts [44.16141704545044]
We present a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. The proposed algorithm improves the robustness and accuracy of ASR systems, particularly when working with imprecisely transcribed speech corpora.
arXiv Detail & Related papers (2023-06-01T14:56:19Z)
A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition [26.79184118279807]
We present a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs. We find that CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a 24x inference speedup.
arXiv Detail & Related papers (2023-04-15T18:34:29Z)
Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning. Transformer models have been adopted to deliver high prediction capacity because of the high computational self-attention mechanism. We propose an efficient Transformerbased model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z)
HFedMS: Heterogeneous Federated Learning with Memorable Data Semantics in Industrial Metaverse [49.1501082763252]
This paper presents HFEDMS for incorporating practical FL into the emerging Industrial Metaverse. It reduces data heterogeneity through dynamic grouping and training mode conversion. Then, it compensates for the forgotten knowledge by fusing compressed historical data semantics. Experiments have been conducted on the streamed non-i.i.d. FEMNIST dataset using 368 simulated devices.
arXiv Detail & Related papers (2022-11-07T04:33:24Z)
Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access [65.21117658030235]
This work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods. The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS) and average training overhead. Results reveal that the MBDL outperforms by a significant margin the SIC receiver with imperfect CSIR.
arXiv Detail & Related papers (2022-05-02T12:23:55Z)
Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels. We demonstrate that transducer-based ASR with CTC-like lattice achieves better results compared to standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training [55.824641135682725]
Domain adaptation experiments using WSJ as a source domain and TED-LIUM 3 as well as SWITCHBOARD show that up to 80% of the performance of a system trained on ground-truth data can be recovered.
arXiv Detail & Related papers (2020-11-26T18:51:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.