RNN-Transducer-based Losses for Speech Recognition on Noisy Targets
- URL: http://arxiv.org/abs/2504.06963v1
- Date: Wed, 09 Apr 2025 15:18:29 GMT
- Title: RNN-Transducer-based Losses for Speech Recognition on Noisy Targets
- Authors: Vladimir Bataev,
- Abstract summary: We introduce novel loss functions to mitigate the impact of transcription errors in RNN-Transducer models.<n>Our Star-Transducer loss addresses deletion errors by incorporating "skip frame" transitions in the loss lattice.<n>The Bypass-Transducer loss uses "skip token" transitions to tackle insertion errors, recovering more than 60% of the quality.
- Score: 1.4685355149711303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training speech recognition systems on noisy transcripts is a significant challenge in industrial pipelines, where datasets are enormous and ensuring accurate transcription for every instance is difficult. In this work, we introduce novel loss functions to mitigate the impact of transcription errors in RNN-Transducer models. Our Star-Transducer loss addresses deletion errors by incorporating "skip frame" transitions in the loss lattice, restoring over 90% of the system's performance compared to models trained with accurate transcripts. The Bypass-Transducer loss uses "skip token" transitions to tackle insertion errors, recovering more than 60% of the quality. Finally, the Target-Robust Transducer loss merges these approaches, offering robust performance against arbitrary errors. Experimental results demonstrate that the Target-Robust Transducer loss significantly improves RNN-T performance on noisy data by restoring over 70% of the quality compared to well-transcribed data.
Related papers
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
Solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals.
Currently, the separator produces artefacts which often degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
arXiv Detail & Related papers (2024-06-13T08:20:58Z) - Human Transcription Quality Improvement [2.24166568188073]
We introduce two mechanisms to improve transcription quality: confidence estimation based reprocessing at labeling stage, and automatic word error correction at post-labeling stage.
We collect and release LibriCrowd - a large-scale crowdsourced dataset of audio transcriptions on 100 hours of English speech.
arXiv Detail & Related papers (2023-09-24T03:39:43Z) - Enhancing Noise-Robust Losses for Large-Scale Noisy Data Learning [0.0]
Large annotated datasets inevitably contain noisy labels, which poses a major challenge for training deep neural networks as they easily memorize the labels.<n>Noise-robust loss functions have emerged as a notable strategy to counteract this issue, but it remains challenging to create a robust loss function which is not susceptible to underfitting.<n>We propose a novel method denoted as logit bias, which adds a real number $epsilon$ to the logit at the position of the correct class.
arXiv Detail & Related papers (2023-06-08T18:38:55Z) - Powerful and Extensible WFST Framework for RNN-Transducer Losses [71.56212119508551]
This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss.
Existing implementations of RNN-T use-related code, which is hard to extend and debug.
We introduce two WFST-powered RNN-T implementations: "Compose-Transducer" and "Grid-Transducer"
arXiv Detail & Related papers (2023-03-18T10:36:33Z) - Boosting Facial Expression Recognition by A Semi-Supervised Progressive
Teacher [54.50747989860957]
We propose a semi-supervised learning algorithm named Progressive Teacher (PT) to utilize reliable FER datasets as well as large-scale unlabeled expression images for effective training.
Experiments on widely-used databases RAF-DB and FERPlus validate the effectiveness of our method, which achieves state-of-the-art performance with accuracy of 89.57% on RAF-DB.
arXiv Detail & Related papers (2022-05-28T07:47:53Z) - Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with CTC-like lattice achieves better results compared to standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z) - Semantic Perturbations with Normalizing Flows for Improved
Generalization [62.998818375912506]
We show that perturbations in the latent space can be used to define fully unsupervised data augmentations.
We find that our latent adversarial perturbations adaptive to the classifier throughout its training are most effective.
arXiv Detail & Related papers (2021-08-18T03:20:00Z) - Transitive Learning: Exploring the Transitivity of Degradations for
Blind Super-Resolution [89.4784684863403]
We propose a novel Transitive Learning method for blind Super-Resolution on transitive degradations (TLSR)
We analyze and demonstrate the transitivity of degradations, including the widely used additive and convolutive degradations.
We show that the proposed TLSR achieves superior performance and consumes less time against the state-of-the-art blind SR methods.
arXiv Detail & Related papers (2021-03-29T02:51:09Z) - Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates target sequence by refining predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding speed than a strong AR baseline with only 0.0 0.3 absolute CER degradation on Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z) - Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z) - Improved Natural Language Generation via Loss Truncation [29.676561106319173]
We show that distinguishability serves as a principled and robust alternative for handling invalid references.
We propose loss truncation, which adaptively removes high loss examples during training.
We show this is as easy to optimize as log loss and tightly bounds distinguishability under noise.
arXiv Detail & Related papers (2020-04-30T05:31:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.