Related papers: ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

URL: http://arxiv.org/abs/2602.18721v1
Date: Sat, 21 Feb 2026 05:04:22 GMT
Title: ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models
Authors: Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang,
Abstract summary: ReHear is a framework for iterative pseudo-label refinement in automatic speech recognition.<n>It integrates an instruction-tuned, audio-aware large language model into the self-training loop.<n>We show that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
Score: 12.527207210862151
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. Unlike conventional text-based correctors, our approach conditions the LLM on both the ASR hypothesis and the source audio, allowing it to recover phonetically accurate transcripts even from severe recognition errors. These refined pseudo-labels serve as high-fidelity targets for fine-tuning the ASR model in an iterative cycle. Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.

Related papers

Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses [71.34350093068473]
This paper introduces a new paradigm for generative error correction (GER) framework in audio-visual speech recognition (AVSR)<n>Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models.<n>Our framework attains up to 57.7% error rate gain on the LRS2 benchmark over standard ASR baseline, contrary to single-stream GER approaches that achieve only 10% gain.
arXiv Detail & Related papers (2025-10-15T08:27:16Z)
Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition [61.712328155788434]
Real-world systems encounter unseen accents and domains with limited labeled data.<n> pseudo-labeling often introduces systematic, accent-specific errors that filtering fails to fix.<n>We propose a simple parameter-space correction to correct these recurring biases without target ground truth.
arXiv Detail & Related papers (2025-10-09T10:31:47Z)
Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices [8.77712061194924]
We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models. Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations. We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
arXiv Detail & Related papers (2024-09-24T21:42:25Z)
Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [52.624909026294105]
We propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses. The proposed system reduces the error rate by 21% compared with the ASR model.
arXiv Detail & Related papers (2024-06-29T17:56:28Z)
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses. LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
Error Correction in ASR using Sequence-to-Sequence Models [32.41875780785648]
Post-editing in Automatic Speech Recognition entails automatically correcting common and systematic errors produced by the ASR system. We propose to use a powerful pre-trained sequence-to-sequence model, BART, to serve as a denoising model. Experimental results on accented speech data demonstrate that our strategy effectively rectifies a significant number of ASR errors.
arXiv Detail & Related papers (2022-02-02T17:32:59Z)
A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems [42.05399301143457]
We introduce a light-weight contextual spelling correction model to correct context-related recognition errors. Experiments show that the model improves baseline ASR model performance with about 50% relative word error rate reduction. The model also shows excellent performance for out-of-vocabulary terms not seen during training.
arXiv Detail & Related papers (2021-08-17T08:14:37Z)
Semi-Supervised Speech Recognition via Graph-based Temporal Classification [59.58318952000571]
Semi-supervised learning has demonstrated promising results in automatic speech recognition by self-training. The effectiveness of this approach largely relies on the pseudo-label accuracy. Alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance.
arXiv Detail & Related papers (2020-10-29T14:56:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.