Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs
- URL: http://arxiv.org/abs/2509.15095v2
- Date: Sat, 20 Sep 2025 12:01:56 GMT
- Title: Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs
- Authors: Yutong Liu, Ziyue Zhang, Cheng Huang, Yongbin Yu, Xiangxiang Wang, Yuqing Cai, Nyima Tashi,
- Abstract summary: LIR-ASR applies a "Listening-Imagining-Refining" strategy, generating phonetic variants and refining them in context. Experiments on both English and Chinese ASR outputs show that LIR-ASR achieves average reductions in CER/WER of up to 1.5 percentage points.
- Score: 14.208903819874473
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems remain prone to errors that affect downstream applications. In this paper, we propose LIR-ASR, a heuristic optimized iterative correction framework using LLMs, inspired by human auditory perception. LIR-ASR applies a "Listening-Imagining-Refining" strategy, generating phonetic variants and refining them in context. A heuristic optimization with finite state machine (FSM) is introduced to prevent the correction process from being trapped in local optima and rule-based constraints help maintain semantic fidelity. Experiments on both English and Chinese ASR outputs show that LIR-ASR achieves average reductions in CER/WER of up to 1.5 percentage points compared to baselines, demonstrating substantial accuracy gains in transcription.
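As a concrete illustration, here is a minimal Python sketch of the "Listening-Imagining-Refining" loop as the abstract describes it: imagine phonetic variants of each word, keep the variant the context scorer prefers, and stop when progress stalls. The homophone table, the bigram scorer, and the stagnation counter standing in for the FSM heuristic are all assumptions for illustration, not the paper's implementation.

```python
VARIANTS = {"son": ["sun"], "see": ["sea"]}          # hypothetical homophone table
GOOD_BIGRAMS = {("the", "sun"), ("sun", "rises"), ("over", "the"), ("the", "sea")}

def lm_score(words):
    # Toy context scorer: count known-good bigrams. A real system queries an LLM.
    return sum((a, b) in GOOD_BIGRAMS for a, b in zip(words, words[1:]))

def lir_correct(words, max_iters=5, patience=2):
    """Greedy substitution over phonetic variants; a stagnation counter
    stands in for the paper's FSM-based escape from local optima."""
    best, best_score, stale = list(words), lm_score(words), 0
    for _ in range(max_iters):
        improved = False
        for i in range(len(best)):
            for v in VARIANTS.get(best[i], []):      # "imagine" phonetic variants
                cand = best[:i] + [v] + best[i + 1:]
                if lm_score(cand) > best_score:      # "refine" in context
                    best, best_score, improved = cand, lm_score(cand), True
        stale = 0 if improved else stale + 1
        if stale >= patience:                        # FSM stand-in: stop when stuck
            break
    return " ".join(best)

print(lir_correct("the son rises over the see".split()))
# -> "the sun rises over the sea"
```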
Related papers
- Decomposing and Composing: Towards Efficient Vision-Language Continual Learning via Rank-1 Expert Pool in a Single LoRA [50.97792275353563]
We introduce a novel framework that restructures a single Low-Rank Adaptation (LoRA) module as a decomposable Rank-1 Expert Pool. Our method learns to dynamically compose a sparse, task-specific update by selecting from this expert pool, guided by the semantics of the [Guided] token.
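A minimal sketch of the rank-1 expert-pool idea described above: each expert is an outer product u_i v_i^T, and a gate picks a sparse subset per input. The gating rule, shapes, and top-k selection are assumptions standing in for the paper's [Guided]-token mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2                 # hidden size, pool size, experts kept

U = rng.standard_normal((n_experts, d))    # each (u_i, v_i) pair forms one
V = rng.standard_normal((n_experts, d))    # rank-1 expert: u_i v_i^T
gate_w = rng.standard_normal((n_experts, d))

def lora_update(h):
    """Compose a sparse, input-conditioned low-rank update from the pool."""
    scores = gate_w @ h                        # gating logits per expert
    top = np.argsort(scores)[-k:]              # keep the k most relevant experts
    delta = np.zeros(d)
    for i in top:
        delta += scores[i] * U[i] * (V[i] @ h)  # alpha_i * u_i * (v_i^T h)
    return delta

h = rng.standard_normal(d)
print(lora_update(h).shape)                    # (16,)
```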
arXiv Detail & Related papers (2026-01-30T10:54:51Z)
- Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses [71.34350093068473]
This paper introduces a new paradigm for the generative error correction (GER) framework in audio-visual speech recognition (AVSR). Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. Our framework attains up to a 57.7% error rate gain on the LRS2 benchmark over a standard ASR baseline, whereas single-stream GER approaches achieve only a 10% gain.
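A minimal sketch of the dual-hypothesis composition step as summarized: the LLM is shown N-best lists from both modalities and asked to compose a corrected transcript. The prompt template is an assumption; DualHyp's actual template and model are not reproduced here.

```python
def dual_hyp_prompt(asr_nbest, vsr_nbest):
    """Build a correction prompt from ASR and VSR N-best hypotheses."""
    lines = ["Audio-based hypotheses:"]
    lines += [f"  A{i+1}. {h}" for i, h in enumerate(asr_nbest)]
    lines.append("Lip-reading hypotheses:")
    lines += [f"  V{i+1}. {h}" for i, h in enumerate(vsr_nbest)]
    lines.append("Combine the evidence and output the most likely transcript.")
    return "\n".join(lines)

print(dual_hyp_prompt(
    ["the whether is nice", "the weather is nice"],
    ["the weather is nice", "the weather his nice"],
))
```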
arXiv Detail & Related papers (2025-10-15T08:27:16Z)
- OAT-Rephrase: Optimization-Aware Training Data Rephrasing for Zeroth-Order LLM Fine-Tuning [25.76983801886268]
This paper introduces OAT-Rephrase, an Optimization-Aware Training data rephrasing strategy. We show that OAT-Rephrase consistently improves MeZO fine-tuning performance. Our findings suggest that optimization-aware rephrasing serves as a reusable and low-overhead enhancement for zeroth-order tuning regimes.
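For context, here is a toy sketch of the MeZO-style zeroth-order update that OAT-Rephrase targets: gradients are estimated from two forward passes on (rephrased) training text, never by backpropagation. The loss function and the rephrased example are toy assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal(4)              # stand-in for model parameters

def loss(params, example):
    # Toy surrogate loss; a real run would be an LM forward pass on `example`.
    return float(np.sum((params - len(example) * 0.01) ** 2))

def mezo_step(params, example, eps=1e-3, lr=1e-2):
    z = rng.standard_normal(params.shape)   # shared random perturbation direction
    g = (loss(params + eps * z, example) -
         loss(params - eps * z, example)) / (2 * eps)   # projected gradient estimate
    return params - lr * g * z

example = "rephrased training sentence"     # OAT-Rephrase would produce this text
for _ in range(500):
    theta = mezo_step(theta, example)
print(loss(theta, example))                 # should be close to zero
```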
arXiv Detail & Related papers (2025-06-10T02:53:04Z)
- Effective Inference-Free Retrieval for Learned Sparse Representations [19.54810957623511]
Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models to encode text into a learned bag of words. Recently, new efficient, inverted-index-based retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models? We show that regularization can be relaxed to produce more effective LSR encoders.
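A minimal sketch of the FLOPS-style sparsity regularizer commonly used when training LSR encoders, which is the kind of term the summary suggests can be relaxed: penalize the squared mean activation per vocabulary term across a batch. The weights and batch here are toy values.

```python
import numpy as np

def flops_regularizer(batch_weights):
    """batch_weights: (batch, vocab) non-negative learned term weights."""
    mean_per_term = batch_weights.mean(axis=0)
    return float(np.sum(mean_per_term ** 2))

batch = np.abs(np.random.default_rng(0).standard_normal((4, 10)))
reg = flops_regularizer(batch)
# total_loss = ranking_loss + lam * reg; relaxing lam (the paper's point)
# trades index size for effectiveness under inverted-index engines.
print(reg)
```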
arXiv Detail & Related papers (2025-04-30T09:10:46Z)
- RoseRAG: Robust Retrieval-augmented Generation with Small-scale LLMs via Margin-aware Preference Optimization [53.63439735067081]
Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, motivating the use of small-scale LLMs (SLMs). Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose RoseRAG, a robust RAG framework for SLMs built on margin-aware preference optimization.
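A minimal sketch of a margin-aware preference objective in the spirit the summary describes: push the policy's log-likelihood gap between a preferred and a rejected response above a target margin. The exact RoseRAG objective is not reproduced; the log-probabilities below are toy numbers.

```python
import math

def margin_pref_loss(logp_chosen, logp_rejected, margin=1.0, beta=1.0):
    # Logistic loss on the margin-shifted, scaled log-prob gap.
    gap = beta * (logp_chosen - logp_rejected) - margin
    return math.log(1.0 + math.exp(-gap))

print(margin_pref_loss(-12.3, -15.8))  # small loss: gap well above the margin
print(margin_pref_loss(-14.0, -13.5))  # large loss: rejected response preferred
```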
arXiv Detail & Related papers (2025-02-16T04:56:53Z)
- GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems [8.669397145785942]
We propose Generative Error Correction via Retrieval-Augmented Generation (GEC-RAG) to improve ASR accuracy for low-resource domains, such as Persian. GEC-RAG retrieves examples lexically similar to the ASR transcription using the Term Frequency-Inverse Document Frequency (TF-IDF) measure.
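A minimal sketch of the retrieval step as summarized: fetch the transcription's lexically closest (hypothesis, reference) pairs by TF-IDF cosine similarity, to be handed to an LLM as correction demonstrations. The example bank is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

bank = [                                  # (noisy ASR hypothesis, gold reference)
    ("turn of the light", "turn off the light"),
    ("eye need water", "I need water"),
    ("close the the door", "close the door"),
]
vec = TfidfVectorizer().fit([h for h, _ in bank])
bank_mat = vec.transform([h for h, _ in bank])

def retrieve(asr_hypothesis, k=2):
    """Return the k bank entries most lexically similar to the hypothesis."""
    sims = cosine_similarity(vec.transform([asr_hypothesis]), bank_mat)[0]
    return [bank[i] for i in sims.argsort()[::-1][:k]]

print(retrieve("turn of the lights"))
```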
arXiv Detail & Related papers (2025-01-18T11:53:22Z)
- Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG).
We compute a factuality score that can be thresholded to yield a binary decision.
Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
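A minimal sketch of the score-then-threshold decision the summary describes. The scoring function here is a crude token-overlap stand-in; Provenance's actual (cross-encoder based) scorer is not reproduced.

```python
def factuality_score(answer, retrieved_context):
    a = set(answer.lower().split())
    c = set(retrieved_context.lower().split())
    return len(a & c) / max(len(a), 1)   # fraction of answer grounded in context

def is_factual(answer, context, threshold=0.6):
    return factuality_score(answer, context) >= threshold

ctx = "the eiffel tower is 330 metres tall and stands in paris"
print(is_factual("the eiffel tower is 330 metres tall", ctx))        # True
print(is_factual("the tower was built in berlin by napoleon", ctx))  # False
```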
arXiv Detail & Related papers (2024-11-01T20:44:59Z)
- LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition [46.438575751932866]
LipGER is a framework for leveraging visual cues for noise-robust ASR.
We show that LipGER reduces the Word Error Rate by 1.1%-49.2%.
We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs equipped with lip motion cues.
arXiv Detail & Related papers (2024-06-06T18:17:59Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
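A minimal sketch of the combined objective the summary describes: a preference optimization term plus a supervised (SFT) term on the chosen response, the latter acting as the implicit regularizer against overoptimization. The DPO-style term and the mixing weight are generic stand-ins, not the paper's exact formulation.

```python
import math

def combined_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected,
                  beta=0.1, eta=1.0):
    # DPO-style preference term on policy-vs-reference log-ratios.
    gap = beta * ((logp_chosen - ref_logp_chosen) -
                  (logp_rejected - ref_logp_rejected))
    pref = math.log(1.0 + math.exp(-gap))
    sft = -logp_chosen                      # NLL of the preferred response
    return pref + eta * sft

print(combined_loss(-10.0, -14.0, -11.0, -13.0))
```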
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Correction Focused Language Model Training for Speech Recognition [14.246583065323192]
We introduce a novel correction-focused LM training approach that prioritizes ASR-fallible words.
The word-level ASR fallibility score is defined and shaped as a prior word distribution to guide the LM training.
Compared with conventional LMs, correction-focused training achieves up to a 5.5% relative word error rate (WER) reduction in sufficient-text scenarios.
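A minimal sketch of the mechanism as summarized: a per-word fallibility score reweights the LM's cross-entropy so error-prone words dominate training. The scores, probabilities, and weight floor below are toy assumptions.

```python
import math

FALLIBILITY = {"their": 0.9, "there": 0.9, "cat": 0.1}   # assumed prior scores

def weighted_nll(words, probs, floor=0.2):
    """probs[i]: LM probability assigned to words[i] in its context."""
    loss = 0.0
    for w, p in zip(words, probs):
        weight = max(FALLIBILITY.get(w, 0.0), floor)     # prioritize fallible words
        loss += -weight * math.log(p)
    return loss / len(words)

print(weighted_nll(["their", "cat"], [0.4, 0.7]))
```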
arXiv Detail & Related papers (2023-10-17T05:10:39Z)
- On Language Model Integration for RNN Transducer based Speech Recognition [49.84285563767935]
We study various ILM correction-based LM integration methods formulated in a common RNN-T framework.
We provide a decoding interpretation of two major reasons for the performance improvement with ILM correction.
We also propose an exact-ILM training framework by extending the proof given in the hybrid autoregressive transducer.
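A minimal sketch of ILM-corrected shallow fusion during RNN-T decoding, the score combination these methods share: add an external LM score and subtract the estimated internal LM score. The weights and log-probabilities are toy values.

```python
def ilm_corrected_score(logp_rnnt, logp_ext_lm, logp_ilm,
                        lam_ext=0.4, lam_ilm=0.3):
    """Per-token beam-search score with internal LM (ILM) correction."""
    return logp_rnnt + lam_ext * logp_ext_lm - lam_ilm * logp_ilm

# Scoring one candidate token during beam search:
print(ilm_corrected_score(logp_rnnt=-1.2, logp_ext_lm=-0.8, logp_ilm=-0.5))
```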
arXiv Detail & Related papers (2021-10-13T16:30:46Z)