Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs
- URL: http://arxiv.org/abs/2509.15095v2
- Date: Sat, 20 Sep 2025 12:01:56 GMT
- Title: Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs
- Authors: Yutong Liu, Ziyue Zhang, Cheng Huang, Yongbin Yu, Xiangxiang Wang, Yuqing Cai, Nyima Tashi,
- Abstract summary: LIR-ASR applies a "Listening-Imagining-Refining" strategy, generating phonetic variants and refining them in context. Experiments on both English and Chinese ASR outputs show that LIR-ASR achieves average reductions in CER/WER of up to 1.5 percentage points.
- Score: 14.208903819874473
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems remain prone to errors that affect downstream applications. In this paper, we propose LIR-ASR, a heuristic optimized iterative correction framework using LLMs, inspired by human auditory perception. LIR-ASR applies a "Listening-Imagining-Refining" strategy, generating phonetic variants and refining them in context. A heuristic optimization with finite state machine (FSM) is introduced to prevent the correction process from being trapped in local optima and rule-based constraints help maintain semantic fidelity. Experiments on both English and Chinese ASR outputs show that LIR-ASR achieves average reductions in CER/WER of up to 1.5 percentage points compared to baselines, demonstrating substantial accuracy gains in transcription.
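As a concrete illustration, here is a minimal Python sketch of the "Listening-Imagining-Refining" loop as the abstract describes it: imagine phonetic variants of each word, keep the variant the context scorer prefers, and stop when progress stalls. The homophone table, the bigram scorer, and the stagnation counter standing in for the FSM heuristic are all assumptions for illustration, not the paper's implementation.

```python
VARIANTS = {"son": ["sun"], "see": ["sea"]}          # hypothetical homophone table
GOOD_BIGRAMS = {("the", "sun"), ("sun", "rises"), ("over", "the"), ("the", "sea")}

def lm_score(words):
    # Toy context scorer: count known-good bigrams. A real system queries an LLM.
    return sum((a, b) in GOOD_BIGRAMS for a, b in zip(words, words[1:]))

def lir_correct(words, max_iters=5, patience=2):
    """Greedy substitution over phonetic variants; a stagnation counter
    stands in for the paper's FSM-based escape from local optima."""
    best, best_score, stale = list(words), lm_score(words), 0
    for _ in range(max_iters):
        improved = False
        for i in range(len(best)):
            for v in VARIANTS.get(best[i], []):      # "imagine" phonetic variants
                cand = best[:i] + [v] + best[i + 1:]
                if lm_score(cand) > best_score:      # "refine" in context
                    best, best_score, improved = cand, lm_score(cand), True
        stale = 0 if improved else stale + 1
        if stale >= patience:                        # FSM stand-in: stop when stuck
            break
    return " ".join(best)

print(lir_correct("the son rises over the see".split()))
# -> "the sun rises over the sea"
```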
Related papers
- Decomposing and Composing: Towards Efficient Vision-Language Continual Learning via Rank-1 Expert Pool in a Single LoRA [50.97792275353563]
We introduce a novel framework that restructures a single Low-Rank Adaptation (LoRA) module as a decomposable Rank-1 Expert Pool. Our method learns to dynamically compose a sparse, task-specific update by selecting from this expert pool, guided by the semantics of the [Guided] token.
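A minimal sketch of the rank-1 expert-pool idea described above: each expert is an outer product u_i v_i^T, and a gate picks a sparse subset per input. The gating rule, shapes, and top-k selection are assumptions standing in for the paper's [Guided]-token mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2                 # hidden size, pool size, experts kept

U = rng.standard_normal((n_experts, d))    # each (u_i, v_i) pair forms one
V = rng.standard_normal((n_experts, d))    # rank-1 expert: u_i v_i^T
gate_w = rng.standard_normal((n_experts, d))

def lora_update(h):
    """Compose a sparse, input-conditioned low-rank update from the pool."""
    scores = gate_w @ h                        # gating logits per expert
    top = np.argsort(scores)[-k:]              # keep the k most relevant experts
    delta = np.zeros(d)
    for i in top:
        delta += scores[i] * U[i] * (V[i] @ h)  # alpha_i * u_i * (v_i^T h)
    return delta

h = rng.standard_normal(d)
print(lora_update(h).shape)                    # (16,)
```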
arXiv Detail & Related papers (2026-01-30T10:54:51Z)
- Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses [71.34350093068473]
This paper introduces a new paradigm for the generative error correction (GER) framework in audio-visual speech recognition (AVSR). Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. Our framework attains up to a 57.7% error rate gain on the LRS2 benchmark over a standard ASR baseline, whereas single-stream GER approaches achieve only a 10% gain.
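A minimal sketch of the dual-hypothesis composition step as summarized: the LLM is shown N-best lists from both modalities and asked to compose a corrected transcript. The prompt template is an assumption; DualHyp's actual template and model are not reproduced here.

```python
def dual_hyp_prompt(asr_nbest, vsr_nbest):
    """Build a correction prompt from ASR and VSR N-best hypotheses."""
    lines = ["Audio-based hypotheses:"]
    lines += [f"  A{i+1}. {h}" for i, h in enumerate(asr_nbest)]
    lines.append("Lip-reading hypotheses:")
    lines += [f"  V{i+1}. {h}" for i, h in enumerate(vsr_nbest)]
    lines.append("Combine the evidence and output the most likely transcript.")
    return "\n".join(lines)

print(dual_hyp_prompt(
    ["the whether is nice", "the weather is nice"],
    ["the weather is nice", "the weather his nice"],
))
```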
arXiv Detail & Related papers (2025-10-15T08:27:16Z)
- OAT-Rephrase: Optimization-Aware Training Data Rephrasing for Zeroth-Order LLM Fine-Tuning [25.76983801886268]
This paper introduces OAT-Rephrase, an Optimization-Aware Training data rephrasing strategy. We show that OAT-Rephrase consistently improves MeZO fine-tuning performance. Our findings suggest that optimization-aware rephrasing serves as a reusable and low-overhead enhancement for zeroth-order tuning regimes.
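For context, here is a toy sketch of the MeZO-style zeroth-order update that OAT-Rephrase targets: gradients are estimated from two forward passes on (rephrased) training text, never by backpropagation. The loss function and the rephrased example are toy assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal(4)              # stand-in for model parameters

def loss(params, example):
    # Toy surrogate loss; a real run would be an LM forward pass on `example`.
    return float(np.sum((params - len(example) * 0.01) ** 2))

def mezo_step(params, example, eps=1e-3, lr=1e-2):
    z = rng.standard_normal(params.shape)   # shared random perturbation direction
    g = (loss(params + eps * z, example) -
         loss(params - eps * z, example)) / (2 * eps)   # projected gradient estimate
    return params - lr * g * z

example = "rephrased training sentence"     # OAT-Rephrase would produce this text
for _ in range(500):
    theta = mezo_step(theta, example)
print(loss(theta, example))                 # should be close to zero
```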
arXiv Detail & Related papers (2025-06-10T02:53:04Z)
- Effective Inference-Free Retrieval for Learned Sparse Representations [19.54810957623511]
Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models to encode text into a learned bag of words. Recently, new efficient, inverted-index-based retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models? We show that regularization can be relaxed to produce more effective LSR encoders.
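A minimal sketch of the FLOPS-style sparsity regularizer commonly used when training LSR encoders, which is the kind of term the summary suggests can be relaxed: penalize the squared mean activation per vocabulary term across a batch. The weights and batch here are toy values.

```python
import numpy as np

def flops_regularizer(batch_weights):
    """batch_weights: (batch, vocab) non-negative learned term weights."""
    mean_per_term = batch_weights.mean(axis=0)
    return float(np.sum(mean_per_term ** 2))

batch = np.abs(np.random.default_rng(0).standard_normal((4, 10)))
reg = flops_regularizer(batch)
# total_loss = ranking_loss + lam * reg; relaxing lam (the paper's point)
# trades index size for effectiveness under inverted-index engines.
print(reg)
```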
arXiv Detail & Related papers (2025-04-30T09:10:46Z)
- RoseRAG: Robust Retrieval-augmented Generation with Small-scale LLMs via Margin-aware Preference Optimization [53.63439735067081]
Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, motivating the use of small-scale LLMs (SLMs). Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose RoseRAG, a robust RAG framework for SLMs built on margin-aware preference optimization.
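A minimal sketch of a margin-aware preference objective in the spirit the summary describes: push the policy's log-likelihood gap between a preferred and a rejected response above a target margin. The exact RoseRAG objective is not reproduced; the log-probabilities below are toy numbers.

```python
import math

def margin_pref_loss(logp_chosen, logp_rejected, margin=1.0, beta=1.0):
    # Logistic loss on the margin-shifted, scaled log-prob gap.
    gap = beta * (logp_chosen - logp_rejected) - margin
    return math.log(1.0 + math.exp(-gap))

print(margin_pref_loss(-12.3, -15.8))  # small loss: gap well above the margin
print(margin_pref_loss(-14.0, -13.5))  # large loss: rejected response preferred
```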
arXiv Detail & Related papers (2025-02-16T04:56:53Z)
- GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems [8.669397145785942]
We propose Generative Error Correction via Retrieval-Augmented Generation (GEC-RAG) to improve ASR accuracy for low-resource domains, such as Persian. GEC-RAG retrieves examples lexically similar to the ASR transcription using the Term Frequency-Inverse Document Frequency (TF-IDF) measure.
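A minimal sketch of the retrieval step as summarized: fetch the transcription's lexically closest (hypothesis, reference) pairs by TF-IDF cosine similarity, to be handed to an LLM as correction demonstrations. The example bank is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

bank = [                                  # (noisy ASR hypothesis, gold reference)
    ("turn of the light", "turn off the light"),
    ("eye need water", "I need water"),
    ("close the the door", "close the door"),
]
vec = TfidfVectorizer().fit([h for h, _ in bank])
bank_mat = vec.transform([h for h, _ in bank])

def retrieve(asr_hypothesis, k=2):
    """Return the k bank entries most lexically similar to the hypothesis."""
    sims = cosine_similarity(vec.transform([asr_hypothesis]), bank_mat)[0]
    return [bank[i] for i in sims.argsort()[::-1][:k]]

print(retrieve("turn of the lights"))
```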
arXiv Detail & Related papers (2025-01-18T11:53:22Z)
- Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG).
We compute a factuality score that can be thresholded to yield a binary decision.
Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
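A minimal sketch of the score-then-threshold decision the summary describes. The scoring function here is a crude token-overlap stand-in; Provenance's actual (cross-encoder based) scorer is not reproduced.

```python
def factuality_score(answer, retrieved_context):
    a = set(answer.lower().split())
    c = set(retrieved_context.lower().split())
    return len(a & c) / max(len(a), 1)   # fraction of answer grounded in context

def is_factual(answer, context, threshold=0.6):
    return factuality_score(answer, context) >= threshold

ctx = "the eiffel tower is 330 metres tall and stands in paris"
print(is_factual("the eiffel tower is 330 metres tall", ctx))        # True
print(is_factual("the tower was built in berlin by napoleon", ctx))  # False
```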
arXiv Detail & Related papers (2024-11-01T20:44:59Z)
- LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition [46.438575751932866]
LipGER is a framework for leveraging visual cues for noise-robust ASR.
We show that LipGER reduces the Word Error Rate by 1.1%-49.2%.
We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs equipped with lip motion cues.
arXiv Detail & Related papers (2024-06-06T18:17:59Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
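A minimal sketch of the combined objective the summary describes: a preference optimization term plus a supervised (SFT) term on the chosen response, the latter acting as the implicit regularizer against overoptimization. The DPO-style term and the mixing weight are generic stand-ins, not the paper's exact formulation.

```python
import math

def combined_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected,
                  beta=0.1, eta=1.0):
    # DPO-style preference term on policy-vs-reference log-ratios.
    gap = beta * ((logp_chosen - ref_logp_chosen) -
                  (logp_rejected - ref_logp_rejected))
    pref = math.log(1.0 + math.exp(-gap))
    sft = -logp_chosen                      # NLL of the preferred response
    return pref + eta * sft

print(combined_loss(-10.0, -14.0, -11.0, -13.0))
```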
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Correction Focused Language Model Training for Speech Recognition [14.246583065323192]
We introduce a novel correction-focused LM training approach that prioritizes ASR-fallible words.
The word-level ASR fallibility score is defined and shaped as a prior word distribution to guide the LM training.
Compared with conventional LMs, correction-focused training achieves up to a 5.5% relative word error rate (WER) reduction in sufficient-text scenarios.
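A minimal sketch of the mechanism as summarized: a per-word fallibility score reweights the LM's cross-entropy so error-prone words dominate training. The scores, probabilities, and weight floor below are toy assumptions.

```python
import math

FALLIBILITY = {"their": 0.9, "there": 0.9, "cat": 0.1}   # assumed prior scores

def weighted_nll(words, probs, floor=0.2):
    """probs[i]: LM probability assigned to words[i] in its context."""
    loss = 0.0
    for w, p in zip(words, probs):
        weight = max(FALLIBILITY.get(w, 0.0), floor)     # prioritize fallible words
        loss += -weight * math.log(p)
    return loss / len(words)

print(weighted_nll(["their", "cat"], [0.4, 0.7]))
```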
arXiv Detail & Related papers (2023-10-17T05:10:39Z)
- On Language Model Integration for RNN Transducer based Speech Recognition [49.84285563767935]
We study various ILM correction-based LM integration methods formulated in a common RNN-T framework.
We provide a decoding interpretation of two major reasons for the performance improvement with ILM correction.
We also propose an exact-ILM training framework by extending the proof given in the hybrid autoregressive transducer.
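A minimal sketch of ILM-corrected shallow fusion during RNN-T decoding, the score combination these methods share: add an external LM score and subtract the estimated internal LM score. The weights and log-probabilities are toy values.

```python
def ilm_corrected_score(logp_rnnt, logp_ext_lm, logp_ilm,
                        lam_ext=0.4, lam_ilm=0.3):
    """Per-token beam-search score with internal LM (ILM) correction."""
    return logp_rnnt + lam_ext * logp_ext_lm - lam_ilm * logp_ilm

# Scoring one candidate token during beam search:
print(ilm_corrected_score(logp_rnnt=-1.2, logp_ext_lm=-0.8, logp_ilm=-0.5))
```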
arXiv Detail & Related papers (2021-10-13T16:30:46Z)