Text-only adaptation in LLM-based ASR through text denoising
- URL: http://arxiv.org/abs/2601.20900v1
- Date: Wed, 28 Jan 2026 10:18:23 GMT
- Title: Text-only adaptation in LLM-based ASR through text denoising
- Authors: Sergio Burdisso, Esaú Villatoro-Tello, Andrés Carofilis, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke
- Abstract summary: Adapting automatic speech recognition systems to new domains using text-only data is a significant yet underexplored challenge. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our solution is lightweight, requiring no architectural changes or additional parameters.
- Score: 14.200885240373509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
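The abstract describes training the LLM to recover clean transcripts from noisy inputs, so that text-only domain data can stand in for the audio-projection task. The paper does not specify its corruption scheme here, so the sketch below uses a simple uniform word-level noise model (random substitutions and deletions) as an assumed stand-in for producing (noisy, clean) pairs from target-domain text:

```python
import random

def add_noise(transcript, sub_rate=0.1, del_rate=0.05, vocab=None, seed=None):
    """Corrupt a clean transcript with random word substitutions and
    deletions, yielding the noisy half of a (noisy, clean) training pair.

    NOTE: this uniform word-level noise model is an illustrative
    assumption; the paper's actual corruption scheme is not given in
    the abstract."""
    rng = random.Random(seed)
    words = transcript.split()
    vocab = vocab or words  # fall back to in-sentence words as a toy vocabulary
    noisy = []
    for w in words:
        r = rng.random()
        if r < del_rate:
            continue  # simulate a dropped word
        elif r < del_rate + sub_rate:
            noisy.append(rng.choice(vocab))  # simulate a misrecognized word
        else:
            noisy.append(w)
    return " ".join(noisy)

clean = "the quick brown fox jumps over the lazy dog"
pair = (add_noise(clean, seed=0), clean)
```

Each such pair can then be used for ordinary causal-LM fine-tuning, with the model conditioning on the noisy text and trained to emit the clean transcript, which is how the denoising task emulates audio projection without touching the projector or adding parameters.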
Related papers
- Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs [22.8529107367745]
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques. We propose PELM, the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task.
arXiv Detail & Related papers (2026-01-29T09:39:28Z) - SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR [58.31068047426522]
Test-Time Adaptation (TTA) aims to mitigate performance degradation under domain shift by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. We propose SUTA-LM, a simple yet effective extension of SUTA with language model rescoring. Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
arXiv Detail & Related papers (2025-06-10T02:50:20Z) - Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning [9.950088874229353]
We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance.
arXiv Detail & Related papers (2025-06-06T01:34:29Z) - Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning [12.676026149146772]
Large language models (LLMs) have reshaped automatic speech recognition (ASR). Fine-tuning such ASR systems on text-only data without paired prompts may diminish the effectiveness of domain-specific knowledge. We propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation.
arXiv Detail & Related papers (2024-12-09T20:22:06Z) - Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition [44.914084799875866]
We show that task vector arithmetic is effective at mitigating the synthetic-to-real gap in speech recognition.
Our proposed method, SYN2REAL, shows an average 10.03% relative improvement in word error rate over baselines.
arXiv Detail & Related papers (2024-06-05T04:25:56Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation [67.98338382984556]
Mapping the two modalities, speech and text, into a shared representation space is an active research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains.
In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality.
Our ASR model can learn unified representations from both modalities better, allowing for domain adaptation using text-only data of the target domain.
arXiv Detail & Related papers (2023-09-04T08:52:59Z) - Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse directly from speech have recently become more promising.
This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition (ASR) models.
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses.
arXiv Detail & Related papers (2023-07-22T17:47:31Z) - Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z) - Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z) - Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.