Text-only adaptation in LLM-based ASR through text denoising
- URL: http://arxiv.org/abs/2601.20900v1
- Date: Wed, 28 Jan 2026 10:18:23 GMT
- Title: Text-only adaptation in LLM-based ASR through text denoising
- Authors: Sergio Burdisso, Esaú Villatoro-Tello, Andrés Carofilis, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke
- Abstract summary: Adapting automatic speech recognition systems to new domains using text-only data is a significant yet underexplored challenge. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our solution is lightweight, requiring no architectural changes or additional parameters.
- Score: 14.200885240373509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
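The abstract describes training the LLM to recover clean transcripts from noisy inputs, so that text-only domain data can stand in for the audio-projection task. The paper does not specify its corruption scheme here, so the sketch below uses a simple uniform word-level noise model (random substitutions and deletions) as an assumed stand-in for producing (noisy, clean) pairs from target-domain text:

```python
import random

def add_noise(transcript, sub_rate=0.1, del_rate=0.05, vocab=None, seed=None):
    """Corrupt a clean transcript with random word substitutions and
    deletions, yielding the noisy half of a (noisy, clean) training pair.

    NOTE: this uniform word-level noise model is an illustrative
    assumption; the paper's actual corruption scheme is not given in
    the abstract."""
    rng = random.Random(seed)
    words = transcript.split()
    vocab = vocab or words  # fall back to in-sentence words as a toy vocabulary
    noisy = []
    for w in words:
        r = rng.random()
        if r < del_rate:
            continue  # simulate a dropped word
        elif r < del_rate + sub_rate:
            noisy.append(rng.choice(vocab))  # simulate a misrecognized word
        else:
            noisy.append(w)
    return " ".join(noisy)

clean = "the quick brown fox jumps over the lazy dog"
pair = (add_noise(clean, seed=0), clean)
```

Each such pair can then be used for ordinary causal-LM fine-tuning, with the model conditioning on the noisy text and trained to emit the clean transcript, which is how the denoising task emulates audio projection without touching the projector or adding parameters.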
Related papers
- Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs [22.8529107367745]
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques. We propose PELM, the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task.
arXiv Detail & Related papers (2026-01-29T09:39:28Z) - SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR [58.31068047426522]
Test-Time Adaptation (TTA) aims to mitigate performance degradation under domain shift by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. We propose SUTA-LM, a simple yet effective extension of SUTA with language model rescoring. Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
arXiv Detail & Related papers (2025-06-10T02:50:20Z) - Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning [9.950088874229353]
We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance.
arXiv Detail & Related papers (2025-06-06T01:34:29Z) - Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning [12.676026149146772]
Large language models (LLMs) have reshaped automatic speech recognition (ASR). Fine-tuning such ASR systems on text-only data without paired prompts may diminish the effectiveness of domain-specific knowledge. We propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation.
arXiv Detail & Related papers (2024-12-09T20:22:06Z) - Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition [44.914084799875866]
We show that task vector arithmetic is effective at mitigating the synthetic-to-real gap in speech recognition.
Our proposed method, SYN2REAL, shows an average 10.03% relative improvement in word error rate over baselines.
arXiv Detail & Related papers (2024-06-05T04:25:56Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation [67.98338382984556]
Mapping the two modalities, speech and text, into a shared representation space is an active research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains.
In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality.
Our ASR model can learn unified representations from both modalities better, allowing for domain adaptation using text-only data of the target domain.
arXiv Detail & Related papers (2023-09-04T08:52:59Z) - Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse directly from speech have recently become more promising.
This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition (ASR) models.
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses.
arXiv Detail & Related papers (2023-07-22T17:47:31Z) - Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z) - Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z) - Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.