Related papers: WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

URL: http://arxiv.org/abs/2509.10452v1
Date: Fri, 12 Sep 2025 17:59:09 GMT
Title: WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Authors: Akshat Pandey, Karun Kumar, Raphael Tang,
Abstract summary: WhisTLE is a text-only adaptation method for pretrained speech recognition models.<n>It reduces word error rate (WER) by 12.3% relative to TTS-only adaptation.<n>It outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
Score: 6.199846360255783
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.

Related papers

Text-Aware Adapter for Few-Shot Keyword Spotting [13.040457187781671]
We propose a novel few-shot transfer learning method, called text-aware adapter (TA-adapter)<n>In our experiments, the TA-adapter demonstrated significant performance improvements across 35 distinct keywords from the Google Speech Commands V2 dataset.
arXiv Detail & Related papers (2024-12-24T03:54:40Z)
Full-text Error Correction for Chinese Speech Recognition with Large Language Model [11.287933170894311]
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR)<n>This paper investigates the effectiveness of LLMs for error correction in full-text generated by ASR systems from longer speech recordings.
arXiv Detail & Related papers (2024-09-12T06:50:45Z)
Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer [12.417314740402587]
We present a method to learn Unified Speech-Text Representation in Conformer Transducer(USTR-CT) to enable fast domain adaptation using the text-only corpus. Experiments on adapting LibriSpeech to SPGISpeech show the proposed method reduces the word error rate(WER) by relatively 44% on the target domain.
arXiv Detail & Related papers (2023-06-07T00:33:02Z)
Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T) We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces. Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator [17.44686265224974]
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. We demonstrate that the proposed training method significantly improves ASR accuracy compared to the system trained on transcribed speech only.
arXiv Detail & Related papers (2023-02-27T18:47:55Z)
SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text. While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation. Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker. We generate the mel-spectrogram of the edited speech with a transformer-based decoder. It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR) The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem. Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation. Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction. In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.