Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning
- URL: http://arxiv.org/abs/2412.06967v1
- Date: Mon, 09 Dec 2024 20:22:06 GMT
- Title: Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning
- Authors: Yingyi Ma, Zhe Liu, Ozlem Kalinli
- Abstract summary: Large Language Models (LLMs) have transformed Automatic Speech Recognition (ASR).
Fine-tuning such ASR models on text-only data without paired prompts may diminish the effectiveness of domain-specific knowledge.
We propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation.
- Score: 12.676026149146772
- License:
- Abstract: The advent of Large Language Models (LLMs) has transformed Automatic Speech Recognition (ASR): prompting an LLM with audio embeddings to generate transcriptions has become the new state-of-the-art approach. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly enhance ASR performance on domain adaptation tasks. While LLM-based ASR can naturally incorporate more text corpora by fine-tuning the LLM decoder, fine-tuning such ASR on text-only data without paired prompts may diminish the effectiveness of domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with our proposed method achieved up to a 9% relative Word Error Rate (WER) reduction and up to an 18% relative Entity Error Rate (EER) reduction on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion can further improve the EER by a relative 2-5%.
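The abstract does not detail the two steps, so the sketch below is only a hedged illustration of text-only soft prompt adaptation under stated assumptions: a small GPT-2 stands in for the LLM decoder, learnable prompt vectors are prepended to the token embeddings, an assumed first step trains only the prompt on domain text, and an assumed second step also updates the decoder. The audio-embedding prompting used at recognition time is omitted, and names such as `soft_prompt` and `domain_lm_loss` are illustrative, not from the paper.

```python
# Hedged sketch of two-step soft prompt fine-tuning on domain text only.
# Assumptions (not stated in the abstract): GPT-2 stands in for the LLM
# decoder, step 1 trains the prompt with the decoder frozen, and step 2
# unfreezes the decoder as well.
import torch
from torch import nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

n_prompt = 16                                   # number of soft prompt vectors
soft_prompt = nn.Parameter(torch.randn(n_prompt, model.config.n_embd) * 0.02)

def domain_lm_loss(text: str, train_decoder: bool) -> torch.Tensor:
    """Next-token loss on one domain sentence, conditioned on the soft prompt."""
    ids = tokenizer(text, return_tensors="pt").input_ids        # (1, T)
    tok_emb = model.transformer.wte(ids)                        # (1, T, D)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)
    # No loss on the prompt positions (-100 is the ignore index).
    labels = torch.cat(
        [torch.full((1, n_prompt), -100, dtype=torch.long), ids], dim=1)
    model.requires_grad_(train_decoder)
    return model(inputs_embeds=inputs_embeds, labels=labels).loss

domain_text = ["please page doctor smith in the cardiology unit"]  # toy domain data

# Step 1 (assumed): learn the soft prompt only; the decoder stays frozen.
opt = torch.optim.AdamW([soft_prompt], lr=1e-3)
for sent in domain_text:
    opt.zero_grad()
    domain_lm_loss(sent, train_decoder=False).backward()
    opt.step()

# Step 2 (assumed): fine-tune the decoder together with the learned prompt.
opt = torch.optim.AdamW(list(model.parameters()) + [soft_prompt], lr=1e-5)
for sent in domain_text:
    opt.zero_grad()
    domain_lm_loss(sent, train_decoder=True).backward()
    opt.step()
```

At inference, the learned prompt (and, in the actual system, the audio embeddings) would be prepended in the same way before decoding.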
Related papers
- Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition [17.376550014426623]
This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs).
We propose "delayed fusion," which applies LLM scores to ASR hypotheses with a delay during decoding.
We demonstrate that delayed fusion provides improved decoding speed and accuracy compared to shallow fusion and N-best rescoring; a generic fusion/rescoring sketch appears after this list.
arXiv Detail & Related papers (2025-01-16T03:01:50Z)
- Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs [20.97172337899685]
We propose pre-training large language models (LLMs) on Pinyin embedding sequences to generate corresponding Chinese characters.
This step enables the LLM to adapt to generating text from pronunciation features before encountering real speech data.
On the AISHELL-1 corpus, our approach yields a 9.5% relative improvement on ASR tasks compared to the baseline.
arXiv Detail & Related papers (2024-09-24T12:06:31Z)
- Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 [61.189875635090225]
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST).
arXiv Detail & Related papers (2024-06-24T16:38:17Z)
- One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
- Data Augmentation for Text-based Person Retrieval Using Large Language Models [16.120524750964016]
Text-based Person Retrieval (TPR) aims to retrieve person images that match the description given a text query.
It is difficult to construct a large-scale, high-quality TPR dataset due to expensive annotation and privacy protection.
This paper proposes an LLM-based Data Augmentation (LLM-DA) method for TPR.
arXiv Detail & Related papers (2024-05-20T11:57:50Z)
- Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% in relative terms.
arXiv Detail & Related papers (2024-02-28T08:24:38Z)
- Improving Cross-Domain Low-Resource Text Generation through LLM Post-Editing: A Programmer-Interpreter Approach [50.400999859808984]
Post-editing has proven effective in improving the quality of text generated by large language models (LLMs).
We propose a neural programmer-interpreter approach that preserves the domain generalization ability of LLMs when editing their output.
Experiments demonstrate that the programmer-interpreter significantly enhances GPT-3.5's performance in logical form-to-text conversion and low-resource machine translation.
arXiv Detail & Related papers (2024-02-07T06:13:14Z)
- Correction Focused Language Model Training for Speech Recognition [14.246583065323192]
We introduce a novel correction-focused LM training approach that prioritizes ASR-fallible words.
The word-level ASR fallibility score is defined and shaped as a prior word distribution to guide the LM training.
Compared with conventional LMs, correction-focused training achieves up to a 5.5% relative word error rate (WER) reduction when sufficient text is available.
arXiv Detail & Related papers (2023-10-17T05:10:39Z)
- Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition [33.07184218085399]
We propose two zero-shot ASR domain adaptation methods using LLaMA that require only a domain-specific text prompt.
Experiments show that, with only one domain prompt, both methods can effectively reduce word error rates (WER) on out-of-domain TedLium-2 and SPGI datasets.
arXiv Detail & Related papers (2023-06-28T08:29:00Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
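Both the domain-specific LM fusion reported in the abstract above and the shallow fusion / N-best rescoring baselines in the delayed-fusion entry rest on the same score-interpolation idea: combine the ASR score with a weighted external LM score. The sketch below is generic and illustrative rather than any listed paper's exact recipe; the `Hypothesis` container, the `lm_logprob` callable, and the weight `lam` are assumptions.

```python
# Hedged sketch of N-best rescoring with an external (domain) LM score.
# Generic shallow-fusion-style interpolation; weights and names are assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    text: str
    asr_logprob: float          # log-probability from the ASR decoder

def rescore_nbest(nbest: List[Hypothesis],
                  lm_logprob: Callable[[str], float],
                  lam: float = 0.3) -> Hypothesis:
    """Pick the hypothesis maximizing asr_logprob + lam * lm_logprob(text)."""
    return max(nbest, key=lambda h: h.asr_logprob + lam * lm_logprob(h.text))

# Toy usage with a stand-in domain LM that favors the word "cardiology".
nbest = [Hypothesis("call the card ology clinic", -4.1),
         Hypothesis("call the cardiology clinic", -4.3)]
best = rescore_nbest(nbest, lambda t: 0.0 if "cardiology" in t else -2.0)
print(best.text)   # -> "call the cardiology clinic"
```

Delayed fusion differs from this in when the LM scores are applied during first-pass decoding, which the cited summary does not specify in enough detail to sketch here.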