Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions
- URL: http://arxiv.org/abs/2406.14701v1
- Date: Thu, 20 Jun 2024 19:50:49 GMT
- Title: Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions
- Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng,
- Abstract summary: We find that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning.
Recognition results averaged over 10 Indic languages show that the proposed prefix-tuning with RNNT loss yields a 12% relative improvement in WER over the baseline with a fine-tuned LLM.
- Score: 28.211967723403987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or alter the inference pipeline. We also propose language-based soft prompting to further improve performance with frozen LLMs. Empirical analysis on a real-time test set from 10 Indic languages demonstrates that our proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Recognition results averaged over the 10 Indic languages show that the proposed prefix-tuning with RNNT loss results in a 12% relative improvement in WER over the baseline with a fine-tuned LLM. Our proposed approaches with the frozen LLM lead to a 31% relative improvement over basic soft-prompting prefixLM.
Related papers
- RAC: Efficient LLM Factuality Correction with Retrieval Augmentation [8.207682890286957]
Large Language Models (LLMs) exhibit impressive results across a wide range of natural language processing (NLP) tasks, yet they can often produce factually incorrect outputs.
This paper introduces a simple but effective low-latency post-correction method, Retrieval Augmented Correction (RAC), aimed at enhancing the factual performance of LLMs without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-10-21T06:11:38Z) - Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs [20.97172337899685]
We propose pre-training large language models (LLMs) on Pinyin embedding sequences to generate corresponding Chinese characters.
This step enables the LLM to adapt to generating text from pronunciation features before encountering real speech data.
On the AISHELL-1 corpus, our approach yields a 9.5% relative improvement in ASR tasks compared to the baseline.
arXiv Detail & Related papers (2024-09-24T12:06:31Z) - Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints [20.844061807562436]
We propose SENSE, a novel prompting approach that embeds semantic hints within the prompt.
Experiments show that SENSE consistently improves LLMs' performance across various tasks.
arXiv Detail & Related papers (2024-09-22T14:35:09Z) - From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning [89.9648814145473]
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses.
Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue.
We propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective.
arXiv Detail & Related papers (2024-09-03T07:01:37Z) - ProGRes: Prompted Generative Rescoring on ASR n-Best [10.536469576235223]
Large Language Models (LLMs) have shown their ability to improve the performance of speech recognizers by effectively rescoring the n-best hypotheses generated during the beam search process.
This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs.
arXiv Detail & Related papers (2024-08-30T19:14:17Z) - Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 [61.189875635090225]
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST).
arXiv Detail & Related papers (2024-06-24T16:38:17Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models [9.688626139309013]
Retrieval-Augmented Generation is considered as a means to improve the trustworthiness of text generation from large language models.
In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers.
We introduce a novel optimization technique called Gradient Guided Prompt Perturbation.
arXiv Detail & Related papers (2024-02-11T12:25:41Z) - Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on a range of recent LLMs demonstrate that our approach achieves up to a 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z) - LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) are leveraging human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method to refine an LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.