Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model
in End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2309.10524v1
- Date: Tue, 19 Sep 2023 11:10:50 GMT
- Title: Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model
in End-to-End Speech Recognition
- Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi
- Abstract summary: We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR)
We explore using this zero-shot capability of LLMs to extract linguistic information that can contribute to improving ASR performance.
- Score: 26.043533280932603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel integration of an instruction-tuned large language model
(LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can
perform a wide range of linguistic tasks within zero-shot learning when
provided with a precise instruction or a prompt to guide the text generation
process towards the desired task. We explore using this zero-shot capability of
LLMs to extract linguistic information that can contribute to improving ASR
performance. Specifically, we direct an LLM to correct grammatical errors in an
ASR hypothesis and harness the embedded linguistic knowledge to conduct
end-to-end ASR. The proposed model is built on the hybrid connectionist
temporal classification (CTC) and attention architecture, where an
instruction-tuned LLM (i.e., Llama2) is employed as a front-end of the decoder.
An ASR hypothesis, subject to correction, is obtained from the encoder via CTC
decoding, which is then fed into the LLM along with an instruction. The decoder
subsequently takes as input the LLM embeddings to perform sequence generation,
incorporating acoustic information from the encoder output. Experimental
results and analyses demonstrate that the proposed integration yields promising
performance improvements, and our approach largely benefits from LLM-based
rescoring.
Related papers
- GEM: Empowering LLM for both Embedding Generation and Language Understanding [11.081595808236239]
We propose Generative Embedding large language Model (GEM) to generate high-quality text embeddings.<n>Our method inserts new special token(s) into a text body, and generates summarization embedding of the text by manipulating the attention mask.<n>Our results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
arXiv Detail & Related papers (2025-06-04T18:02:07Z) - LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors [22.845623101142483]
We propose a new paradigm, LegoSLM, that bridges speech encoders and Large Language Models (LLMs)<n>Using the well-performing USM and Gemma models as an example, we demonstrate that our proposed LegoSLM method yields good performance on both ASR and speech translation tasks.
arXiv Detail & Related papers (2025-05-16T15:15:19Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs)<n>We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy.<n>We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - LARR: Large Language Model Aided Real-time Scene Recommendation with Semantic Understanding [19.510385758079966]
Large Language Model Aided Real-time Scene Recommendation(LARR)
This paper introduces Large Language Model Aided Real-time Scene Recommendation(LARR)
arXiv Detail & Related papers (2024-08-21T10:56:26Z) - Investigating Decoder-only Large Language Models for Speech-to-text Translation [39.17113782374464]
Large language models (LLMs) are known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains.
We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation.
Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data.
arXiv Detail & Related papers (2024-07-03T14:42:49Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z) - An Embarrassingly Simple Approach for LLM with Strong ASR Capacity [56.30595787061546]
We focus on solving one of the most important tasks in the field of speech processing, with speech foundation encoders and large language models (LLM)
Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM.
We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task.
arXiv Detail & Related papers (2024-02-13T23:25:04Z) - Boosting Large Language Model for Speech Synthesis: An Empirical Study [86.89548753080432]
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision.
We conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E.
We compare three integration methods between LLMs and speech models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder
arXiv Detail & Related papers (2023-12-30T14:20:04Z) - Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM)
By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of generative large language models (LLM) generated context information.
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - Exploring the Integration of Large Language Models into Automatic Speech
Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems.
Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.