Related papers: Prompting Large Language Models with Audio for General-Purpose Speech Summarization

URL: http://arxiv.org/abs/2406.05968v1
Date: Mon, 10 Jun 2024 02:04:28 GMT
Title: Prompting Large Language Models with Audio for General-Purpose Speech Summarization
Authors: Wonjune Kang, Deb Roy
Abstract summary: We introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret.
Score: 13.415189715216354
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret. Using a dataset with paired speech-text data, the overall system is trained to generate consistent responses to prompts with the same semantic information regardless of the input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, enabling speech summarization by simply prompting the LLM. Unlike prior approaches, our method is able to summarize spoken content from any arbitrary domain, and it can produce summaries in different styles by varying the LLM prompting strategy. Experiments demonstrate that our approach outperforms a cascade baseline of speech recognition followed by LLM text processing.

Related papers

Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities [20.51281468416298]
We propose two novel approaches to integrate directivity into large language models (LLMs). All of the approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner.
arXiv Detail & Related papers (2026-02-06T21:45:48Z)
ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z)
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling [46.60911294356232]
We introduce Text-Aligned Speech Tokenization and Embedding (TASTE). TASTE is a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length.
arXiv Detail & Related papers (2025-04-09T17:14:33Z)
SparQLe: Speech Queries to Text Translation Through LLMs [0.8901073744693314]
This study introduces a novel approach that leverages self-supervised speech representations in combination with instruction-tuned LLMs for speech-to-text translation. Our experiments demonstrate that this method effectively preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs.
arXiv Detail & Related papers (2025-02-13T12:57:15Z)
Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech [29.847183061204436]
Large language models (LLMs) can take into account users' emotions or speaking styles when providing their responses. In this work, we utilize an end-to-end system with a speech encoder. We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech.
arXiv Detail & Related papers (2024-10-02T01:32:47Z)
Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. Our approach utilizes the WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
Boosting Large Language Model for Speech Synthesis: An Empirical Study [86.89548753080432]
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision. We conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining the pre-trained LLMs LLaMA/OPT and the text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder.
arXiv Detail & Related papers (2023-12-30T14:20:04Z)
Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by generative large language models (LLMs). We propose an approach to distill the generated information during fine-tuning of self-supervised speech models. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z)
AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs [27.122094554340194]
We extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation.
arXiv Detail & Related papers (2023-11-12T06:56:14Z)
Instruction-Following Speech Recognition [21.591086644665197]
We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring Large Language Models or pre-trained speech modules.
arXiv Detail & Related papers (2023-09-18T14:59:10Z)
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing [35.31866559807704]
Modality alignment between speech and text remains an open problem. We propose the BLSP approach that bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
arXiv Detail & Related papers (2023-09-02T11:46:05Z)
On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.