An End-to-End Speech Summarization Using Large Language Model
- URL: http://arxiv.org/abs/2407.02005v1
- Date: Tue, 2 Jul 2024 07:22:57 GMT
- Title: An End-to-End Speech Summarization Using Large Language Model
- Authors: Hengchao Shang, Zongyao Li, Jiaxin Guo, Shaojun Li, Zhiqiang Rao, Yuanchang Luo, Daimeng Wei, Hao Yang
- Abstract summary: Speech Summarization (SSum) aims to generate human-like text summaries from spoken content.
Research on large language models (LLMs) and multimodal information fusion has provided new insights.
We propose an end-to-end SSum model that utilizes Q-Former as a connector for the audio-text modality.
- Score: 7.562198375754054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Abstractive Speech Summarization (SSum) aims to generate human-like text summaries from spoken content. It encounters difficulties in handling long speech input and capturing the intricate cross-modal mapping between long speech inputs and short text summaries. Research on large language models (LLMs) and multimodal information fusion has provided new insights for addressing these challenges. In this paper, we propose an end-to-end SSum model that utilizes Q-Former as a connector for the audio-text modality and employs LLMs to generate text summaries directly from speech features. We adopt a multi-stage training approach that includes LLM-based ASR and Text Summarization (TSum) as auxiliary tasks. The ASR task is used to align feature spaces and enhance the LLM's ability to handle longer speech. Then, we utilize a curriculum learning strategy to facilitate the model's transition from TSum to SSum. Finally, our model achieves competitive performance on the How-2 dataset.
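The pipeline the abstract describes can be sketched roughly as below. This is a minimal illustration, not the authors' released code: the Q-Former is reduced to learnable queries cross-attending to speech features through standard transformer decoder blocks, the names (`QFormerConnector`, `SpeechSummarizer`, `num_queries`) are invented, and a Hugging Face-style causal LM interface is assumed.

```python
# Minimal sketch: speech features -> Q-Former-style connector -> LLM.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    """Reduced Q-Former: a fixed set of learnable queries cross-attends to
    long speech-encoder features and emits a short sequence in the LLM's
    input embedding space."""
    def __init__(self, d_speech=1024, d_llm=4096, num_queries=64,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_speech) * 0.02)
        block = nn.TransformerDecoderLayer(d_model=d_speech, nhead=num_heads,
                                           batch_first=True)
        self.blocks = nn.TransformerDecoder(block, num_layers=num_layers)
        self.proj = nn.Linear(d_speech, d_llm)   # map into LLM embedding space

    def forward(self, speech_feats):              # (B, T_speech, d_speech)
        q = self.queries.unsqueeze(0).expand(speech_feats.size(0), -1, -1)
        out = self.blocks(tgt=q, memory=speech_feats)  # queries attend to speech
        return self.proj(out)                     # (B, num_queries, d_llm)

class SpeechSummarizer(nn.Module):
    """Connector output is prepended to the prompt embeddings; the LLM then
    decodes the summary directly from speech features."""
    def __init__(self, connector, llm, embed_tokens):
        super().__init__()
        self.connector, self.llm, self.embed_tokens = connector, llm, embed_tokens

    def forward(self, speech_feats, prompt_ids, labels=None):
        speech_emb = self.connector(speech_feats)
        inputs = torch.cat([speech_emb, self.embed_tokens(prompt_ids)], dim=1)
        if labels is not None:                    # ignore loss on speech positions
            pad = torch.full(speech_emb.shape[:2], -100,
                             dtype=labels.dtype, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.llm(inputs_embeds=inputs, labels=labels)
```

Under the paper's multi-stage recipe, the same interface would first be trained with ASR targets to align the two feature spaces, then moved from TSum to SSum targets along the curriculum schedule.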
Related papers
- Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation [14.746190461312036]
Large language models (LLMs) have revolutionized natural language processing (NLP).
We introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance.
We further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-experts architecture.
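The mixture-of-experts idea mentioned in this entry can be illustrated with a toy soft-routed feed-forward block; the sizes and dense (rather than top-k) routing are assumptions made for brevity, and in MoLE-Llama the experts sit over PEFT adapters rather than full FFNs.

```python
# Toy mixture-of-experts feed-forward block; illustrative only.
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts))

    def forward(self, x):                              # (B, T, d_model)
        weights = self.gate(x).softmax(dim=-1)         # (B, T, E) soft routing
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)
```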
arXiv Detail & Related papers (2024-10-27T04:28:57Z)
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach that applies low-rank adaptation (LoRA) to the large language model (LLM) backbone; a minimal configuration sketch follows below.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
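A minimal sketch of the kind of LoRA setup this entry mentions, using the `peft` library; the base checkpoint, rank, and target modules below are placeholder assumptions, not the paper's settings.

```python
# Illustrative LoRA configuration for SFT of an LLM backbone.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()           # only the low-rank adapters train
```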
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for large speech-text model (LSM) training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
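A rough sketch of the self-powered loop as described: the current model pseudo-labels unlabeled speech, and the resulting pairs are folded back into the instruction-tuning mixture. `model.transcribe` and `build_instruction_example` are hypothetical helpers, not an API from the paper.

```python
# Sketch of self-powered data augmentation via model-generated ASR labels.
def self_powered_round(model, unlabeled_audio,
                       instruction="Transcribe the audio."):
    augmented = []
    for audio in unlabeled_audio:
        hyp = model.transcribe(audio)   # hypothetical: model labels its own data
        augmented.append(build_instruction_example(   # hypothetical helper
            instruction=instruction, audio=audio, target=hyp))
    return augmented                    # merged back into the SFT mixture
```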
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
- Recent Advances in Speech Language Models: A Survey [45.968078636811356]
Speech Language Models (SpeechLMs) are end-to-end models that generate speech without converting from text.
This paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs.
arXiv Detail & Related papers (2024-10-01T21:48:12Z)
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Speech Recognition Rescoring with Large Speech-Text Foundation Models [20.145389016219106]
Large language models (LLMs) have demonstrated the ability to understand human language by leveraging large amounts of text data.
Automatic speech recognition (ASR) systems are often limited by available transcribed speech data.
Recent multi-modal large language models have demonstrated strong spoken language understanding.
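The title points to classic second-pass n-best rescoring; below is a generic sketch assuming a Hugging Face-style causal LM, with a simple linear interpolation of first-pass and LLM scores (the weight `lam` is illustrative, not the paper's recipe).

```python
# Generic n-best rescoring with an LLM log-likelihood second pass.
import torch

@torch.no_grad()
def llm_loglikelihood(llm, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = llm(ids, labels=ids).loss          # mean NLL over shifted tokens
    return -loss.item() * (ids.size(1) - 1)   # approximate summed log-prob

def rescore(nbest, llm, tokenizer, lam=0.3):
    # nbest: list of (hypothesis_text, first_pass_score) pairs
    return max(nbest, key=lambda h: (1 - lam) * h[1]
               + lam * llm_loglikelihood(llm, tokenizer, h[0]))
```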
arXiv Detail & Related papers (2024-09-25T06:17:23Z)
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context (see the sketch below).
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
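A hedged sketch of the dual-encoder representation extraction mentioned above, using off-the-shelf `transformers` checkpoints; the specific checkpoints, the interpolation-based length alignment, and plain channel concatenation are assumptions, not MT-LLM's actual fusion.

```python
# WavLM (speaker-sensitive) + Whisper encoder (semantic) features, concatenated.
import torch
import torch.nn.functional as F
from transformers import (AutoFeatureExtractor, WavLMModel,
                          WhisperFeatureExtractor, WhisperModel)

wavlm_fe = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
whisper_fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper = WhisperModel.from_pretrained("openai/whisper-small")

@torch.no_grad()
def dual_features(waveform, sr=16000):
    wl_in = wavlm_fe(waveform, sampling_rate=sr, return_tensors="pt")
    wl = wavlm(**wl_in).last_hidden_state                          # (1, T1, 768)
    wh_in = whisper_fe(waveform, sampling_rate=sr, return_tensors="pt")
    wh = whisper.encoder(wh_in.input_features).last_hidden_state   # (1, T2, 768)
    # Crude length alignment before channel concat; a learned fusion
    # module would replace this in a real system.
    wh = F.interpolate(wh.transpose(1, 2), size=wl.size(1)).transpose(1, 2)
    return torch.cat([wl, wh], dim=-1)                             # (1, T1, 1536)
```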
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and produces outputs in a joint embedding space shared by signed language and spoken language text; a generic sketch of such a joint-embedding objective follows this entry.
New annotations provide continuous, sign-level labels for six hours of test videos and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
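Joint embedding spaces of this kind are commonly trained with a symmetric contrastive objective; the generic InfoNCE sketch below stands in for whatever CSLR2 actually uses (the temperature and in-batch negatives are standard assumptions).

```python
# Symmetric InfoNCE over a joint sign/text embedding space; generic sketch.
import torch
import torch.nn.functional as F

def joint_embedding_loss(sign_emb, text_emb, temperature=0.07):
    sign = F.normalize(sign_emb, dim=-1)        # (B, D)
    text = F.normalize(text_emb, dim=-1)        # (B, D)
    logits = sign @ text.t() / temperature      # (B, B) similarities
    targets = torch.arange(sign.size(0), device=sign.device)
    return (F.cross_entropy(logits, targets)             # sign -> text
            + F.cross_entropy(logits.t(), targets)) / 2  # text -> sign
```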
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation [26.778332992311043]
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities.
The unified SALM achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST).
arXiv Detail & Related papers (2023-10-13T22:07:33Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)