FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing
- URL: http://arxiv.org/abs/2507.14815v1
- Date: Sun, 20 Jul 2025 04:11:06 GMT
- Title: FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing
- Authors: Shoutao Guo, Shaolei Zhang, Qingkai Fang, Zhengrui Ma, Min Zhang, Yang Feng
- Abstract summary: FastLongSpeech is designed to extend LSLM capabilities for efficient long-speech processing. It incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. Our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
- Score: 48.84039953531356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
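The abstract names two mechanisms: an iterative fusion strategy that compresses overly long speech feature sequences, and dynamic compression training that exposes the model to short speech at varying compression ratios. The sketch below is only an illustration of how such a fusion step and ratio schedule could look; the cosine-similarity merge rule, the ratio set, and all function names are assumptions, not the paper's implementation.

```python
import numpy as np

def iterative_fusion(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Compress a (T, D) sequence of speech features to at most target_len frames
    by repeatedly averaging the most similar pair of adjacent frames.
    Illustrative sketch only; the merge criterion is an assumption."""
    frames = frames.copy()
    target_len = max(1, target_len)
    while len(frames) > target_len:
        # Cosine similarity between each frame and its right neighbour.
        a, b = frames[:-1], frames[1:]
        sims = np.einsum("td,td->t", a, b) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
        )
        i = int(np.argmax(sims))  # most redundant adjacent pair
        merged = (frames[i] + frames[i + 1]) / 2
        frames = np.concatenate([frames[:i], merged[None], frames[i + 2:]])
    return frames

# Dynamic compression training (sketch): sample a compression ratio per short clip
# so the model learns to read inputs at many compression levels.
rng = np.random.default_rng(0)
short_clip = rng.standard_normal((400, 256))  # stand-in for short-speech features
ratio = int(rng.choice([1, 2, 4, 8]))         # assumed ratio schedule
compressed = iterative_fusion(short_clip, len(short_clip) // ratio)
print(compressed.shape)                       # (400 // ratio, 256)
```

In this reading, the same fusion operator would serve both roles: at training time it produces short inputs at sampled ratios, and at inference time it shrinks long recordings to a length the LSLM can handle.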
Related papers
- End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering [33.675277272634666]
CLSR is an end-to-end contrastive language-speech retriever. It efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task.
arXiv Detail & Related papers (2025-11-12T12:49:30Z)
- Extending Audio Context for Long-Form Understanding in Large Audio-Language Models [13.333718377388713]
Partial YaRN is a training-free, audio-only context extension method for large audio-language models (LALMs). VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings.
arXiv Detail & Related papers (2025-10-17T01:44:28Z)
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- SpeechOp: Inference-Time Task Composition for Generative Speech Processing [41.5053493629172]
SpeechOp is a universal speech processor capable of performing a wide range of speech tasks. Implicit Task Composition enhances SpeechOp through principled inference-time task composition.
arXiv Detail & Related papers (2025-09-17T05:05:55Z)
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of speech-text joint decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
Textless spoken language models struggle to generate plausible speech past tens of seconds. We derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models [28.253786579346432]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP). Current solutions for long context modeling often employ multi-stage continual pretraining. In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position.
arXiv Detail & Related papers (2024-12-10T04:09:29Z)
- Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation [14.746190461312036]
Large language models (LLMs) have revolutionized natural language processing (NLP).
We introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance.
We further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture.
arXiv Detail & Related papers (2024-10-27T04:28:57Z)
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM [3.6950912517562435]
We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities.
Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.
arXiv Detail & Related papers (2024-09-25T20:59:12Z)
- InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention.
Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)