FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing
- URL: http://arxiv.org/abs/2507.14815v1
- Date: Sun, 20 Jul 2025 04:11:06 GMT
- Title: FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing
- Authors: Shoutao Guo, Shaolei Zhang, Qingkai Fang, Zhengrui Ma, Min Zhang, Yang Feng
- Abstract summary: FastLongSpeech is designed to extend LSLM capabilities for efficient long-speech processing. It incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. Our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
- Score: 48.84039953531356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
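The abstract names two mechanisms: an iterative fusion strategy that compresses overly long speech feature sequences, and dynamic compression training that exposes the model to short speech at varying compression ratios. The sketch below is only an illustration of how such a fusion step and ratio schedule could look; the cosine-similarity merge rule, the ratio set, and all function names are assumptions, not the paper's implementation.

```python
import numpy as np

def iterative_fusion(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Compress a (T, D) sequence of speech features to at most target_len frames
    by repeatedly averaging the most similar pair of adjacent frames.
    Illustrative sketch only; the merge criterion is an assumption."""
    frames = frames.copy()
    target_len = max(1, target_len)
    while len(frames) > target_len:
        # Cosine similarity between each frame and its right neighbour.
        a, b = frames[:-1], frames[1:]
        sims = np.einsum("td,td->t", a, b) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
        )
        i = int(np.argmax(sims))  # most redundant adjacent pair
        merged = (frames[i] + frames[i + 1]) / 2
        frames = np.concatenate([frames[:i], merged[None], frames[i + 2:]])
    return frames

# Dynamic compression training (sketch): sample a compression ratio per short clip
# so the model learns to read inputs at many compression levels.
rng = np.random.default_rng(0)
short_clip = rng.standard_normal((400, 256))  # stand-in for short-speech features
ratio = int(rng.choice([1, 2, 4, 8]))         # assumed ratio schedule
compressed = iterative_fusion(short_clip, len(short_clip) // ratio)
print(compressed.shape)                       # (400 // ratio, 256)
```

In this reading, the same fusion operator would serve both roles: at training time it produces short inputs at sampled ratios, and at inference time it shrinks long recordings to a length the LSLM can handle.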
Related papers
- End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering [33.675277272634666]
CLSR is an end-to-end contrastive language-speech retriever. It efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task.
arXiv Detail & Related papers (2025-11-12T12:49:30Z)
- Extending Audio Context for Long-Form Understanding in Large Audio-Language Models [13.333718377388713]
Partial YaRN is a training-free, audio-only context extension method for large audio-language models (LALMs). VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings.
arXiv Detail & Related papers (2025-10-17T01:44:28Z)
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- SpeechOp: Inference-Time Task Composition for Generative Speech Processing [41.5053493629172]
SpeechOp is a universal speech processor capable of performing a wide range of speech tasks. Implicit Task Composition enhances SpeechOp through principled inference-time task composition.
arXiv Detail & Related papers (2025-09-17T05:05:55Z)
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of speech-text joint decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
Textless spoken language models struggle to generate plausible speech past tens of seconds. We derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models [28.253786579346432]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP). Current solutions for long context modeling often employ multi-stage continual pretraining. In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position.
arXiv Detail & Related papers (2024-12-10T04:09:29Z)
- Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation [14.746190461312036]
Large language models (LLMs) have revolutionized natural language processing (NLP).
We introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance.
We further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture.
arXiv Detail & Related papers (2024-10-27T04:28:57Z)
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM [3.6950912517562435]
We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities.
Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.
arXiv Detail & Related papers (2024-09-25T20:59:12Z)
- InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention.
Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)