FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation
- URL: http://arxiv.org/abs/2601.06199v1
- Date: Thu, 08 Jan 2026 07:46:03 GMT
- Title: FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation
- Authors: Junseok Lee, Sangyong Lee, Chang-Jae Chun
- Abstract summary: FastSLM is a lightweight yet efficient speech-language model (SLM) designed for effective understanding and reasoning over long-form speech. We present a novel three-stage training strategy that enhances generalization across a wide range of speech-related tasks. Experimental results demonstrate that FastSLM achieves competitive performance compared to existing state-of-the-art models.
- Score: 3.8125534288516683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models (LLMs) have demonstrated human-expert-level capabilities, driving significant interest in their potential for achieving artificial general intelligence (AGI). In particular, there is growing momentum in adapting LLMs to various modalities, including vision, video, and speech, through the development of multimodal LLMs (MLLMs). However, existing speech-language model (SLM) research has largely overlooked cost-effective adaptation strategies for leveraging LLMs in the speech domain. In this paper, we propose FastSLM, a lightweight yet efficient SLM designed for effective understanding and reasoning over long-form speech. To address the challenge of aligning high-frame-rate speech features with LLMs, we introduce the Hierarchical Frame Querying Transformer (HFQ-Former), which compresses frame-level speech features while capturing both local and global context. Furthermore, we present a novel three-stage training strategy that enhances generalization across a wide range of speech-related tasks. Experimental results demonstrate that FastSLM achieves competitive performance compared to existing state-of-the-art models, despite operating with significantly lower FLOPs and parameter counts, while representing speech with only 1.67 tokens per second. The source code and model checkpoints are available at https://huggingface.co/okestro-ai-lab/FastSLM.
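The listing gives no implementation details for the HFQ-Former, so the following PyTorch sketch only illustrates the general idea the abstract describes: learnable queries compress fixed-size windows of frame-level features (local context), and the resulting tokens then attend to each other across the whole utterance (global context). All module names, layer choices, window sizes, and query counts below are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """Learnable queries cross-attend to a set of key/value features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, queries: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        kv = self.norm_kv(feats)
        out, _ = self.attn(self.norm_q(queries), kv, kv)
        x = queries + out
        return x + self.ffn(self.norm_ffn(x))

class HierarchicalFrameQFormer(nn.Module):
    """Local stage: a few queries summarize each fixed-size window of frames.
    Global stage: the window summaries attend to each other utterance-wide."""
    def __init__(self, dim: int, window: int = 120, queries_per_window: int = 4):
        super().__init__()
        self.window = window
        self.local_queries = nn.Parameter(0.02 * torch.randn(queries_per_window, dim))
        self.local_block = QFormerBlock(dim)
        self.global_block = nn.TransformerEncoderLayer(
            dim, nhead=8, dim_feedforward=4 * dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, dim) high-frame-rate features from a speech encoder
        B, T, D = frames.shape
        pad = (-T) % self.window                         # right-pad to whole windows
        frames = nn.functional.pad(frames, (0, 0, 0, pad))
        n_win = frames.shape[1] // self.window
        wins = frames.reshape(B * n_win, self.window, D)
        q = self.local_queries.unsqueeze(0).expand(B * n_win, -1, -1)
        local = self.local_block(q, wins).reshape(B, -1, D)
        return self.global_block(local)                  # (B, n_win * queries_per_window, dim)
```

As a sanity check on the rate: with a typical 50 Hz speech encoder, 4 queries per 120-frame (2.4 s) window would give 4 / 2.4 ≈ 1.67 tokens per second, matching the rate quoted in the abstract, though the paper's actual window and query configuration is not given in this listing.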
Related papers
- Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities [20.51281468416298]
We propose two novel approaches to integrate directivity into large language models (LLMs). All of the approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner.
arXiv Detail & Related papers (2026-02-06T21:45:48Z)
- EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition [0.0]
Emotion recognition from speech is a challenging task that requires capturing both linguistic and paralinguistic cues. Recent works have highlighted the ability of Large Language Models (LLMs) to perform tasks beyond the purely textual domain. This work proposes a novel approach that fine-tunes an LLM with audio and text representations for emotion prediction.
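The summary only names the recipe; a common way to fine-tune an LLM with audio and text representations is to project audio-encoder features into the LLM's embedding space and prepend them to the text token embeddings. A minimal sketch under that assumption (all names and dimensions are hypothetical, not EmoSLLM's actual design):

```python
import torch
import torch.nn as nn

class AudioTextFusion(nn.Module):
    """Project audio features into the LLM embedding space and prepend them
    to the text token embeddings; the concatenation is fed to the LLM."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_a, audio_dim); text_embeds: (B, T_t, llm_dim)
        return torch.cat([self.proj(audio_feats), text_embeds], dim=1)
```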
arXiv Detail & Related papers (2025-08-19T06:58:16Z)
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions. We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs. Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
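The listing does not spell out the contrastive objective; a standard way to realize caption-to-caption contrastive fine-tuning is a symmetric InfoNCE loss over paired caption embeddings, sketched here as an assumption rather than LLM2CLIP's exact loss:

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: embeddings of two captions of the same image
    attract each other, while captions of different images repel."""
    a = F.normalize(emb_a, dim=-1)                       # (B, D)
    b = F.normalize(emb_b, dim=-1)                       # (B, D)
    logits = a @ b.t() / temperature                     # (B, B) similarities
    targets = torch.arange(a.shape[0], device=a.device)  # positives on diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```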
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
- Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation [14.746190461312036]
Large language models (LLMs) have revolutionized natural language processing (NLP).
We introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance.
We further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-experts architecture.
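The mixture-of-experts design is only named in this summary; a minimal softly routed expert FFN layer, with (for instance) one text expert and one speech expert, might look like the following sketch (expert count, routing, and placement are assumptions):

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Softly routed mixture of expert FFNs; a gate mixes expert outputs
    per token, e.g. balancing a text expert against a speech expert."""
    def __init__(self, dim: int, num_experts: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim)
        weights = self.gate(x).softmax(dim=-1)                 # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], -1)   # (B, T, dim, E)
        return (outs * weights.unsqueeze(2)).sum(-1)           # (B, T, dim)
```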
arXiv Detail & Related papers (2024-10-27T04:28:57Z)
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 [25.644228338604815]
We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities.
We reformulate the streamable SpeechLLM as a read-write policy problem and unify offline and streaming research within the BESTOW architecture.
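The read-write framing is easiest to see with a fixed policy such as wait-k from simultaneous translation: READ k speech chunks before the first WRITE, then alternate. BESTOW's actual policy is learned, so the sketch below (including all function names) is only an illustrative assumption:

```python
from typing import Callable, Iterable, List, Optional

def wait_k_read_write(chunks: Iterable[bytes],
                      read_fn: Callable[[bytes], None],
                      write_fn: Callable[[], Optional[str]],
                      k: int = 3) -> List[str]:
    """Toy wait-k read-write policy for a streaming speech LLM."""
    out: List[str] = []
    reads = 0
    for chunk in chunks:
        read_fn(chunk)          # READ: feed one speech chunk to the encoder
        reads += 1
        if reads >= k:          # after an initial lag of k chunks...
            token = write_fn()  # WRITE: ask the decoder for one token
            if token is not None:
                out.append(token)
    while True:                 # input exhausted: flush remaining tokens
        token = write_fn()
        if token is None:
            break
        out.append(token)
    return out
```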
arXiv Detail & Related papers (2024-06-28T14:40:03Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
Effectively encoding and understanding videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside the LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- Boosting Large Language Model for Speech Synthesis: An Empirical Study [86.89548753080432]
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision.
We conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining the pre-trained LLMs LLaMA/OPT with the text-to-speech synthesis model VALL-E.
We compare three integration methods between LLMs and speech models: directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using the LLM as a powerful text encoder.
arXiv Detail & Related papers (2023-12-30T14:20:04Z)
- SLM: Bridge the thin gap between speech and text foundation models [45.319071954143325]
Speech and Language Model (SLM) is a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models.
We show that SLM is not only efficient to train but also inherits strong capabilities already acquired in foundation models of different modalities.
arXiv Detail & Related papers (2023-09-30T02:27:45Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We explore the application of prompt tuning to stimulate speech LMs for various generation tasks within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
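Prompt tuning in this setting typically means learning a small set of continuous prompt vectors prepended to the frozen speech LM's input embeddings, one prompt per task. A minimal sketch under that assumption (names and sizes are illustrative, not SpeechGen's exact design):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to a frozen speech LM's input
    embeddings; only these vectors are updated for a downstream task."""
    def __init__(self, prompt_len: int, dim: int):
        super().__init__()
        self.prompt = nn.Parameter(0.02 * torch.randn(prompt_len, dim))

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (B, T, dim) -> (B, prompt_len + T, dim)
        B = input_embeds.shape[0]
        return torch.cat([self.prompt.unsqueeze(0).expand(B, -1, -1),
                          input_embeds], dim=1)
```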
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.