Related papers: Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

URL: http://arxiv.org/abs/2510.12116v1
Date: Tue, 14 Oct 2025 03:34:38 GMT
Title: Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models
Authors: Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou,
Abstract summary: End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities.<n>We analyze both coarse- and fine-grained text and speech representations.<n>We find that representation similarity is strongly correlated with the modality gap.
Score: 12.263637152835713
License: http://creativecommons.org/licenses/by/4.0/
Abstract: End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we analyze both coarse- and fine-grained text and speech representations. At the coarse-grained level, representations of speech and text in deeper layers are found to be increasingly aligned in direction (cosine similarity), while concurrently diverging in magnitude (Euclidean distance). We further find that representation similarity is strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, we introduce the Alignment Path Score to quantify token-level alignment quality, which exhibits stronger correlation with the modality gap. Building on these insights, we design targeted interventions on critical tokens through angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. Our study provides the first systematic empirical analysis of the modality gap and alignment mechanisms in LSLMs, offering both theoretical and methodological guidance for future optimization.

Related papers

Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs [15.914430317382077]
We analyze how speech and text representations evolve layer-by-layer.<n>We find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech.
arXiv Detail & Related papers (2026-03-02T06:21:43Z)
Closing the Modality Reasoning Gap for Speech Large Language Models [33.22455377292432]
TARS is a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories.<n>Our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
arXiv Detail & Related papers (2026-01-09T05:51:56Z)
Closing the Gap Between Text and Speech Understanding in LLMs [28.538793793887223]
Large Language Models can be adapted to extend their text capabilities to speech inputs.<n>These speech-adapted LLMs consistently underperform their text-based counterparts.<n>We introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation.
arXiv Detail & Related papers (2025-10-15T14:57:16Z)
MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance.<n>Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs [59.230858581944425]
Two dominant approaches have emerged for speech processing: discrete tokens and continuous features.<n>We compare self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings.<n>Our findings reveal that continuous features generally outperform discrete tokens in various tasks.
arXiv Detail & Related papers (2025-08-25T10:16:07Z)
What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study [58.55905182336196]
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation.<n>We investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling.<n>We introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens.
arXiv Detail & Related papers (2025-06-14T15:26:31Z)
Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model.<n>The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation [20.415410280412697]
We propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within large language models (LLMs)<n> Experimental results on speech translation tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-03-13T09:54:35Z)
SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations.<n>Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z)
Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length. We argue that consistency losses could forgive length differences and simply assume the best alignment.
arXiv Detail & Related papers (2023-08-11T13:28:48Z)
Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts. We introduce LO, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.