Closing the Modality Reasoning Gap for Speech Large Language Models
- URL: http://arxiv.org/abs/2601.05543v1
- Date: Fri, 09 Jan 2026 05:51:56 GMT
- Title: Closing the Modality Reasoning Gap for Speech Large Language Models
- Authors: Chaoren Wang, Heng Lu, Xueyao Zhang, Shujie Liu, Yan Lu, Jinyu Li, Zhizheng Wu
- Abstract summary: TARS is a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories.
Our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
- Score: 33.22455377292432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although speech large language models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
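The abstract names two dense reward signals but gives no formulas, so the following is a minimal sketch of one plausible reading. The mean pooling, cosine similarities, reward weights, and all function names are illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch of the two TARS reward signals named in the abstract.
# Cosine similarity, mean pooling, and the weights are assumptions.
import torch
import torch.nn.functional as F

def representation_alignment_reward(speech_hidden, text_hidden):
    """Layer-wise hidden-state similarity between the speech- and
    text-conditioned trajectories (lists of [seq_len, dim] tensors,
    one per Transformer layer, assumed comparable after pooling)."""
    sims = [F.cosine_similarity(h_s.mean(0), h_t.mean(0), dim=0)
            for h_s, h_t in zip(speech_hidden, text_hidden)]
    return torch.stack(sims).mean()  # dense scalar reward in [-1, 1]

def behavior_alignment_reward(generated_emb, reference_emb):
    """Semantic consistency between the generated output and the reference
    text completion, via embeddings from any sentence encoder."""
    return F.cosine_similarity(generated_emb, reference_emb, dim=0)

def tars_reward(speech_hidden, text_hidden, gen_emb, ref_emb,
                w_repr=0.5, w_behav=0.5):
    # One reading of "asymmetric": the text-conditioned trajectory serves
    # as a fixed target, and only the speech side is optimized via RL.
    return (w_repr * representation_alignment_reward(speech_hidden, text_hidden)
            + w_behav * behavior_alignment_reward(gen_emb, ref_emb))
```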
Related papers
- Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs [15.914430317382077]
We analyze how speech and text representations evolve layer-by-layer.
We find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech.
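As a concrete illustration of this kind of layer-wise analysis, the sketch below computes a cross-layer cosine-similarity map between speech and text hidden states; a broad off-diagonal band of high values would correspond to the alignment band the summary mentions. The pooling and normalization choices are assumptions.

```python
# Cross-layer similarity map between speech and text hidden states.
import torch
import torch.nn.functional as F

def cross_layer_similarity(speech_layers, text_layers):
    """speech_layers, text_layers: lists of [seq_len, dim] tensors."""
    s = torch.stack([h.mean(0) for h in speech_layers])  # [L_s, dim]
    t = torch.stack([h.mean(0) for h in text_layers])    # [L_t, dim]
    s = F.normalize(s, dim=-1)
    t = F.normalize(t, dim=-1)
    return s @ t.T  # [L_s, L_t]; visualize as a heatmap to see the band
```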
arXiv Detail & Related papers (2026-03-02T06:21:43Z)
- TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding [15.908533215017059]
We present TagSpeech, a unified framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization.
The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that acts as a synchronization signal between semantic understanding and speaker tracking.
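As a rough illustration of an interleaved time-anchor serialization, the sketch below builds a target token stream in start-time order (SOT-style). The tag formats and the function are invented for illustration; TagSpeech's actual serialization may differ.

```python
# Hypothetical interleaved time-anchor target sequence for joint
# ASR + diarization; tag formats (<t=...>, <spk...>) are invented.
def serialize(utterances):
    """utterances: list of (start_sec, speaker_id, text), sorted by start."""
    tokens = []
    for start, spk, text in utterances:
        tokens.append(f"<t={start:.1f}>")  # time anchor: sync signal
        tokens.append(f"<spk{spk}>")       # speaker stream
        tokens.extend(text.split())        # semantic stream
    return tokens

# Two overlapping speakers, serialized in start-time order:
print(serialize([(0.0, 1, "hello there"), (1.3, 2, "hi")]))
# ['<t=0.0>', '<spk1>', 'hello', 'there', '<t=1.3>', '<spk2>', 'hi']
```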
arXiv Detail & Related papers (2026-01-11T12:40:07Z)
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution.
RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
- Closing the Gap Between Text and Speech Understanding in LLMs [28.538793793887223]
Large Language Models can be adapted to extend their text capabilities to speech inputs.
These speech-adapted LLMs consistently underperform their text-based counterparts.
We introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation.
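A hedged sketch of what cross-modal distillation with active selection could look like: a text-conditioned teacher distilled into a speech-conditioned student, with training focused on the samples where they disagree most. The KL objective, temperature, and selection rule are assumptions, not SALAD's published recipe.

```python
# Teacher-student cross-modal distillation with disagreement-based
# sample selection; objective and selection rule are assumptions.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL between softened teacher and student next-token distributions."""
    log_p_t = F.log_softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, log_p_t, log_target=True,
                    reduction="batchmean") * T * T

def select_batch(disagreements, k):
    """Active selection: keep the k samples where the speech student
    deviates most from the text teacher."""
    return torch.topk(disagreements, k).indices
```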
arXiv Detail & Related papers (2025-10-15T14:57:16Z)
- Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models [12.263637152835713]
End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities.
We analyze both coarse- and fine-grained text and speech representations.
We find that representation similarity is strongly correlated with the modality gap.
arXiv Detail & Related papers (2025-10-14T03:34:38Z)
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance.
Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood.
We propose enhancing interpretability by leveraging intra-modal token interactions.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
- Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models [80.75260664100644]
Mini-Omni-Reasoner is a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation.
It interleaves silent reasoning tokens with spoken response tokens at the token level.
It achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
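A schematic sketch of token-level interleaving in this spirit: a fixed number of silent reasoning tokens precede each spoken token, so speech can stream without a separate think-then-speak phase. The 2:1 ratio and all names are illustrative assumptions.

```python
# Schematic "Thinking-in-Speaking" interleaving; silent reasoning tokens
# are emitted between spoken tokens and never vocalized.
def interleave(think_tokens, speak_tokens, think_per_speak=2):
    out, t = [], 0
    for spoken in speak_tokens:
        # A few silent reasoning tokens first (skipped by the vocoder)...
        out.extend(think_tokens[t:t + think_per_speak])
        t += think_per_speak
        # ...then one spoken token, rendered to audio immediately.
        out.append(spoken)
    return out
```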
arXiv Detail & Related papers (2025-08-18T15:14:04Z)
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model.
The choice of speech-text joint decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
- Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation [20.415410280412697]
We propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within large language models (LLMs).
Experimental results on speech translation tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches.
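A minimal sketch of what aligning representations only at selected layers could look like as a training loss. The MSE objective and the layer indices are assumptions; AI-STA's actual alignment criterion and layer-selection rule may differ.

```python
# Selected-layer alignment loss; objective and layer choice are assumptions.
import torch.nn.functional as F

def selected_layer_alignment_loss(speech_hidden, text_hidden,
                                  layers=(8, 16, 24)):
    """speech_hidden, text_hidden: lists of [seq_len, dim] tensors per
    layer, assumed pooled to a comparable form."""
    return sum(F.mse_loss(speech_hidden[l].mean(0), text_hidden[l].mean(0))
               for l in layers) / len(layers)
```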
arXiv Detail & Related papers (2025-03-13T09:54:35Z)
- Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length.
We argue that consistency losses could forgive length differences and simply assume the best alignment.
arXiv Detail & Related papers (2023-08-11T13:28:48Z)
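A hedged sketch of a consistency loss that forgives length differences by assuming the best alignment: each text frame is matched to its nearest speech frame, so no explicit alignment is required. Nearest-neighbour matching is one plausible reading of "best alignment", not the paper's exact loss.

```python
# Length-agnostic consistency loss under a best-alignment assumption.
import torch

def best_alignment_consistency(speech, text):
    """speech: [T_s, dim], text: [T_t, dim]; lengths may differ."""
    dists = torch.cdist(text, speech)      # [T_t, T_s] pairwise distances
    return dists.min(dim=1).values.mean()  # best match per text frame
```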