CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation
- URL: http://arxiv.org/abs/2601.16547v1
- Date: Fri, 23 Jan 2026 08:31:24 GMT
- Title: CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation
- Authors: Jing Hu, Danxiang Zhu, Xianlong Luo, Dan Zhang, Shuwei He, Yishu Lei, Haitao Zheng, Shikun Feng, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang
- Abstract summary: We propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning.
- Score: 32.72685791637924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens. At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO). Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially bridges the audio-text performance gap with only 80k synthetic training samples, validating the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.
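The token-level objective lends itself to a compact implementation. Below is a minimal sketch of the weighted on-policy reverse KL described in the abstract, assuming an exponential position decay plus an optional per-token criticality weight as the "importance-aware weighting"; the function name, the decay form, and the crit_weights interface are illustrative rather than the paper's exact formulation, and the sequence-level GRPO reward is omitted.

```python
import torch
import torch.nn.functional as F

def token_level_cord_loss(audio_logits, text_logits, decay=0.95, crit_weights=None):
    """Weighted on-policy reverse KL from the audio-conditioned (student)
    next-token distribution to the text-conditioned (teacher) one.

    audio_logits, text_logits: [T, V] logits over the same rollout positions.
    decay: exponential position decay emphasising early tokens (assumed form).
    crit_weights: optional [T] semantic-criticality weights (assumed interface).
    """
    log_q = F.log_softmax(audio_logits, dim=-1)          # student: audio-conditioned
    log_p = F.log_softmax(text_logits, dim=-1).detach()  # teacher: text-conditioned, frozen
    # Per-position reverse KL: sum_v q(v) * (log q(v) - log p(v))
    rkl = (log_q.exp() * (log_q - log_p)).sum(dim=-1)    # [T]
    w = decay ** torch.arange(rkl.size(0), dtype=rkl.dtype, device=rkl.device)
    if crit_weights is not None:
        w = w * crit_weights                             # boost semantically critical tokens
    return (w * rkl).sum() / w.sum()
```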
Related papers
- Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs [15.914430317382077]
We analyze how speech and text representations evolve layer-by-layer. We find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech.
arXiv Detail & Related papers (2026-03-02T06:21:43Z)
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong semantic and spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z)
- PACE: Pretrained Audio Continual Learning [27.605574463021693]
We present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs). In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines.
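The summary does not specify how the boundary-aware perturbations are constructed; one plausible reading, sketched below under that assumption, adds noise to a log-mel spectrogram with magnitude peaking near annotated boundary frames. The function name, Gaussian envelope, and default parameters are all illustrative.

```python
import torch

def boundary_aware_perturb(spec, boundaries, width=8.0, scale=0.1):
    """Additive noise on a log-mel spectrogram whose magnitude peaks near
    given boundary frames (one plausible reading of 'boundary-aware').

    spec: [n_mels, T] log-mel spectrogram; boundaries: boundary frame indices.
    """
    T = spec.size(1)
    t = torch.arange(T, dtype=spec.dtype, device=spec.device)
    env = torch.zeros(T, dtype=spec.dtype, device=spec.device)
    for b in boundaries:
        # Gaussian bump of noise strength centred on each boundary frame
        env = torch.maximum(env, torch.exp(-((t - b) ** 2) / (2 * width ** 2)))
    return spec + torch.randn_like(spec) * scale * env  # envelope broadcasts over mel bins
```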
arXiv Detail & Related papers (2026-02-03T10:28:35Z)
- Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z)
- Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting [13.48022380380599]
We propose a joint multimodal contrastive learning framework that unifies acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations, and (ii) audio-audio contrastive learning, via a Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS.
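The audio-text term attributed here to a CLAP-style loss is the standard symmetric InfoNCE over matched pairs; a minimal sketch follows, with the temperature value assumed and the audio-audio DWD term omitted since its exact form is not given in this summary.

```python
import torch
import torch.nn.functional as F

def clap_style_loss(audio_emb, text_emb, temp=0.07):
    """Symmetric InfoNCE over matched audio/text pairs (CLAP-style term only).

    audio_emb, text_emb: [B, D]; row i of each encodes the same pair.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temp                             # [B, B] cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```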
arXiv Detail & Related papers (2025-12-16T05:58:25Z)
- DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model [65.93900011975238]
DELULU is a speaker-aware self-supervised foundational model for verification, diarization, and profiling applications. It is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
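A dual objective of this kind is commonly a weighted sum of a masked-prediction term and a denoising term; the sketch below assumes discrete target units for the masked term, an L1 denoising term, and an alpha mixing weight, none of which are confirmed by the summary.

```python
import torch
import torch.nn.functional as F

def dual_objective(pred_masked, target_units, pred_denoised, clean_feats, mask, alpha=0.5):
    """Masked prediction + denoising, combined with an assumed weight alpha.

    pred_masked: [T, U] logits over discrete latent units; target_units: [T] unit ids;
    pred_denoised, clean_feats: [T, D] features; mask: [T] bool, True at masked frames.
    """
    mp = F.cross_entropy(pred_masked[mask], target_units[mask])  # predict units at masked frames
    dn = F.l1_loss(pred_denoised, clean_feats)                   # reconstruct clean features
    return alpha * mp + (1 - alpha) * dn
```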
arXiv Detail & Related papers (2025-10-20T15:35:55Z)
- Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models [12.263637152835713]
End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities. We analyze both coarse- and fine-grained text and speech representations. We find that representation similarity is strongly correlated with the modality gap.
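A minimal probe of the correlation this paper reports can be computed by comparing pooled hidden states per layer; the sketch below assumes mean pooling over time and cosine similarity, both of which are illustrative choices rather than the paper's exact analysis.

```python
import torch
import torch.nn.functional as F

def layerwise_similarity(speech_states, text_states):
    """Mean cosine similarity between pooled speech and text hidden states, per layer.

    speech_states, text_states: lists of [B, T, D] hidden states, one per layer.
    Returns one similarity value per layer.
    """
    sims = []
    for hs, ht in zip(speech_states, text_states):
        s = F.normalize(hs.mean(dim=1), dim=-1)  # mean-pool over time, then unit-normalize
        t = F.normalize(ht.mean(dim=1), dim=-1)
        sims.append((s * t).sum(dim=-1).mean().item())
    return sims
```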
arXiv Detail & Related papers (2025-10-14T03:34:38Z)
- When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models [18.160420407067743]
MCR-BENCH is the first benchmark designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. We reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications.
arXiv Detail & Related papers (2025-08-21T09:58:24Z)
- AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z)
- Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z)
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
- A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions [8.717610965852037]
We propose a novel training paradigm to generate diverse responses of a given proficiency level. We convert responses into synthesized speech via speaker-aware text-to-speech synthesis. A multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly.
arXiv Detail & Related papers (2025-06-04T15:42:53Z)
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR). MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation [20.415410280412697]
We propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within large language models (LLMs). Experimental results on speech translation tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches.
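Aligning representations "at selected layers" can be sketched as a layer-indexed alignment loss; the MSE distance, uniform layer weights, and the assumption that speech and text hidden states have been length-matched (e.g., by an adaptor) are all illustrative, since the summary does not specify the distance used.

```python
import torch.nn.functional as F

def selected_layer_alignment(speech_states, text_states, layers, weights=None):
    """Alignment loss applied only at selected layers.

    speech_states, text_states: lists of [B, T, D] hidden states per layer,
    assumed already length-matched across modalities.
    layers: indices of layers chosen for alignment.
    """
    if weights is None:
        weights = [1.0 / len(layers)] * len(layers)  # uniform weighting (assumed)
    loss = 0.0
    for w, l in zip(weights, layers):
        loss = loss + w * F.mse_loss(speech_states[l], text_states[l])
    return loss
```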
arXiv Detail & Related papers (2025-03-13T09:54:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.