Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation
- URL: http://arxiv.org/abs/2509.18579v1
- Date: Tue, 23 Sep 2025 02:58:16 GMT
- Title: Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation
- Authors: Runyan Yang, Yuke Si, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
- Abstract summary: We propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual model to a student audio model. Our method introduces two key dimensions: source-wise distillation and layer-wise distillation. Experimental results show significant improvements in audio reasoning performance.
- Score: 52.537908557508324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio model while preserving its acoustic competence. Our method introduces two key dimensions: source-wise distillation, which leverages both textual and acoustic teachers to provide complementary modality-specific supervision; and layer-wise distillation, which aligns teacher signals with appropriate student layers to improve transfer efficiency. This dual-dimensional strategy enables fine-grained control over the distillation process, effectively bridging the gap between symbolic reasoning and speech representations. Experimental results show significant improvements in audio reasoning performance, demonstrating the effectiveness of our framework as a reasoning transfer solution for audio modeling.
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture.
Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z)
- Cross-Modal Knowledge Distillation for Speech Large Language Models [10.840179376551804]
We show that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual.
We propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM.
arXiv Detail & Related papers (2025-09-18T13:07:53Z)
- AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation.
AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z)
- Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs [47.325269852330884]
We develop a strategy to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations.
We introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training.
Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.
arXiv Detail & Related papers (2025-06-07T14:52:58Z)
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation.
This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks.
Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
- Multi-Stage Speaker Diarization for Noisy Classrooms [1.4549461207028445]
This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline.
We assess the impact of denoising on diarization accuracy and compare various voice activity detection models.
We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions.
arXiv Detail & Related papers (2025-05-16T05:35:06Z)
- Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture.
This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)