Cross-Modal Knowledge Distillation for Speech Large Language Models
- URL: http://arxiv.org/abs/2509.14930v1
- Date: Thu, 18 Sep 2025 13:07:53 GMT
- Title: Cross-Modal Knowledge Distillation for Speech Large Language Models
- Authors: Enzhi Wang, Qicheng Li, Zhiyuan Tang, Yuhang Jia
- Abstract summary: We show that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual. We propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM.
- Score: 10.840179376551804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and performance further decreases with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM. Extensive experiments on dialogue and audio understanding tasks validate the effectiveness of our approach in preserving textual knowledge, improving cross-modal alignment, and enhancing reasoning in speech-based interactions.
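The two-channel distillation objective described in the abstract can be sketched as a weighted sum of two KL-divergence terms: one matching the student's text-input predictions to the teacher, and one matching its speech-input predictions. This is a minimal illustrative sketch; the function names, temperature scaling, and `alpha` weighting below are assumptions for exposition, not the paper's actual implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_modal_kd_loss(teacher_logits, student_text_logits,
                        student_speech_logits, temperature=2.0, alpha=0.5):
    """Combine text-to-text and speech-to-text distillation terms.

    The text-based teacher provides the target distribution; the speech
    LLM student is penalised for diverging from it on both its
    text-input (T2T channel) and speech-input (S2T channel) predictions.
    """
    p_teacher = softmax(teacher_logits, temperature)
    loss_t2t = kl_divergence(p_teacher, softmax(student_text_logits, temperature))
    loss_s2t = kl_divergence(p_teacher, softmax(student_speech_logits, temperature))
    return alpha * loss_t2t + (1 - alpha) * loss_s2t
```

Under this sketch, a student whose predictions already match the teacher on both channels incurs zero loss, while divergence on either channel raises it; in practice the loss would be computed per token over softmax logits from the respective models.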
Related papers
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation [52.537908557508324]
We propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual model to a student audio model. Our method introduces two key dimensions: source-wise distillation and layer-wise distillation. Experimental results show significant improvements in audio reasoning performance.
arXiv Detail & Related papers (2025-09-23T02:58:16Z)
- Towards Developmentally Plausible Rewards: Communicative Success as a Learning Signal for Interactive Language Models [49.22720751953838]
We propose a method for training language models in an interactive setting inspired by child language acquisition. In our setting, a speaker attempts to communicate some information to a listener in a single-turn dialogue and receives a reward if communicative success is achieved.
arXiv Detail & Related papers (2025-05-09T11:48:36Z)
- Linguistic Knowledge Transfer Learning for Speech Enhancement [29.191204225828354]
Linguistic knowledge plays a crucial role in spoken language comprehension. Most speech enhancement methods rely on acoustic features to learn the mapping relationship between noisy and clean speech. We propose the Cross-Modality Knowledge Transfer (CMKT) learning framework to integrate linguistic knowledge into speech enhancement models.
arXiv Detail & Related papers (2025-03-10T09:00:18Z)
- Towards Harnessing Large Language Models for Comprehension of Conversational Grounding [1.8434042562191812]
This study investigates the capabilities of large language models in classifying dialogue turns related to explicit or implicit grounding and predicting grounded knowledge elements.
Our experimental results reveal challenges encountered by large language models in the two tasks.
These initiatives aim to develop more effective dialogue systems that are better equipped to handle the intricacies of grounded knowledge in conversations.
arXiv Detail & Related papers (2024-06-03T19:34:39Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- A Bag of Tricks for Dialogue Summarization [7.7837843673493685]
We explore four different challenges of the task: handling and differentiating parts of the dialogue belonging to multiple speakers, negation understanding, reasoning about the situation, and informal language understanding.
Using a pretrained sequence-to-sequence language model, we explore speaker name substitution, negation scope highlighting, multi-task learning with relevant tasks, and pretraining on in-domain data.
arXiv Detail & Related papers (2021-09-16T21:32:02Z)
- Structural Pre-training for Dialogue Comprehension [51.215629336320305]
We present SPIDER, Structural Pre-traIned DialoguE Reader, to capture dialogue exclusive features.
To simulate the dialogue-like features, we propose two training objectives in addition to the original LM objectives.
Experimental results on widely used dialogue benchmarks verify the effectiveness of the newly introduced self-supervised tasks.
arXiv Detail & Related papers (2021-05-23T15:16:54Z)
- Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters [52.725200145600624]
We propose KnowExpert to bypass the retrieval process by injecting prior knowledge into the pre-trained language models with lightweight adapters.
Experimental results show that KnowExpert performs comparably with the retrieval-based baselines.
arXiv Detail & Related papers (2021-05-13T12:33:23Z)
- Multi-turn Dialogue Reading Comprehension with Pivot Turns and Knowledge [43.352833140317486]
Multi-turn dialogue reading comprehension aims to teach machines to read dialogue contexts and solve tasks such as response selection and answering questions.
This work makes the first attempt to tackle the above two challenges by extracting substantially important turns as pivot utterances.
We propose a pivot-oriented deep selection model (PoDS) on top of the Transformer-based language models for dialogue comprehension.
arXiv Detail & Related papers (2021-02-10T15:00:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.