Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs
- URL: http://arxiv.org/abs/2602.01064v1
- Date: Sun, 01 Feb 2026 07:19:57 GMT
- Title: Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs
- Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuo Yang, Chu Yuan Zhang, Jianhua Tao
- Abstract summary: We introduce the concept of Knowledge Purification, which consolidates the rationales from multiple teacher LLMs into a single rationale. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts.
- Score: 30.543860587982895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of \textbf{Knowledge Purification}, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.
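Since the abstract describes knowledge purification only at a high level, the following is a minimal, hypothetical sketch of the general idea: collapsing several teacher rationales into a single rationale before distillation. The majority-vote filter and length tie-break below are placeholder heuristics for illustration, not any of the five purification methods proposed in the paper.

```python
# Hypothetical sketch of rationale purification across multiple teacher LLMs.
# The scoring heuristic (agreement with the majority answer, then shortest
# rationale) is illustrative only, not one of the paper's methods.
from collections import Counter
from dataclasses import dataclass


@dataclass
class TeacherOutput:
    rationale: str  # chain-of-thought text produced by one teacher LLM
    answer: str     # final answer extracted from that rationale


def purify(teacher_outputs: list[TeacherOutput]) -> str:
    """Consolidate several teacher rationales into a single rationale."""
    # Majority vote over final answers identifies the consensus.
    majority_answer, _ = Counter(o.answer for o in teacher_outputs).most_common(1)[0]
    # Discard rationales that disagree with the consensus answer,
    # removing the most obvious knowledge conflicts between teachers.
    consistent = [o for o in teacher_outputs if o.answer == majority_answer]
    # Tie-break: prefer the shortest consistent rationale (cheapest to distill).
    return min(consistent, key=lambda o: len(o.rationale)).rationale


if __name__ == "__main__":
    outputs = [
        TeacherOutput("12 * 3 = 36, so the answer is 36.", "36"),
        TeacherOutput("12 + 3 = 15, so the answer is 15.", "15"),
        TeacherOutput("Three dozen is 36; answer 36.", "36"),
    ]
    print(purify(outputs))  # -> Three dozen is 36; answer 36.
```

The purified rationale would then serve as the single distillation target for the student model, instead of training against every teacher's (possibly conflicting) rationale.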
Related papers
- Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation [63.302074484672424]
We propose a pedagogically-inspired framework for knowledge distillation. Our approach identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. Our framework particularly excels in complex reasoning tasks, showing a 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-12T17:00:36Z) - Efficient Knowledge Injection in LLMs via Self-Distillation [50.24554628642021]
This paper proposes utilizing prompt distillation to internalize new factual knowledge from free-form documents. We show that prompt distillation outperforms standard supervised fine-tuning and can even surpass RAG.
arXiv Detail & Related papers (2024-12-19T15:44:01Z) - MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models [4.404914701832396]
We propose MLKD-BERT, a novel knowledge distillation method that distills multi-level knowledge in a teacher-student framework.
Our method outperforms state-of-the-art knowledge distillation methods on BERT.
In addition, MLKD-BERT can flexibly set the number of student attention heads, allowing for a substantial decrease in inference time with little performance drop.
arXiv Detail & Related papers (2024-07-03T03:03:30Z) - Efficient Multi-Model Fusion with Adversarial Complementary Representation Learning [26.393644289860084]
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification.
We propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge.
arXiv Detail & Related papers (2024-04-24T07:47:55Z) - Knowledge Distillation for Road Detection based on cross-model Semi-Supervised Learning [17.690698736544626]
We propose an integrated approach that combines knowledge distillation and semi-supervised learning methods.
This hybrid approach leverages the robust capabilities of large models to effectively utilise large amounts of unlabelled data.
The proposed semi-supervised learning-based knowledge distillation (SSLKD) approach demonstrates a notable improvement in the performance of the student model.
arXiv Detail & Related papers (2024-02-07T22:50:47Z) - MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z) - A Closer Look at Knowledge Distillation with Features, Logits, and Gradients [81.39206923719455]
Knowledge distillation (KD) is a substantial strategy for transferring learned knowledge from one neural network model to another.
This work provides a new perspective to motivate a set of knowledge distillation strategies by approximating the classical KL-divergence criteria with different knowledge sources.
Our analysis indicates that logits are generally a more efficient knowledge source and suggests that having sufficient feature dimensions is crucial for the model design.
arXiv Detail & Related papers (2022-03-18T21:26:55Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - On the Orthogonality of Knowledge Distillation with Other Techniques: From an Ensemble Perspective [34.494730096460636]
We show that knowledge distillation is a powerful apparatus for the practical deployment of efficient neural networks.
We also introduce ways to integrate knowledge distillation with other methods effectively.
arXiv Detail & Related papers (2020-09-09T06:14:59Z) - Transfer Heterogeneous Knowledge Among Peer-to-Peer Teammates: A Model Distillation Approach [55.83558520598304]
We propose a brand new solution to reuse experiences and transfer value functions among multiple students via model distillation.
We also describe how to design an efficient communication protocol to exploit heterogeneous knowledge.
Our proposed framework, namely Learning and Teaching Categorical Reinforcement, shows promising performance in stabilizing and accelerating learning progress.
arXiv Detail & Related papers (2020-02-06T11:31:04Z) - Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification [106.08067870620218]
We propose a self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME).
We refer to these models as 'Experts', and the proposed LFME framework aggregates the knowledge from multiple 'Experts' to learn a unified student model.
We conduct extensive experiments and demonstrate that our method is able to achieve superior performances compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-01-06T12:57:36Z)
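For context, both LFME and the paper above build on the standard multi-teacher distillation objective, in which a student matches a softened target distribution aggregated from several teachers. The PyTorch sketch below shows that generic baseline (simple logit averaging); it is not the purification approach proposed in the paper.

```python
# Generic multi-teacher soft-label distillation loss (illustrative baseline).
import torch
import torch.nn.functional as F


def multi_teacher_kd_loss(student_logits: torch.Tensor,
                          teacher_logits: list[torch.Tensor],
                          temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between the student and the averaged teacher distribution."""
    # Average the teachers' softened distributions into a single target.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2


# Example: a batch of 4 examples, 10 classes, 3 teachers.
student = torch.randn(4, 10, requires_grad=True)
teachers = [torch.randn(4, 10) for _ in range(3)]
loss = multi_teacher_kd_loss(student, teachers)
loss.backward()
```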