Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation
- URL: http://arxiv.org/abs/2602.12172v1
- Date: Thu, 12 Feb 2026 17:00:36 GMT
- Title: Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation
- Authors: Bowei He, Yankai Chen, Xiaokun Zhang, Linghe Kong, Philip S. Yu, Xue Liu, Chen Ma
- Abstract summary: We propose a pedagogically-inspired framework for knowledge distillation. Our approach identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. Our framework particularly excels in complex reasoning tasks, showing 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.
- Score: 63.302074484672424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws from fundamental educational principles. Our approach introduces a three-stage pipeline -- Knowledge Identifier, Organizer, and Adapter (IOA) -- that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process where student models approach the teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.
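The mastery-gated process described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the integer percentage scores, the prerequisite map, and the placeholder "training" update are all assumptions made for illustration. It only shows the control flow of identifying gaps, ordering them so prerequisites come first, and raising difficulty in small steps until the student reaches a chosen fraction of the teacher's score.

```python
from graphlib import TopologicalSorter


def identify_gaps(student, teacher, mastery_ratio):
    """Identifier (toy): topics where the student scores below mastery_ratio of the teacher."""
    return {t for t, ts in teacher.items() if student.get(t, 0) < mastery_ratio * ts}


def organize(gaps, teacher, prerequisites):
    """Organizer (toy): order deficient topics so that prerequisites come first."""
    graph = {t: prerequisites.get(t, set()) for t in teacher}
    return [t for t in TopologicalSorter(graph).static_order() if t in gaps]


def distill(student, teacher, prerequisites, mastery_ratio=0.9, step=10, max_rounds=100):
    """Mastery-gated loop: stay on a topic until it is mastered, then advance.

    Scores are hypothetical integer percentages. The update inside the loop is a
    placeholder that assumes the student fully absorbs each small increment.
    """
    student = dict(student)  # do not mutate the caller's scores
    log = []
    for topic in organize(identify_gaps(student, teacher, mastery_ratio), teacher, prerequisites):
        target = mastery_ratio * teacher[topic]
        rounds = 0
        while student.get(topic, 0) < target and rounds < max_rounds:
            # Adapter (toy): present material only slightly above current ability,
            # never harder than what the teacher itself achieves.
            difficulty = min(student.get(topic, 0) + step, teacher[topic])
            student[topic] = difficulty  # placeholder for an actual training round
            rounds += 1
        log.append((topic, rounds))
    return student, log


if __name__ == "__main__":
    teacher = {"arithmetic": 90, "algebra": 80}
    student = {"arithmetic": 50, "algebra": 20}
    prerequisites = {"algebra": {"arithmetic"}}  # algebra depends on arithmetic
    final_scores, log = distill(student, teacher, prerequisites)
    print(final_scores)
    print(log)
```

In a real pipeline the placeholder update would be replaced by synthesizing teacher data at the chosen difficulty and fine-tuning the student on it; the gating structure is the only part this sketch is meant to convey.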
Related papers
- Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability [3.224880576815583]
High computational and storage demands of Large Language Models limit their deployment in resource-constrained environments. Previous research has introduced several distillation methods for both generating training data and for training the student model. Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated.
arXiv Detail & Related papers (2025-04-22T17:32:48Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that the student distills from task-relevant representations only. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - Efficient Knowledge Injection in LLMs via Self-Distillation [50.24554628642021]
This paper proposes utilizing prompt distillation to internalize new factual knowledge from free-form documents. We show that prompt distillation outperforms standard supervised fine-tuning and can even surpass RAG.
arXiv Detail & Related papers (2024-12-19T15:44:01Z) - Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation [17.27883003990266]
Vision-and-Language Navigation (VLN) is a core task in Embodied AI.
This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN.
Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model.
arXiv Detail & Related papers (2024-09-27T14:54:54Z) - Learning to Maximize Mutual Information for Chain-of-Thought Distillation [13.660167848386806]
Distilling Step-by-Step (DSS) has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts.
However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction.
We propose a variational approach to solve this problem using a learning-based method.
arXiv Detail & Related papers (2024-03-05T22:21:45Z) - Knowledge Distillation for Road Detection based on cross-model Semi-Supervised Learning [17.690698736544626]
We propose an integrated approach that combines knowledge distillation and semi-supervised learning methods.
This hybrid approach leverages the robust capabilities of large models to effectively utilise large unlabelled data.
The proposed semi-supervised learning-based knowledge distillation (SSLKD) approach demonstrates a notable improvement in the performance of the student model.
arXiv Detail & Related papers (2024-02-07T22:50:47Z) - Learning Lightweight Object Detectors via Multi-Teacher Progressive
Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z) - AD-KD: Attribution-Driven Knowledge Distillation for Language Model
Compression [26.474962405945316]
We present a novel attribution-driven knowledge distillation approach to compress pre-trained language models.
To enhance the knowledge transfer of model reasoning and generalization, we explore multi-view attribution distillation on all potential decisions of the teacher.
arXiv Detail & Related papers (2023-05-17T07:40:12Z) - Extracting knowledge from features with multilevel abstraction [3.4443503349903124]
Self-knowledge distillation (SKD) aims at transferring the knowledge from a large teacher model to a small student model.
In this paper, we propose a novel SKD method that differs from mainstream methods.
Experiments and ablation studies show its great effectiveness and generalization on various kinds of tasks.
arXiv Detail & Related papers (2021-12-04T02:25:46Z) - Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.