Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
- URL: http://arxiv.org/abs/2510.10925v1
- Date: Mon, 13 Oct 2025 02:36:36 GMT
- Title: Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
- Authors: Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang, Chaofan Tao, Jing Xiong, Hayden Kwok-Hay So, Ruobing Xie, Angel X. Chang, Ngai Wong
- Abstract summary: PerSyn operates under a new ``Route then Generate'' paradigm to create data tailored to each student model. Experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance.
- Score: 47.814833568523255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training student models on synthetic data generated by strong teacher models is a promising way to distill the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new ``Route then Generate'' paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional ``Generate then Select'' paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruct tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.
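To make the routing idea concrete, here is a minimal sketch of the ``Route then Generate'' paradigm described in the abstract. The scoring callables, the weighting `alpha`, and the `generate` interface are hypothetical placeholders, not the paper's implementation:

```python
from collections import defaultdict

def route_then_generate(prompts, teachers, learnability, quality, alpha=0.5):
    """Route each prompt to one teacher, then generate only for assigned
    prompts (vs. 'Generate then Select', where every teacher answers all)."""
    assignments = defaultdict(list)
    for prompt in prompts:
        # Router score: jointly weigh student learnability and teacher
        # response quality for this prompt.
        best = max(
            teachers,
            key=lambda t: alpha * learnability(prompt, t)
            + (1 - alpha) * quality(prompt, t),
        )
        assignments[best].append(prompt)
    # Each teacher synthesizes responses only for its own prompts.
    return [(p, t.generate(p)) for t, ps in assignments.items() for p in ps]
```

Because every prompt is answered exactly once, the synthesis cost is one generation per prompt, versus one generation per prompt per teacher under ``Generate then Select''.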
Related papers
- A Novel Approach To Implementing Knowledge Distillation In Tsetlin Machines [0.0]
The Tsetlin Machine (TM) is a propositional logic-based model that uses conjunctive clauses to learn patterns from data. We propose a novel approach to implementing knowledge distillation in Tsetlin Machines by utilizing the probability distributions of each output sample in the teacher. We find that our algorithm can significantly improve performance in the student model without negatively impacting latency in the tested domains of image recognition and text classification.
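As a rough illustration of distilling from per-sample output distributions, here is a generic soft-target construction. This is a hedged sketch only: Tsetlin Machines vote with clauses rather than producing logits, so `class_scores` stands in for whatever per-class evidence the teacher exposes.

```python
import numpy as np

def soft_targets(class_scores, temperature=2.0):
    """Convert per-class teacher scores of shape (n_samples, n_classes)
    into softened probability distributions for the student to mimic."""
    z = np.asarray(class_scores, dtype=float) / temperature
    z -= z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)
```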
arXiv Detail & Related papers (2025-04-02T15:06:27Z)
- Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning [18.5518735004289]
We propose Montessori-Instruct, a novel data synthesis framework that tailors the data synthesis ability of the teacher language model toward the student language model's learning process.
Experiments show that Montessori-Instruct significantly outperforms standard synthesis methods, with relative improvements of 18.35% and 46.24%.
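A hedged sketch of the tailoring idea: score each teacher-synthesized candidate by its estimated influence on the student (how much a single update on it reduces loss on a reference set) and keep the most useful ones. The helper names are hypothetical, not Montessori-Instruct's actual pipeline.

```python
def select_by_influence(candidates, student, train_one_step, eval_loss, ref_set, k):
    """Keep the k synthesized examples whose single-step update most
    reduces the student's loss on a held-out reference set."""
    base = eval_loss(student, ref_set)
    scored = []
    for example in candidates:
        probe = train_one_step(student, example)  # copy of student after one update
        scored.append((base - eval_loss(probe, ref_set), example))  # loss drop = influence
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored[:k]]
```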
arXiv Detail & Related papers (2024-10-18T06:50:15Z)
- SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
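The mechanism, in outline, is to ground each synthetic example in a retrieved document rather than sampling from the generator's prior alone. Below is a minimal sketch with assumed `retriever.search` and `llm.generate` interfaces:

```python
def synthesize_with_retrieval(task, retriever, llm, n_docs=3):
    """Ground each synthetic example in a retrieved document so the
    outputs inherit the corpus's lexical and semantic diversity."""
    examples = []
    for doc in retriever.search(task, k=n_docs):
        prompt = f"Rewrite the following document as a {task} example:\n{doc}"
        examples.append(llm.generate(prompt))
    return examples
```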
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
- YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework.
It emulates the teacher-student education process to improve the efficacy of model fine-tuning.
Experiments show that training LLaMA2 with data from YODA yields a significant performance gain over standard SFT.
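As a loose illustration only (the actual YODA procedure is more elaborate), one round of such a teacher-student interaction might look like the following, with `critique` and `refine` as hypothetical interfaces:

```python
def progressive_round(question, student, teacher, max_turns=2):
    """One teacher-guided refinement round; the final answer becomes
    a fine-tuning target for the student."""
    answer = student.generate(question)
    for _ in range(max_turns):
        feedback = teacher.critique(question, answer)  # teacher flags weaknesses
        answer = teacher.refine(question, answer, feedback)
    return {"prompt": question, "response": answer}
```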
arXiv Detail & Related papers (2024-01-28T14:32:15Z)
- JEDI: Joint Expert Distillation in a Semi-Supervised Multi-Dataset Student-Teacher Scenario for Video Action Recognition [29.67402932890899]
We propose JEDI, a multi-dataset semi-supervised learning method.
It efficiently combines knowledge from multiple experts, learned on different datasets, to train and improve the performance of individual per-dataset student models.
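One simple way to combine multiple experts' knowledge on unlabeled data is to distill toward their averaged predictions. The sketch below shows that baseline aggregation; it is not necessarily JEDI's exact objective.

```python
import torch
import torch.nn.functional as F

def multi_expert_distill_loss(student_logits, expert_logits_list):
    """student_logits: (B, C); expert_logits_list: list of (B, C) logits
    from experts trained on different datasets."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(e, dim=-1) for e in expert_logits_list])
        target = probs.mean(dim=0)  # consensus soft target across experts
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                    reduction="batchmean")
```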
arXiv Detail & Related papers (2023-08-09T13:09:07Z)
- Customizing Synthetic Data for Data-Free Student Learning [6.8080936803807734]
Data-free knowledge distillation (DFKD) aims to obtain a lightweight student model without the original training data.
To train the student model more effectively, the synthetic data should be customized to the student's current learning ability.
We propose Customizing Synthetic Data for Data-Free Student Learning (CSD) in this paper.
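A common way to realize this customization is an adversarial generator rewarded for samples on which the student still disagrees with the teacher, so sample difficulty tracks the student's progress. The following is a generic DFKD sketch, not necessarily CSD's exact loss:

```python
import torch
import torch.nn.functional as F

def generator_step(generator, teacher, student, opt_g, z_dim=128, batch=64):
    """One generator update: seek samples where the student still
    disagrees with the teacher (only opt_g is stepped here)."""
    z = torch.randn(batch, z_dim)
    x = generator(z)
    with torch.no_grad():
        t_logits = teacher(x)  # fixed target; no gradient through the teacher
    s_logits = student(x)
    kl = F.kl_div(F.log_softmax(s_logits, dim=-1),
                  F.softmax(t_logits, dim=-1), reduction="batchmean")
    opt_g.zero_grad()
    (-kl).backward()  # generator maximizes teacher-student disagreement
    opt_g.step()
    return kl.item()
```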
arXiv Detail & Related papers (2023-07-10T13:17:29Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to asymmetric students one-tenth their size that retain 95-97% of the teacher performance.
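At a high level, the geometric idea can be sketched as matching query-document similarity structure rather than raw scores alone. The loss below is an illustrative simplification of that idea:

```python
import torch
import torch.nn.functional as F

def geometry_distill_loss(q_s, d_s, q_t, d_t):
    """q_*: (B, dim) query embeddings; d_*: (B, dim) document embeddings.
    Student and teacher dims may differ; only the (B, B) similarity
    matrices are compared."""
    sim_s = F.normalize(q_s, dim=-1) @ F.normalize(d_s, dim=-1).T
    sim_t = F.normalize(q_t, dim=-1) @ F.normalize(d_t, dim=-1).T
    return F.mse_loss(sim_s, sim_t.detach())  # teacher geometry as target
```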
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Synthetic data generation method for data-free knowledge distillation in regression neural networks [0.0]
Knowledge distillation is the technique of compressing a larger neural network, known as the teacher, into a smaller neural network, known as the student.
Previous work has proposed a data-free knowledge distillation method where synthetic data are generated using a generator model trained adversarially against the student model.
In this study, we investigate the behavior of various synthetic data generation methods and propose a new synthetic data generation strategy.
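A minimal sketch of the adversarial generation loop for regression: the generator seeks inputs where student and teacher disagree most, and those inputs become the student's next training batch. Architectures and hyperparameters are placeholders.

```python
import torch

def adversarial_synthesis_step(generator, teacher, student, opt_g,
                               z_dim=32, batch=128):
    """Generate inputs maximizing the student-teacher regression gap;
    return them as the student's next synthetic training batch.
    Only the generator optimizer is stepped here."""
    z = torch.randn(batch, z_dim)
    x = generator(z)
    gap = (student(x) - teacher(x)).pow(2).mean()  # regression discrepancy
    opt_g.zero_grad()
    (-gap).backward()  # generator ascends the discrepancy
    opt_g.step()
    return x.detach()
```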
arXiv Detail & Related papers (2023-01-11T07:26:00Z)
- Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks have been proposed to improve the robustness of NER models.
In this paper, we propose an adaptive teacher learning method comprising two teacher-student networks.
The fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, which enhances consistent predictions on each model fragment against noise.
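The fragment-wise temporal moving average is the most concrete mechanism here; a minimal version that treats each parameter tensor as one fragment might look like this:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Fragment-wise temporal moving average: here each parameter
    tensor is treated as one fragment of the model."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)  # p_t = d*p_t + (1-d)*p_s
```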
arXiv Detail & Related papers (2022-12-13T12:14:09Z)
- Iterative Teacher-Aware Learning [136.05341445369265]
In human pedagogy, teachers and students can interact adaptively to maximize communication efficiency.
We propose a gradient-optimization-based teacher-aware learner that can incorporate the teacher's cooperative intention into the likelihood function.
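The intuition can be shown in a discrete Bayesian form: the learner replaces the naive data likelihood with the probability that a cooperative teacher would have chosen the observed example. The paper's learner is gradient-based; this sketch only illustrates the teacher-aware update.

```python
import numpy as np

def teacher_aware_update(prior, teaching_prob, shown):
    """prior: (H,) belief over hypotheses; teaching_prob: (H, X) where
    entry [h, x] is the chance a cooperative teacher shows example x
    under hypothesis h; shown: index of the example actually shown."""
    post = prior * teaching_prob[:, shown]  # teacher-aware likelihood
    return post / post.sum()
```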
arXiv Detail & Related papers (2021-10-01T00:27:47Z)
- Graph Consistency based Mean-Teaching for Unsupervised Domain Adaptive Person Re-Identification [54.58165777717885]
This paper proposes a Graph Consistency based Mean-Teaching (GCMT) method that constructs a Graph Consistency Constraint (GCC) between teacher and student networks.
Experiments on three datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, show that the proposed GCMT outperforms state-of-the-art methods by a clear margin.
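In outline, the constraint compares sample-affinity graphs built from teacher and student features within a batch. Below is a simplified version of such a loss, not necessarily GCMT's exact formulation:

```python
import torch
import torch.nn.functional as F

def graph_consistency_loss(feat_s, feat_t):
    """feat_*: (B, dim) batch features from the student and the EMA
    (mean) teacher; compare their sample-affinity graphs."""
    g_s = F.normalize(feat_s, dim=-1) @ F.normalize(feat_s, dim=-1).T
    g_t = F.normalize(feat_t, dim=-1) @ F.normalize(feat_t, dim=-1).T
    return F.mse_loss(g_s, g_t.detach())  # teacher graph supervises student graph
```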
arXiv Detail & Related papers (2021-05-11T04:09:49Z)