Related papers: Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios

URL: http://arxiv.org/abs/2406.05322v1
Date: Sat, 8 Jun 2024 02:17:43 GMT
Title: Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios
Authors: Yuhang Zhou, Wei Ai
Abstract summary: We propose a three-component framework leveraging three signal types. The first signal is the student's self-consistency (consistency across the student's multiple outputs), which is a proxy for the student's confidence. We show that our proposed two-stage framework brings a relative improvement of up to 20.79% compared to fine-tuning without any signals across datasets.
Score: 3.818273633647809
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: There is increasing interest in distilling task-specific knowledge from large language models (LLMs) to smaller student models. Nonetheless, LLM distillation presents a dual challenge: 1) there is a high cost associated with querying the teacher LLM, such as GPT-4, to gather an ample number of demonstrations; 2) the teacher LLM might provide imperfect outputs that negatively affect the student's learning process. To enhance sample efficiency within resource-constrained, imperfect teacher scenarios, we propose a three-component framework leveraging three signal types. The first signal is the student's self-consistency (consistency across the student's multiple outputs), which is a proxy for the student's confidence. Specifically, we introduce a "teaching assistant" (TA) model to assess the uncertainty of both the student's and the teacher's outputs via confidence scoring, which serve as the other two signals for student training. Furthermore, we propose a two-stage training schema that first warms up the student with a small proportion of the data to better utilize the student's signal. Experiments have shown the superiority of our proposed framework on four complex reasoning tasks. On average, our proposed two-stage framework brings a relative improvement of up to 20.79% compared to fine-tuning without any signals across datasets.
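As an illustration of the first signal, self-consistency can be read as the share of sampled student outputs that agree with the most common answer. The following is a minimal sketch under that assumption, not the paper's implementation: the function name `self_consistency`, the whitespace and case normalization, and the use of plain majority voting are all hypothetical choices made here for illustration.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement) for a list of sampled answers.

    Assumption: agreement is the fraction of samples matching the most
    common answer after trimming whitespace and lowercasing. The paper
    may define or normalize this signal differently.
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")
    # Normalize so that trivially different strings count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(normalized)


# Example: four sampled student answers to the same question.
answer, agreement = self_consistency(["42", "42 ", "41", "42"])
```

A high agreement value would suggest the student is confident on that input, while a low value could flag an example where an extra teacher query is worth its cost.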

Related papers

AdaSwitch: Adaptive Switching Generation for Knowledge Distillation [58.647880811071495]
Small language models (SLMs) are crucial for applications with strict latency and computational constraints. We propose AdaSwitch, a novel approach that combines on-policy and off-policy generation at the token level. AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
arXiv Detail & Related papers (2025-10-09T06:38:37Z)
Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization [69.96794098855938]
Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable language models (LLMs). Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. We introduce Alice, a framework that leverages complementary knowledge between teacher and student to enhance the learning process.
arXiv Detail & Related papers (2025-04-09T22:33:06Z)
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs [58.4911494598431]
DistiLLM-2 is a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses. Our experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, but also supports diverse applications.
arXiv Detail & Related papers (2025-03-10T08:51:32Z)
SEFL: Enhancing Educational Assignment Feedback with LLM Agents [5.191286314473505]
Synthetic Educational Feedback Loops (SEFL) is a synthetic data framework designed to generate data that resembles immediate, on-demand feedback at scale. To get this type of data, two large language models (LLMs) operate in teacher-student roles to simulate assignment completion and formative feedback. We show that SEFL-tuned models outperform both their non-tuned counterparts in feedback quality and an existing baseline.
arXiv Detail & Related papers (2025-02-18T15:09:29Z)
CFTS-GAN: Continual Few-Shot Teacher Student for Generative Adversarial Networks [0.5024983453990064]
Few-shot and continual learning face two well-known challenges in GANs: overfitting and catastrophic forgetting. This paper proposes a Continual Few-shot Teacher-Student technique for the generative adversarial network (CFTS-GAN) that considers both challenges together.
arXiv Detail & Related papers (2024-10-17T20:49:08Z)
Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models [7.632217365130212]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various machine learning (ML) tasks. These models can produce hallucinations, particularly in domains with incomplete knowledge. We introduce DualChecker, an innovative framework designed to mitigate hallucinations and improve the performance of both teacher and student models.
arXiv Detail & Related papers (2024-08-22T12:04:04Z)
Aligning Teacher with Student Preferences for Tailored Training Data Generation [40.85451525264779]
We propose ARTE, dubbed Aligning TeacheR with StudenT PreferencEs, to generate tailored training examples for Knowledge Distillation. Specifically, we elicit draft questions and rationales from the teacher model, then collect student preferences on these questions and rationales. In the end, we repeat the first step with the aligned teacher model to elicit tailored training examples for the student model on the target task.
arXiv Detail & Related papers (2024-06-27T14:51:17Z)
Distillation Matters: Empowering Sequential Recommenders to Match the Performance of Large Language Model [12.6937643116018]
Large Language Models (LLMs) have been effectively utilized as recommenders, achieving impressive performance. However, the high inference latency of LLMs significantly restricts their practical deployment. This work investigates knowledge distillation from cumbersome LLM-based recommendation models to lightweight sequential models.
arXiv Detail & Related papers (2024-05-01T06:23:54Z)
Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization [84.86241161706911]
We show that teacher LLMs can indeed intervene on student reasoning to improve their performance. We also demonstrate that in multi-turn interactions, teacher explanations generalize and learning from explained data improves student performance on future unexplained data. We verify that misaligned teachers can lower student performance to random chance by intentionally misleading them.
arXiv Detail & Related papers (2023-06-15T17:27:20Z)
Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks are proposed to improve the robustness of NER models. In this paper, we propose an adaptive teacher learning comprised of two teacher-student networks. Fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, which enhances consistent predictions on each model fragment against noise.
arXiv Detail & Related papers (2022-12-13T12:14:09Z)
Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER. Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z)
Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose the dynamic prior knowledge (DPK), which integrates part of teacher's features as the prior knowledge before the feature distillation. Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student. IKD trains the teacher model to generate specific soft target at each training step for a certain student. Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
Noisy Self-Knowledge Distillation for Text Summarization [83.49809205891496]
We apply self-knowledge distillation to text summarization which we argue can alleviate problems with maximum-likelihood training. Our student summarization model is trained with guidance from a teacher which generates smoothed labels to help regularize training. We demonstrate experimentally on three benchmarks that our framework boosts the performance of both pretrained and non-pretrained summarizers.
arXiv Detail & Related papers (2020-09-15T12:53:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.