Related papers: Who Taught You That? Tracing Teachers in Model Distillation

URL: http://arxiv.org/abs/2502.06659v1
Date: Mon, 10 Feb 2025 16:48:56 GMT
Title: Who Taught You That? Tracing Teachers in Model Distillation
Authors: Somin Wadhwa, Chantal Shaib, Silvio Amir, Byron C. Wallace
Abstract summary: We ask: Can we identify a student's teacher based on its outputs? We consider practical task distillation targets including summarization, question answering, and instruction-following. We design discriminative models that operate over lexical features.
Score: 23.566776089005963
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Model distillation -- using outputs from a large teacher model to teach a small student model -- is a practical means of creating efficient models for a particular task. We ask: Can we identify a student's teacher based on its outputs? Such "footprints" left by teacher LLMs would be interesting artifacts. Beyond this, reliable teacher inference may have practical implications as actors seek to distill specific capabilities of massive proprietary LLMs into deployed smaller LMs, potentially violating terms of service. We consider practical task distillation targets including summarization, question answering, and instruction-following. We assume a finite set of candidate teacher models, which we treat as black boxes. We design discriminative models that operate over lexical features. We find that $n$-gram similarity alone is unreliable for identifying teachers, but part-of-speech (PoS) templates preferred by student models mimic those of their teachers.
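
To make the two lexical signals named in the abstract concrete, the sketch below scores candidate teachers by n-gram overlap with a student's outputs, either over raw tokens or over a coarser symbol sequence such as PoS tags. It is a minimal illustration, not the authors' method: the paper trains discriminative models, whereas this only ranks by clipped overlap. All function names, the `symbolize` parameter, and the tiny `TOY_LEXICON` tagger are hypothetical stand-ins; a real PoS tagger would replace `toy_pos_tags`.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real PoS tagger (assumption).
TOY_LEXICON = {
    "the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "park": "NOUN", "sat": "VERB", "ran": "VERB", "on": "ADP", "in": "ADP",
}


def toy_pos_tags(tokens):
    """Map each token to a coarse tag; unknown tokens become 'X'."""
    return [TOY_LEXICON.get(tok, "X") for tok in tokens]


def ngrams(symbols, n):
    """Return all contiguous n-grams of a symbol sequence as tuples."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


def ngram_profile(texts, n=2, symbolize=None):
    """Count n-grams over whitespace tokens, or over symbolize(tokens)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        symbols = symbolize(tokens) if symbolize else tokens
        counts.update(ngrams(symbols, n))
    return counts


def overlap(student, teacher):
    """Fraction of student n-gram occurrences matched in the teacher profile."""
    total = sum(student.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, teacher[gram]) for gram, count in student.items())
    return shared / total


def rank_teachers(student_texts, teacher_texts, n=2, symbolize=None):
    """Rank candidate teachers by overlap with the student, best first."""
    student = ngram_profile(student_texts, n, symbolize)
    scores = {
        name: overlap(student, ngram_profile(texts, n, symbolize))
        for name, texts in teacher_texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    teachers = {"A": ["the cat sat on the mat"], "B": ["a dog ran in a park"]}
    student_outputs = ["the cat sat on a mat"]
    print(rank_teachers(student_outputs, teachers))                          # word bigrams
    print(rank_teachers(student_outputs, teachers, symbolize=toy_pos_tags))  # tag bigrams
```

Passing a tagging function through `symbolize` turns the same counting code into a template-style feature extractor, which is the contrast the abstract draws between surface n-grams and PoS templates.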

Related papers

Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A stochastic self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that the student distills from task-relevant representations only. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them. This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model. OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
Aligning Teacher with Student Preferences for Tailored Training Data Generation [40.85451525264779]
We propose ARTE, dubbed Aligning TeacheR with StudenT PreferencEs, to generate tailored training examples for Knowledge Distillation. Specifically, we elicit draft questions and rationales from the teacher model, then collect student preferences on these questions and rationales. In the end, we repeat the first step with the aligned teacher model to elicit tailored training examples for the student model on the target task.
arXiv Detail & Related papers (2024-06-27T14:51:17Z)
PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs [47.35598271306371]
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings. Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models. We present PLaD, a novel preference-based LLM distillation framework.
arXiv Detail & Related papers (2024-06-05T03:08:25Z)
Large Language Models are In-context Teachers for Knowledge Reasoning [8.869111204842248]
We study in-context teaching (ICT) where a teacher provides in-context example rationales to teach a student to reason over unseen cases. We ask whether a large language model (LLM) can serve as a more effective in-context teacher for itself or other LLMs, compared to humans.
arXiv Detail & Related papers (2023-11-12T23:14:43Z)
Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance. Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z)
One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression. We show that MT-BERT can train a high-quality student model from multiple teacher PLMs. Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z)
Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression. Current methods assign a fixed weight to a teacher model throughout distillation, and most allocate an equal weight to every teacher model. In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.