Representation Consolidation for Training Expert Students
- URL: http://arxiv.org/abs/2107.08039v1
- Date: Fri, 16 Jul 2021 17:58:18 GMT
- Title: Representation Consolidation for Training Expert Students
- Authors: Zhizhong Li, Avinash Ravichandran, Charless Fowlkes, Marzia Polito,
Rahul Bhotika, Stefano Soatto
- Abstract summary: We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
- Score: 54.90754502493968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditionally, distillation has been used to train a student model to emulate
the input/output functionality of a teacher. A more useful goal than emulation,
yet under-explored, is for the student to learn feature representations that
transfer well to future tasks. However, we observe that standard distillation
of task-specific teachers actually *reduces* the transferability of student
representations to downstream tasks. We show that a multi-head, multi-task
distillation method using an unlabeled proxy dataset and a generalist teacher
is sufficient to consolidate representations from task-specific teacher(s) and
improve downstream performance, outperforming the teacher(s) and the strong
baseline of ImageNet pretrained features. Our method can also combine the
representational knowledge of multiple teachers trained on one or multiple
domains into a single model, whose representation is improved on all teachers'
domain(s).
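
As a concrete illustration, here is a minimal PyTorch sketch of a multi-head, multi-task distillation setup of the kind the abstract describes: a shared student backbone with one lightweight head per frozen teacher, trained on unlabeled proxy images to match each teacher's outputs. The specific backbones, head shapes, temperature-scaled KL loss, and optimizer are illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Assumed setup: two frozen teachers standing in for the task-specific and
# generalist teachers, plus a student backbone with one lightweight head per
# teacher. Head sizes, loss, and optimizer are illustrative choices only.
teachers = nn.ModuleList([
    models.resnet50(weights="IMAGENET1K_V1"),
    models.resnet34(weights="IMAGENET1K_V1"),
]).eval()
for p in teachers.parameters():
    p.requires_grad_(False)

student = models.resnet18(weights=None)
student.fc = nn.Identity()                       # expose 512-d features
heads = nn.ModuleList([nn.Linear(512, 1000) for _ in teachers])

opt = torch.optim.SGD(
    list(student.parameters()) + list(heads.parameters()),
    lr=0.01, momentum=0.9)

def distill_step(proxy_images, temperature=4.0):
    """One multi-head distillation step on a batch of unlabeled proxy images."""
    feats = student(proxy_images)                # shared representation
    loss = 0.0
    for teacher, head in zip(teachers, heads):
        with torch.no_grad():
            t_logits = teacher(proxy_images)
        s_logits = head(feats)                   # per-teacher head
        loss = loss + F.kl_div(
            F.log_softmax(s_logits / temperature, dim=1),
            F.softmax(t_logits / temperature, dim=1),
            reduction="batchmean") * temperature ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```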
Related papers
- PromptKD: Unsupervised Prompt Distillation for Vision-Language Models [40.858721356497085]
We introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowledge of a larger teacher model to a lightweight target model.
Our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain (few-shot) labels.
In the subsequent stage, the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits.
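
As a hedged sketch of the second-stage mechanism summarized above, the snippet below shows one way shared (stored) class vectors could turn both teacher and student image features into logits that are then matched with a KL distillation loss; the function names, normalization, and temperature are assumptions, not PromptKD's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_logits(image_feats, class_vectors, scale=100.0):
    """Cosine-similarity logits between image features and stored class vectors."""
    image_feats = F.normalize(image_feats, dim=-1)
    class_vectors = F.normalize(class_vectors, dim=-1)
    return scale * image_feats @ class_vectors.t()

def prompt_distill_loss(student_img_feats, teacher_img_feats, class_vectors, T=1.0):
    """KL distillation between logits built from the same shared class vectors."""
    s_logits = clip_style_logits(student_img_feats, class_vectors)
    with torch.no_grad():
        t_logits = clip_style_logits(teacher_img_feats, class_vectors)
    return F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
```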
arXiv Detail & Related papers (2024-03-05T08:53:30Z)
- Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval [57.17075479691486]
We propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval.
Our source code is released at https://github.com/Maryeon/whiten_mtd.
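
The summary does not spell out the mechanism, but the title suggests whitening teacher outputs before distillation. Purely as an illustrative sketch (not the paper's algorithm), the snippet below ZCA-whitens each teacher's embeddings and averages them into a single distillation target for the student.

```python
import torch
import torch.nn.functional as F

def zca_whiten(x, eps=1e-5):
    """ZCA-whiten a batch of embeddings: zero mean, roughly identity covariance."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = x.t() @ x / (x.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    w = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.t()
    return x @ w

def multi_teacher_distill_loss(student_emb, teacher_embs):
    """Match student embeddings to the average of whitened teacher embeddings.

    Assumes all teachers produce embeddings of the same dimensionality
    (or have already been projected to a common space).
    """
    with torch.no_grad():
        target = torch.stack([zca_whiten(t) for t in teacher_embs]).mean(dim=0)
    return F.mse_loss(F.normalize(student_emb, dim=-1),
                      F.normalize(target, dim=-1))
```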
arXiv Detail & Related papers (2023-12-15T11:43:56Z)
- ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization [36.338614215561805]
Task-agnostic knowledge distillation attempts to address the problem of deploying large pretrained language models in resource-constrained scenarios.
We show that we can leverage multi-task learning in task-agnostic distillation to improve the generalization of the resulting student.
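
A minimal sketch of the general idea of adding multi-task supervision to a task-agnostic distillation objective, assuming a hidden-state matching term plus per-task cross-entropy losses; the weighting and loss choices are assumptions rather than the paper's recipe.

```python
import torch.nn.functional as F

def multitask_distill_loss(student_hidden, teacher_hidden,
                           task_logits, task_labels,
                           distill_weight=1.0, task_weight=0.1):
    """Task-agnostic hidden-state matching plus supervised losses from several tasks."""
    loss = distill_weight * F.mse_loss(student_hidden, teacher_hidden)
    for logits, labels in zip(task_logits, task_labels):   # one pair per task
        loss = loss + task_weight * F.cross_entropy(logits, labels)
    return loss
```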
arXiv Detail & Related papers (2023-01-09T15:12:50Z)
- Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment [52.704331909850026]
We formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network.
Inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space.
Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks.
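
A minimal sketch of instance-instance relation matching in the embedding space: build a row-normalized cosine-similarity graph over a batch for both teacher and student, then align them with a KL loss. The paper's exact graph construction and alignment objective may differ.

```python
import torch
import torch.nn.functional as F

def relation_logits(feats, temperature=0.1):
    """Pairwise cosine similarities over a batch (the edges of the relation graph)."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.t() / temperature

def graph_alignment_loss(student_feats, teacher_feats):
    """Align the student's instance-relation graph with the teacher's via KL."""
    s_log_graph = F.log_softmax(relation_logits(student_feats), dim=-1)
    with torch.no_grad():
        t_graph = F.softmax(relation_logits(teacher_feats), dim=-1)
    return F.kl_div(s_log_graph, t_graph, reduction="batchmean")
```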
arXiv Detail & Related papers (2022-11-23T19:27:48Z)
- Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation [22.87106703794863]
This motivates distilling knowledge from an ensemble teacher into a smaller student network.
We propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers.
We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student.
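
A rough sketch of the weight-averaging idea, assuming several identically-shaped student copies have each already been distilled against a different ensemble member and their parameters are then averaged into one student; the paper's subnetwork design, training loop, and perturbation strategy are omitted.

```python
import copy
import torch

@torch.no_grad()
def average_students(students):
    """Average the parameters of identically-shaped, already-distilled students.

    Batch-norm running statistics and other buffers are left untouched here;
    a real implementation would also need to handle them.
    """
    averaged = copy.deepcopy(students[0])
    for name, param in averaged.named_parameters():
        stacked = torch.stack([dict(s.named_parameters())[name] for s in students])
        param.copy_(stacked.mean(dim=0))
    return averaged
```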
arXiv Detail & Related papers (2022-06-30T06:23:03Z)
- Faculty Distillation with Optimal Transport [53.69235109551099]
We propose to link the teacher's task and the student's task by optimal transport.
Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions.
Experiments under various settings demonstrate the succinctness and versatility of our method.
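
A hedged sketch of using an optimal-transport plan between the teacher's and student's label spaces to map teacher predictions into soft targets for the student; the uniform marginals, entropic (Sinkhorn) solver, and the choice of cost matrix (e.g., distances between class-name embeddings) are illustrative assumptions.

```python
import torch

def sinkhorn_plan(cost, reg=0.05, iters=100):
    """Entropy-regularized OT plan between uniform marginals over the two label sets."""
    n_t, n_s = cost.shape
    mu = torch.full((n_t,), 1.0 / n_t, device=cost.device)   # teacher marginal
    nu = torch.full((n_s,), 1.0 / n_s, device=cost.device)   # student marginal
    K = torch.exp(-cost / reg)
    u, v = torch.ones_like(mu), torch.ones_like(nu)
    for _ in range(iters):
        u = mu / (K @ v)
        v = nu / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)               # (n_teacher, n_student)

def ot_soft_targets(teacher_probs, plan):
    """Push teacher class probabilities onto the student's label space."""
    mapping = plan / plan.sum(dim=1, keepdim=True)           # row-normalize the plan
    return teacher_probs @ mapping                           # (batch, n_student)
```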
arXiv Detail & Related papers (2022-04-25T09:34:37Z)
- Graph Consistency based Mean-Teaching for Unsupervised Domain Adaptive Person Re-Identification [54.58165777717885]
This paper proposes a Graph Consistency based Mean-Teaching (GCMT) method that constructs a Graph Consistency Constraint (GCC) between teacher and student networks.
Experiments on three datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, show that the proposed GCMT outperforms state-of-the-art methods by a clear margin.
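
A minimal sketch of the mean-teaching component, where the teacher is an exponential moving average of the student; the graph-consistency term itself would compare sample-relation graphs produced by the two networks and is not shown here.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Update the mean teacher as an exponential moving average of the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```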
arXiv Detail & Related papers (2021-05-11T04:09:49Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Representation Transfer by Optimal Transport [34.77292648424614]
We use optimal transport to quantify the match between two representations.
This distance defines a regularizer promoting the similarity of the student's representation with that of the teacher.
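
A hedged sketch of an entropic optimal-transport distance between a batch of student features and a batch of teacher features, used as a regularizer added to the task loss; the cost metric, Sinkhorn parameters, and weighting are assumptions, not the paper's exact formulation.

```python
import torch

def sinkhorn_distance(x, y, reg=0.1, iters=50):
    """Entropic OT cost between a batch of student features x and teacher features y."""
    cost = torch.cdist(x, y, p=2) ** 2                       # pairwise squared distances
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n, device=x.device)
    b = torch.full((m,), 1.0 / m, device=x.device)
    K = torch.exp(-cost / reg)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)
    return (plan * cost).sum()

# Usage sketch: total = task_loss + reg_weight * sinkhorn_distance(
#     student_feats, teacher_feats.detach())
```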
arXiv Detail & Related papers (2020-07-13T23:42:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.