Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge
Distillation
- URL: http://arxiv.org/abs/2312.15112v3
- Date: Mon, 19 Feb 2024 00:32:52 GMT
- Title: Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge
Distillation
- Authors: Chengming Hu, Haolun Wu, Xuan Li, Chen Ma, Xi Chen, Jun Yan, Boyu
Wang, Xue Liu
- Abstract summary: We introduce a novel adaptive method for learning a sample-wise knowledge fusion ratio.
We exploit the correctness of the teacher and the student, as well as how well the student mimics the teacher on each sample.
A simple neural network then learns the implicit mapping from the intra- and inter-sample relations to an adaptive, sample-wise knowledge fusion ratio.
- Score: 21.913044821863636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation aims to train a compact student network using soft
supervision from a larger teacher network and hard supervision from ground
truths. However, determining an optimal knowledge fusion ratio that balances
these supervisory signals remains challenging. Prior methods generally resort
to a constant or heuristic-based fusion ratio, which often falls short of a
proper balance. In this study, we introduce a novel adaptive method for
learning a sample-wise knowledge fusion ratio, exploiting the correctness of
the teacher and the student, as well as how well the student mimics the teacher
on each sample. Our method naturally leads to the intra-sample trilateral
geometric relations among the student prediction ($S$), teacher prediction
($T$), and ground truth ($G$). To counterbalance the impact of outliers, we
further extend to the inter-sample relations, incorporating the teacher's
global average prediction $\bar{T}$ for samples within the same class. A simple
neural network then learns the implicit mapping from the intra- and
inter-sample relations to an adaptive, sample-wise knowledge fusion ratio in a
bilevel-optimization manner. Our approach provides a simple, practical, and
adaptable solution for knowledge distillation that can be employed across
various architectures and model sizes. Extensive experiments demonstrate
consistent improvements over other loss re-weighting methods on image
classification, attack detection, and click-through rate prediction.
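As a rough, hedged sketch of the idea in the abstract (not the authors' released implementation), the snippet below blends the hard cross-entropy loss and the soft distillation loss with a per-sample fusion ratio predicted by a small network from the trilateral relations among the student prediction $S$, teacher prediction $T$, ground truth $G$, and the teacher's class-average prediction $\bar{T}$. The concrete distance features, the MLP width, and the temperature are illustrative assumptions, and the paper's bilevel optimization of the ratio network is omitted.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def trilateral_features(s_logits, t_logits, targets, t_class_mean):
    # Per-sample geometric relations among student prediction S, teacher
    # prediction T, ground truth G, and the teacher's class-average prediction
    # T_bar. The exact feature set here is an assumption for illustration.
    s = F.softmax(s_logits, dim=1)
    t = F.softmax(t_logits, dim=1)
    g = F.one_hot(targets, num_classes=s.size(1)).float()
    t_bar = t_class_mean[targets]                      # \bar{T} for each sample's class
    d_sg = (s - g).norm(dim=1, keepdim=True)           # student correctness
    d_tg = (t - g).norm(dim=1, keepdim=True)           # teacher correctness
    d_st = (s - t).norm(dim=1, keepdim=True)           # how well the student mimics the teacher
    d_tbar_g = (t_bar - g).norm(dim=1, keepdim=True)   # inter-sample (class-level) relation
    return torch.cat([d_sg, d_tg, d_st, d_tbar_g], dim=1)

class FusionRatioNet(nn.Module):
    # Small MLP mapping the geometric relations to a per-sample ratio in (0, 1).
    def __init__(self, in_dim=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):
        return self.net(feats).squeeze(1)

def fused_distillation_loss(s_logits, t_logits, targets, t_class_mean,
                            ratio_net, tau=4.0):
    # alpha weights the soft (teacher) supervision; 1 - alpha weights the
    # hard (ground-truth) supervision, per sample.
    feats = trilateral_features(s_logits, t_logits, targets, t_class_mean)
    alpha = ratio_net(feats)
    ce = F.cross_entropy(s_logits, targets, reduction="none")
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                  F.softmax(t_logits / tau, dim=1),
                  reduction="none").sum(dim=1) * tau * tau
    return (alpha * kd + (1.0 - alpha) * ce).mean()
```
For a quick check, `t_class_mean` can be any `[num_classes, num_classes]` tensor of per-class average teacher probabilities, e.g. precomputed over the training set; in the paper's setting the ratio network would be updated in a bilevel manner rather than jointly with the student.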
Related papers
- Faithful Label-free Knowledge Distillation [8.572967695281054]
This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM).
It produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection.
arXiv Detail & Related papers (2024-11-22T01:48:44Z)
- CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective [48.99488315273868]
We present a contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints.
Our method minimizes logit differences within the same sample by considering their numerical values.
We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO.
arXiv Detail & Related papers (2024-04-22T11:52:40Z)
- Mitigating Accuracy-Robustness Trade-off via Balanced Multi-Teacher Adversarial Distillation [12.39860047886679]
Adversarial Training is a practical approach for improving the robustness of deep neural networks against adversarial attacks.
We introduce Balanced Multi-Teacher Adversarial Robustness Distillation (B-MTARD) to guide the model's Adversarial Training process.
B-MTARD outperforms the state-of-the-art methods against various adversarial attacks.
arXiv Detail & Related papers (2023-06-28T12:47:01Z)
- Teaching What You Should Teach: A Data-Based Distillation Method [20.595460553747163]
We introduce the "Teaching what you Should Teach" strategy into a knowledge distillation framework.
We propose a data-based distillation method named "TST" that searches for desirable augmented samples to assist in distilling more efficiently and rationally.
Specifically, we design a neural network-based data augmentation module with a prior bias, which helps find samples that play to the teacher's strengths but expose the student's weaknesses.
arXiv Detail & Related papers (2022-12-11T06:22:14Z)
- Intra-class Adaptive Augmentation with Neighbor Correction for Deep Metric Learning [99.14132861655223]
We propose a novel intra-class adaptive augmentation (IAA) framework for deep metric learning.
We estimate intra-class variations for every class and generate adaptive synthetic samples to support hard sample mining.
Our method outperforms state-of-the-art methods, improving retrieval performance by 3%-6%.
arXiv Detail & Related papers (2022-11-29T14:52:38Z)
- MDFlow: Unsupervised Optical Flow Learning by Reliable Mutual Knowledge Distillation [12.249680550252327]
Current approaches impose an augmentation regularization term for continual self-supervision.
We propose a novel mutual distillation framework to transfer reliable knowledge back and forth between the teacher and student networks.
Our approach, termed MDFlow, achieves state-of-the-art real-time accuracy and generalization ability on challenging benchmarks.
arXiv Detail & Related papers (2022-11-11T05:56:46Z)
- Knowledge Distillation from A Stronger Teacher [44.11781464210916]
This paper presents a method dubbed DIST to distill better from a stronger teacher.
We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be more severe.
Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures.
arXiv Detail & Related papers (2022-05-21T08:30:58Z)
- Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z)
- Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the expense of high computational costs and are hard to deploy to low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
- Learning Diverse Representations for Fast Adaptation to Distribution Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z)
- Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
arXiv Detail & Related papers (2020-03-09T17:50:12Z)
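A minimal sketch of the statistics-transfer idea in the last entry above, assuming plain per-channel mean/std matching between teacher and student feature maps; the paper itself goes beyond this basic formulation, so the loss below is illustrative rather than its exact objective.
```python
import torch
import torch.nn.functional as F

def stats_matching_loss(student_feat, teacher_feat, eps=1e-5):
    # Match per-channel mean and standard deviation of [B, C, H, W] feature
    # maps; a plain baseline of the "transfer feature statistics" idea, not
    # the paper's actual method.
    s_mean = student_feat.mean(dim=(2, 3))
    t_mean = teacher_feat.mean(dim=(2, 3))
    s_std = (student_feat.var(dim=(2, 3), unbiased=False) + eps).sqrt()
    t_std = (teacher_feat.var(dim=(2, 3), unbiased=False) + eps).sqrt()
    return F.mse_loss(s_mean, t_mean) + F.mse_loss(s_std, t_std)
```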