Estimating and Maximizing Mutual Information for Knowledge Distillation
- URL: http://arxiv.org/abs/2110.15946v3
- Date: Thu, 11 May 2023 13:08:01 GMT
- Title: Estimating and Maximizing Mutual Information for Knowledge Distillation
- Authors: Aman Shrivastava, Yanjun Qi, Vicente Ordonez
- Abstract summary: We propose Mutual Information Maximization Knowledge Distillation (MIMKD).
Our method uses a contrastive objective to simultaneously estimate and maximize a lower bound on the mutual information of local and global feature representations between a teacher and a student network.
This can be used to improve the performance of low capacity models by transferring knowledge from more performant but computationally expensive models.
- Score: 24.254198219979667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we propose Mutual Information Maximization Knowledge
Distillation (MIMKD). Our method uses a contrastive objective to simultaneously
estimate and maximize a lower bound on the mutual information of local and
global feature representations between a teacher and a student network. We
demonstrate through extensive experiments that this can be used to improve the
performance of low capacity models by transferring knowledge from more
performant but computationally expensive models, and thus to produce better
models that can run on devices with limited computational resources. Our method
is flexible: we can distill knowledge from teachers with arbitrary network
architectures to arbitrary student networks. Our empirical results show that
MIMKD outperforms competing approaches across a wide range of student-teacher
pairs with different capacities and architectures, and even when student
networks have extremely low capacity. We obtain 74.55% accuracy on CIFAR-100
with a ShuffleNetV2 student, up from a 69.8% baseline, by distilling knowledge
from a ResNet-50 teacher. On ImageNet, we improve a ResNet-18 network from
68.88% to 70.32% accuracy (+1.44%) using a ResNet-34 teacher network.
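To make the core idea concrete, the sketch below shows a generic contrastive (InfoNCE-style) lower bound on the mutual information between teacher and student feature representations: minimizing the cross-entropy over in-batch positives and negatives maximizes the bound. This is a minimal sketch, not the paper's exact estimator; the class name, projection heads (project_s, project_t), embedding size, and temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveMILowerBound(nn.Module):
    """InfoNCE-style critic: minimizing the returned loss maximizes a lower
    bound on I(student features; teacher features). Hypothetical sketch, not
    the MIMKD paper's exact objective or architecture."""

    def __init__(self, student_dim, teacher_dim, embed_dim=128, temperature=0.07):
        super().__init__()
        # Small projection heads map both networks into a shared embedding space.
        self.project_s = nn.Linear(student_dim, embed_dim)
        self.project_t = nn.Linear(teacher_dim, embed_dim)
        self.temperature = temperature

    def forward(self, student_feats, teacher_feats):
        # student_feats: (B, student_dim), teacher_feats: (B, teacher_dim)
        zs = F.normalize(self.project_s(student_feats), dim=1)
        zt = F.normalize(self.project_t(teacher_feats), dim=1)
        # Pairwise similarities between every student and teacher embedding.
        logits = zs @ zt.t() / self.temperature  # (B, B)
        # Matching (student_i, teacher_i) pairs are positives; all other pairs
        # in the batch serve as negatives.
        targets = torch.arange(zs.size(0), device=zs.device)
        return F.cross_entropy(logits, targets)
```

In a distillation loop this term would be added to the usual task loss, e.g. `loss = F.cross_entropy(student_logits, labels) + beta * mi_loss(f_s, f_t.detach())` with a frozen teacher; the global and local objectives mentioned in the abstract would presumably apply the same idea to pooled and spatial feature representations, respectively.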
Related papers
- Semantic Knowledge Distillation for Onboard Satellite Earth Observation Image Classification [28.08042498882207]
This study presents an innovative dynamic weighting knowledge distillation (KD) framework tailored for efficient Earth observation (EO) image classification (IC) in resource-constrained settings.
Our framework enables lightweight student models to surpass 90% in accuracy, precision, and recall, adhering to the stringent confidence thresholds necessary for reliable classification tasks.
Remarkably, ResNet8 delivers substantial efficiency gains, achieving a 97.5% reduction in parameters, a 96.7% decrease in FLOPs, an 86.2% cut in power consumption, and a 63.5% increase in inference speed over MobileViT.
arXiv Detail & Related papers (2024-10-31T21:13:40Z)
- Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments [4.541309099803903]
This study proposes a method for knowledge distillation (KD) of fine-tuned Large Language Models (LLMs).
We specifically target the challenge of deploying these models on resource-constrained devices.
arXiv Detail & Related papers (2023-12-26T01:24:25Z)
- Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- A Light-weight Deep Learning Model for Remote Sensing Image Classification [70.66164876551674]
We present a high-performance and light-weight deep learning model for Remote Sensing Image Classification (RSIC).
In extensive experiments on the NWPU-RESISC45 benchmark, our proposed teacher-student models outperform state-of-the-art systems.
arXiv Detail & Related papers (2023-02-25T09:02:01Z)
- Learning Knowledge Representation with Meta Knowledge Distillation for Single Image Super-Resolution [82.89021683451432]
We propose a model-agnostic meta knowledge distillation method under the teacher-student architecture for the single image super-resolution task.
Experiments conducted on various single image super-resolution datasets demonstrate that our proposed method outperforms existing distillation methods that rely on pre-defined knowledge representations.
arXiv Detail & Related papers (2022-07-18T02:41:04Z)
- Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation [20.17325172100031]
We propose a novel student-helping-teacher formula, Teacher Evolution via Self-Knowledge Distillation (TESKD), where the target teacher is learned with the help of multiple hierarchical students by sharing the structural backbone.
The effectiveness of our proposed framework is demonstrated by extensive experiments with various network settings on two standard benchmarks including CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2021-10-01T11:46:12Z)
- LGD: Label-guided Self-distillation for Object Detection [59.9972914042281]
We propose the first self-distillation framework for general object detection, termed LGD (Label-Guided self-Distillation).
Our framework involves sparse label-appearance encoding, inter-object relation adaptation and intra-object knowledge mapping to obtain the instructive knowledge.
Compared with the classical teacher-based method FGFI, LGD not only performs better without requiring a pretrained teacher but also incurs 51% lower training cost beyond inherent student learning.
arXiv Detail & Related papers (2021-09-23T16:55:01Z)
- Spirit Distillation: A Model Compression Method with Multi-domain Knowledge Transfer [5.0919090307185035]
We propose a new knowledge distillation model, named Spirit Distillation (SD), which is a model compression method with multi-domain knowledge transfer.
Results demonstrate that our method can boost mIOU and high-precision accuracy by 1.4% and 8.2% respectively with 78.2% segmentation variance.
arXiv Detail & Related papers (2021-04-29T23:19:51Z)
- DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning [94.89221799550593]
Self-supervised representation learning (SSL) has received widespread attention from the community.
Recent research argues that its performance suffers a cliff fall when the model size decreases.
We propose a simple yet effective Distilled Contrastive Learning (DisCo) to ease the issue by a large margin.
arXiv Detail & Related papers (2021-04-19T08:22:52Z)
- Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [91.1317510066954]
We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
arXiv Detail & Related papers (2020-12-17T06:52:16Z)