Adaptive Teaching with Shared Classifier for Knowledge Distillation
- URL: http://arxiv.org/abs/2406.08528v2
- Date: Fri, 14 Jun 2024 08:19:28 GMT
- Title: Adaptive Teaching with Shared Classifier for Knowledge Distillation
- Authors: Jaeyeon Jang, Young-Ik Kim, Jisu Lim, Hyeonseong Lee
- Abstract summary: Knowledge distillation (KD) is a technique used to transfer knowledge from a teacher network to a student network.
We propose adaptive teaching with a shared classifier (ATSC).
Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multi-teacher scenarios.
- Score: 6.03477652126575
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge distillation (KD) is a technique used to transfer knowledge from an overparameterized teacher network to a less-parameterized student network, thereby minimizing the incurred performance loss. KD methods can be categorized into offline and online approaches. Offline KD leverages a powerful pretrained teacher network, while online KD allows the teacher network to be adjusted dynamically to enhance the learning effectiveness of the student network. Recently, it has been discovered that sharing the classifier of the teacher network can significantly boost the performance of the student network with only a minimal increase in the number of network parameters. Building on these insights, we propose adaptive teaching with a shared classifier (ATSC). In ATSC, the pretrained teacher network self-adjusts to better align with the learning needs of the student network based on its capabilities, and the student network benefits from the shared classifier, enhancing its performance. Additionally, we extend ATSC to environments with multiple teachers. We conduct extensive experiments, demonstrating the effectiveness of the proposed KD method. Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multi-teacher scenarios, with only a modest increase in the number of required model parameters. The source code is publicly available at https://github.com/random2314235/ATSC.
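The abstract does not spell out the training objective, but the core idea (a student backbone trained through the teacher's shared classifier while the pretrained teacher self-adjusts toward the student) can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch: the toy backbones, the MSE feature-alignment term, the frozen shared classifier, and the small separate learning rate for the teacher are assumptions, not the authors' exact formulation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in backbones; both end in a 512-d feature so the classifier can be shared.
teacher_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())
student_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                                 nn.Linear(256, 512), nn.ReLU())
shared_classifier = nn.Linear(512, 100)      # pretrained with the teacher, reused by the student
for p in shared_classifier.parameters():     # kept fixed here (an assumption)
    p.requires_grad_(False)

# The student learns normally; the pretrained teacher "self-adjusts" with a much smaller LR.
student_opt = torch.optim.SGD(student_backbone.parameters(), lr=0.05, momentum=0.9)
teacher_opt = torch.optim.SGD(teacher_backbone.parameters(), lr=0.001, momentum=0.9)

def atsc_step(x, y, feat_weight=1.0):
    t_feat, s_feat = teacher_backbone(x), student_backbone(x)
    t_logits, s_logits = shared_classifier(t_feat), shared_classifier(s_feat)
    # Task loss on both branches plus feature alignment; the alignment term is not
    # detached, so the teacher is also pulled toward features the student can produce.
    loss = (F.cross_entropy(s_logits, y) + F.cross_entropy(t_logits, y)
            + feat_weight * F.mse_loss(s_feat, t_feat))
    student_opt.zero_grad(); teacher_opt.zero_grad()
    loss.backward()
    student_opt.step(); teacher_opt.step()
    return loss.item()

# Toy usage on CIFAR-100-shaped random data.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 100, (8,))
print(atsc_step(x, y))
```
In practice the backbones would be convolutional networks whose penultimate width matches the teacher's classifier; the linear layers above only keep the sketch self-contained.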
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
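The OKD summary above only states that small online modules on the teacher are trained concurrently with the student. As an illustration, a rough classification-style sketch (the paper itself targets autoregressive language models) with an assumed linear online module might look like this:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, temp = 10, 2.0
teacher = nn.Linear(128, num_classes)                 # stand-in for a large pretrained teacher
for p in teacher.parameters():                        # the pretrained teacher itself stays frozen
    p.requires_grad_(False)
online_module = nn.Linear(num_classes, num_classes)   # small trainable module on the teacher's head
student = nn.Linear(128, num_classes)

opt = torch.optim.Adam(list(student.parameters()) + list(online_module.parameters()), lr=1e-3)

def okd_step(x, y, alpha=0.5):
    t_logits = online_module(teacher(x))   # teacher output adapted by its online module
    s_logits = student(x)
    # The online module is trained with the task loss, so the adapted teacher
    # keeps tracking the current data while the student distills from it.
    kd = F.kl_div(F.log_softmax(s_logits / temp, dim=1),
                  F.softmax(t_logits.detach() / temp, dim=1),
                  reduction="batchmean") * temp * temp
    loss = alpha * kd + (1 - alpha) * F.cross_entropy(s_logits, y) + F.cross_entropy(t_logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x, y = torch.randn(16, 128), torch.randint(0, num_classes, (16,))
print(okd_step(x, y))
```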
- BD-KD: Balancing the Divergences for Online Knowledge Distillation [12.27903419909491]
We propose BD-KD: Balancing of Divergences for online Knowledge Distillation.
We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network.
We demonstrate that, by performing this balancing design at the level of the student distillation loss, we improve upon both performance accuracy and calibration of the compact student network.
arXiv Detail & Related papers (2022-12-25T22:27:32Z)
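A minimal sketch of the forward/reverse divergence balancing described in the BD-KD entry above. The per-sample weighting rule used here (favoring whichever divergence is currently larger) is an illustrative assumption, not the paper's exact scheme:
```python
import torch
import torch.nn.functional as F

def bd_kd_loss(student_logits, teacher_logits, temp=4.0, eps=1e-8):
    """Balance forward and reverse KL between softened teacher and student outputs."""
    log_p_s = F.log_softmax(student_logits / temp, dim=1)
    p_s = log_p_s.exp()
    p_t = F.softmax(teacher_logits / temp, dim=1).detach()
    log_p_t = p_t.clamp_min(eps).log()

    fwd = (p_t * (log_p_t - log_p_s)).sum(dim=1)   # KL(teacher || student)
    rev = (p_s * (log_p_s - log_p_t)).sum(dim=1)   # KL(student || teacher)

    # Assumed rule: put more weight on whichever divergence is larger for a sample.
    w = (fwd / (fwd + rev + eps)).detach()
    return ((w * fwd + (1.0 - w) * rev) * temp * temp).mean()

# Toy usage with random logits.
s = torch.randn(8, 100, requires_grad=True)
t = torch.randn(8, 100)
loss = bd_kd_loss(s, t)
loss.backward()
print(loss.item(), s.grad.shape)
```
The balanced term would be added to the student's usual cross-entropy loss; the teacher side is detached since BD-KD acts at the level of the student distillation loss.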
- CES-KD: Curriculum-based Expert Selection for Guided Knowledge Distillation [4.182345120164705]
This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD).
CES-KD is built upon the hypothesis that a student network should be guided gradually using a stratified teaching curriculum.
Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty in classifying the image.
arXiv Detail & Related papers (2022-09-15T21:02:57Z)
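A rough sketch of per-image teacher selection as described in the CES-KD entry above. The difficulty proxy (the strongest teacher's loss) and the easy-to-strong mapping are assumptions made for illustration:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical pool ordered from lowest-capacity TA to the full teacher.
teachers = nn.ModuleList([nn.Linear(64, 10) for _ in range(3)])
student = nn.Linear(64, 10)
opt = torch.optim.SGD(student.parameters(), lr=0.1)

def ces_kd_step(x, y, temp=3.0):
    with torch.no_grad():
        all_logits = torch.stack([t(x) for t in teachers])         # (T, B, C)
        # Difficulty proxy: loss of the strongest teacher on each image.
        difficulty = F.cross_entropy(all_logits[-1], y, reduction="none")
        # Curriculum mapping (an assumption): easy images get the strongest
        # teacher, hard images a lower-capacity guide closer to the student.
        ranks = difficulty.argsort().argsort().float() / max(len(y) - 1, 1)
        idx = ((1.0 - ranks) * (len(teachers) - 1)).round().long()  # (B,)
        chosen = all_logits[idx, torch.arange(len(y))]              # one teacher per image
    s_logits = student(x)
    kd = F.kl_div(F.log_softmax(s_logits / temp, dim=1),
                  F.softmax(chosen / temp, dim=1),
                  reduction="batchmean") * temp * temp
    loss = F.cross_entropy(s_logits, y) + kd
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x, y = torch.randn(16, 64), torch.randint(0, 10, (16,))
print(ces_kd_step(x, y))
```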
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
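A minimal sketch of injecting part of the teacher's features as prior knowledge during feature distillation, as in the DPK entry above. The random spatial mask and the fixed prior ratio are assumptions; the actual masking granularity and schedule may differ:
```python
import torch

def dpk_feature_distill(student_feat, teacher_feat, prior_ratio=0.5):
    """Feature distillation with part of the teacher's feature given as a prior."""
    teacher_feat = teacher_feat.detach()
    # Positions with mask=1 are filled in from the teacher as prior knowledge.
    mask = (torch.rand(student_feat.shape[0], 1, *student_feat.shape[2:],
                       device=student_feat.device) < prior_ratio).float()
    # The mixed map is what later student layers would consume in this sketch.
    mixed = mask * teacher_feat + (1.0 - mask) * student_feat
    # Distillation loss only on the positions the student produced itself.
    loss = ((student_feat - teacher_feat) ** 2 * (1.0 - mask)).sum() / (
        (1.0 - mask).sum() * student_feat.shape[1] + 1e-8)
    return mixed, loss

# Toy usage: feature maps of shape (batch, channels, H, W), assumed to match.
s = torch.randn(4, 64, 8, 8, requires_grad=True)
t = torch.randn(4, 64, 8, 8)
mixed, loss = dpk_feature_distill(s, t)
loss.backward()
print(loss.item(), mixed.shape)
```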
- Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For Model Compression [2.538209532048867]
Mutual Learning (ML) provides an alternative strategy where multiple simple student networks benefit from sharing knowledge.
We propose a single-teacher, multi-student framework that leverages both KD and ML to achieve better performance.
arXiv Detail & Related papers (2021-10-21T09:59:31Z)
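A compact sketch of a single-teacher, multi-student setup that combines KD from the teacher with mutual learning among the students, as described above. The loss weights and the use of softened KL for both terms are assumptions:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

temp, num_classes = 3.0, 10
teacher = nn.Linear(32, num_classes)
for p in teacher.parameters():
    p.requires_grad_(False)
students = nn.ModuleList([nn.Linear(32, num_classes) for _ in range(3)])
opt = torch.optim.SGD(students.parameters(), lr=0.1)

def soft_kl(student_logits, target_logits):
    return F.kl_div(F.log_softmax(student_logits / temp, dim=1),
                    F.softmax(target_logits.detach() / temp, dim=1),
                    reduction="batchmean") * temp * temp

def kd_ml_step(x, y, kd_w=1.0, ml_w=1.0):
    t_logits = teacher(x)
    logits = [s(x) for s in students]
    loss = 0.0
    for i, li in enumerate(logits):
        peers = [lj for j, lj in enumerate(logits) if j != i]
        ml = sum(soft_kl(li, lj) for lj in peers) / len(peers)  # mutual-learning term
        loss = loss + F.cross_entropy(li, y) + kd_w * soft_kl(li, t_logits) + ml_w * ml
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x, y = torch.randn(16, 32), torch.randint(0, num_classes, (16,))
print(kd_ml_step(x, y))
```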
- Densely Guided Knowledge Distillation using Multiple Teacher Assistants [5.169724825219126]
We propose a densely guided knowledge distillation using multiple teacher assistants that gradually decreases the model size.
We also design stochastic teaching in which, for each mini-batch, the teacher or some of the teacher assistants are randomly dropped.
This acts as a regularizer and improves the teaching efficiency for the student network.
arXiv Detail & Related papers (2020-09-18T13:12:52Z)
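A minimal sketch of densely guided distillation with stochastic dropping of the teacher and teacher assistants per mini-batch, as described above. The keep probability and the averaging over kept guides are assumptions:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

temp, num_classes = 4.0, 10
# Guides ordered from the full teacher down to the smallest teacher assistant.
guides = nn.ModuleList([nn.Linear(32, num_classes) for _ in range(4)])
for p in guides.parameters():
    p.requires_grad_(False)
student = nn.Linear(32, num_classes)
opt = torch.optim.SGD(student.parameters(), lr=0.1)

def dgkd_step(x, y, keep_prob=0.75):
    s_logits = student(x)
    log_p_s = F.log_softmax(s_logits / temp, dim=1)
    # Stochastic teaching: each guide is independently dropped for this mini-batch.
    kept = [g for g in guides if torch.rand(()) < keep_prob]
    if not kept:                       # make sure at least one guide remains
        kept = [guides[0]]
    with torch.no_grad():
        targets = [F.softmax(g(x) / temp, dim=1) for g in kept]
    kd = 0.0
    for p_t in targets:                # densely guided: distill from every kept guide
        kd = kd + F.kl_div(log_p_s, p_t, reduction="batchmean") * temp * temp
    loss = F.cross_entropy(s_logits, y) + kd / len(targets)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x, y = torch.randn(16, 32), torch.randint(0, num_classes, (16,))
print(dgkd_step(x, y))
```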
- Point Adversarial Self Mining: A Simple Method for Facial Expression Recognition [79.75964372862279]
We propose Point Adversarial Self Mining (PASM) to improve the recognition accuracy in facial expression recognition.
PASM uses a point adversarial attack method and a trained teacher network to locate the most informative position related to the target task.
The adaptive generation of learning materials and the teacher/student updates can be conducted more than once, iteratively improving the network capability.
arXiv Detail & Related papers (2020-08-26T06:39:24Z)
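A rough sketch of the learning-material generation step described in the PASM entry above: the teacher's input gradient serves as a saliency map, and a small perturbation is applied at the most informative point. The patch size, step size, and FGSM-style update are assumptions; in the full method the generated samples would then be used to update the teacher and student, and the cycle repeated:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 48 * 48, 7))  # stand-in expression classifier
for p in teacher.parameters():
    p.requires_grad_(False)

def point_adversarial_materials(images, labels, patch=7, eps=0.1):
    """Generate new learning material by perturbing the most informative point per image."""
    images = images.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(teacher(images), labels)
    grad, = torch.autograd.grad(loss, images)
    saliency = grad.abs().sum(dim=1)                 # (B, H, W) saliency from input gradients
    B, H, W = saliency.shape
    flat_idx = saliency.view(B, -1).argmax(dim=1)
    ys, xs = (flat_idx // W).tolist(), (flat_idx % W).tolist()
    out = images.detach().clone()
    half = patch // 2
    for b in range(B):
        y0, y1 = max(ys[b] - half, 0), min(ys[b] + half + 1, H)
        x0, x1 = max(xs[b] - half, 0), min(xs[b] + half + 1, W)
        # Small sign-of-gradient perturbation confined to a patch around the located point.
        out[b, :, y0:y1, x0:x1] += eps * grad[b, :, y0:y1, x0:x1].sign()
    return out.clamp(0, 1)

imgs, labels = torch.rand(4, 3, 48, 48), torch.randint(0, 7, (4,))
new_material = point_adversarial_materials(imgs, labels)
print(new_material.shape)
```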
- Knowledge Transfer via Dense Cross-Layer Mutual-Distillation [24.24969126783315]
We propose Dense Cross-layer Mutual-distillation (DCM) in which the teacher and student networks are trained collaboratively from scratch.
To boost KT performance, we introduce dense bidirectional KD operations between the layers with appended classifiers.
We test our method on a variety of KT tasks, showing its superiority over related methods.
arXiv Detail & Related papers (2020-08-18T09:25:08Z) - Interactive Knowledge Distillation [79.12866404907506]
- Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation (IAKD) scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z) - Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming limitations of existing KD approaches by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z) - Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
- Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.