Expandable Residual Approximation for Knowledge Distillation
- URL: http://arxiv.org/abs/2508.16050v1
- Date: Fri, 22 Aug 2025 02:57:13 GMT
- Title: Expandable Residual Approximation for Knowledge Distillation
- Authors: Zhaoyi Yan, Binghui Chen, Yunfan Liu, Qixiang Ye
- Abstract summary: Knowledge distillation aims to transfer knowledge from a large-scale teacher model to a lightweight one. The inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge. We propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps.
- Score: 44.146649875415754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher's representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher's head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. Codes and models are available at https://github.com/Zhaoyi-Yan/ERA.
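The multi-step residual idea in the abstract can be illustrated with a toy sketch. This is not the authors' MBRNet or their training procedure: it assumes each "branch" is a fixed random `tanh` feature map whose output weights are fit by least squares, and the names `fit_residual_branches`, `num_branches`, and `hidden` are hypothetical. It only shows the divide-and-conquer pattern, where each branch approximates whatever residual the previous branches left behind.

```python
import numpy as np


def fit_residual_branches(student_feat, teacher_feat, num_branches=3, hidden=16, seed=0):
    """Toy progressive residual approximation (illustrative, not the paper's MBRNet).

    Each branch fits the residual left between the teacher features and the
    current approximation, so the gap is closed in several smaller steps.
    """
    rng = np.random.default_rng(seed)
    approx = student_feat.copy()
    residual = teacher_feat - approx
    errors = [float(np.linalg.norm(residual))]
    for _ in range(num_branches):
        # Assumption: a branch is a fixed random hidden layer with a
        # least-squares output layer, standing in for a trained branch.
        w_in = rng.normal(size=(student_feat.shape[1], hidden))
        h = np.tanh(student_feat @ w_in)
        w_out, *_ = np.linalg.lstsq(h, residual, rcond=None)
        approx = approx + h @ w_out
        residual = teacher_feat - approx
        errors.append(float(np.linalg.norm(residual)))
    return approx, errors


# Synthetic features: the "teacher" is a nonlinear function of the student input.
data_rng = np.random.default_rng(1)
student = data_rng.normal(size=(64, 8))
teacher = np.sin(student @ data_rng.normal(size=(8, 8)))
approx, errors = fit_residual_branches(student, teacher)
print(errors)  # residual norm after 0, 1, 2, 3 branches
```

Because a zero output layer is always a feasible least-squares solution, each added branch cannot increase the residual norm in this toy setup. The Teacher Weight Integration strategy is not modeled here.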
Related papers
- Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective [9.10299144143817]
Decoupled Knowledge Distillation (DKD) re-emphasizes the importance of logit knowledge through advanced decoupling and strategies. We introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss. We demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods.
arXiv Detail & Related papers (2025-12-04T09:56:25Z)
- Cross-View Consistency Regularisation for Knowledge Distillation [13.918476599394603]
This work is inspired by the success of cross-view learning in fields such as semi-supervised learning. We introduce within-view and cross-view regularisations to standard logit-based distillation frameworks. We also perform confidence-based soft label mining to improve the quality of distilling signals from the teacher.
arXiv Detail & Related papers (2024-12-21T05:41:47Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models. Most existing KD techniques rely on Kullback-Leibler (KL) divergence. We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- Knowledge Distillation with Deep Supervision [6.8080936803807734]
We propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes class predictions and feature maps of the teacher model to supervise the training of shallow student layers.
A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer, so as to further improve the student performance.
arXiv Detail & Related papers (2022-02-16T03:58:21Z)
- Online Knowledge Distillation for Efficient Pose Estimation [37.81478634850458]
We investigate a novel Online Knowledge Distillation framework by distilling Human Pose structure knowledge in a one-stage manner.
OKDHP trains a single multi-branch network and acquires the predicted heatmaps from each branch.
The pixel-wise Kullback-Leibler divergence is utilized to minimize the discrepancy between the target heatmaps and the predicted ones.
arXiv Detail & Related papers (2021-08-04T14:49:44Z)
- Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement [56.40587594647692]
We propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED).
TRED disentangles the knowledge relevant to the target task from the original source model and uses it as a regularizer while fine-tuning the target model.
Experiments on various real-world datasets show that our method stably improves on standard fine-tuning by more than 2% on average.
arXiv Detail & Related papers (2020-10-16T17:45:08Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A).
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.