StableKD: Breaking Inter-block Optimization Entanglement for Stable Knowledge Distillation
- URL: http://arxiv.org/abs/2312.13223v2
- Date: Mon, 23 Sep 2024 14:37:35 GMT
- Title: StableKD: Breaking Inter-block Optimization Entanglement for Stable Knowledge Distillation
- Authors: Shiu-hong Kao, Jierun Chen, S. H. Gary Chan
- Abstract summary: We propose StableKD, a novel KD framework that breaks the IBOE and achieves more stable optimization.
Compared to other KD approaches, our simple yet effective StableKD greatly boosts the model accuracy by 1% ~ 18%, speeds up the convergence up to 10 times, and outperforms them with only 40% of the training data.
- Score: 11.0282391137938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) has been recognized as an effective tool to compress and accelerate models. However, current KD approaches generally suffer from an accuracy drop and/or an excruciatingly long distillation process. In this paper, we tackle the issue by first providing a new insight into a phenomenon that we call the Inter-Block Optimization Entanglement (IBOE), which makes the conventional end-to-end KD approaches unstable with noisy gradients. We then propose StableKD, a novel KD framework that breaks the IBOE and achieves more stable optimization. StableKD distinguishes itself through two operations: Decomposition and Recomposition, where the former divides a pair of teacher and student networks into several blocks for separate distillation, and the latter progressively merges them back, evolving towards end-to-end distillation. We conduct extensive experiments on CIFAR100, Imagewoof, and ImageNet datasets with various teacher-student pairs. Compared to other KD approaches, our simple yet effective StableKD greatly boosts the model accuracy by 1% ~ 18%, speeds up the convergence up to 10 times, and outperforms them with only 40% of the training data.
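The abstract's two operations can be pictured concretely. Below is a minimal, hypothetical PyTorch-style sketch of the Decomposition stage only: teacher and student are split into aligned blocks, each student block is fed the teacher's intermediate features and distilled in isolation, so no gradient crosses a block boundary. The block boundaries, the MSE feature loss, and the training loop are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of StableKD-style "Decomposition": each student block is
# distilled against its teacher counterpart on the teacher's own intermediate
# features, so optimization of one block is decoupled from the others.
import torch
import torch.nn as nn

def decomposed_distill_step(teacher_blocks, student_blocks, x, optimizer):
    """One step of block-wise distillation; gradients stay inside each block."""
    mse = nn.MSELoss()
    losses = []
    feat = x
    for t_blk, s_blk in zip(teacher_blocks, student_blocks):
        with torch.no_grad():
            t_out = t_blk(feat)              # teacher feature for this block
        s_out = s_blk(feat)                  # student block sees the same input
        losses.append(mse(s_out, t_out))     # block-local distillation loss
        feat = t_out                         # next block is driven by teacher features
    loss = torch.stack(losses).sum()
    optimizer.zero_grad()
    loss.backward()                          # no gradient crosses block boundaries
    optimizer.step()
    return loss.item()
```

In the Recomposition stage described in the abstract, adjacent block pairs would be progressively merged and distilled as larger groups until the procedure reduces to ordinary end-to-end distillation.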
Related papers
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
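For context on the entry above: the KL-divergence objective that most KD methods rely on, and that R2KD aims to robustify, is the standard temperature-scaled distillation loss. A minimal sketch follows; R2KD's correlation-distance term and network pruning are not shown, and the hyperparameters are illustrative.

```python
# Standard Hinton-style KD loss: temperature-scaled KL divergence between the
# teacher's and student's softened outputs plus cross-entropy on hard labels.
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                              # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```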
- Grouped Knowledge Distillation for Deep Face Recognition [53.57402723008569]
The light-weight student network has difficulty fitting the target logits due to its low model capacity.
We propose a Grouped Knowledge Distillation (GKD) that retains the Primary-KD and Binary-KD but omits Secondary-KD in the ultimate KD loss calculation.
arXiv Detail & Related papers (2023-04-10T09:04:38Z)
- BD-KD: Balancing the Divergences for Online Knowledge Distillation [11.874952582465601]
We introduce BD-KD (Balanced Divergence Knowledge Distillation), a framework for logit-based online KD.
BD-KD enhances both accuracy and model calibration simultaneously, eliminating the need for post-hoc recalibration techniques.
Our method encourages student-centered training by adjusting the conventional online distillation loss on both the student and teacher sides.
arXiv Detail & Related papers (2022-12-25T22:27:32Z)
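A hedged reading of the BD-KD entry above: "balancing the divergences" suggests weighting forward and reverse KL terms between student and teacher. The sketch below uses a fixed weight `beta`; the paper's actual, possibly per-sample, balancing scheme is not reproduced here.

```python
# Hypothetical balanced forward/reverse KL divergence between student and
# teacher predictions; the weighting is an assumption, not BD-KD's exact rule.
import torch.nn.functional as F

def balanced_divergence(student_logits, teacher_logits, T=4.0, beta=0.5):
    log_s = F.log_softmax(student_logits / T, dim=1)
    log_t = F.log_softmax(teacher_logits / T, dim=1)
    fwd = F.kl_div(log_s, log_t, reduction="batchmean", log_target=True)  # KL(teacher || student)
    rev = F.kl_div(log_t, log_s, reduction="batchmean", log_target=True)  # KL(student || teacher)
    return (beta * fwd + (1.0 - beta) * rev) * (T * T)
```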
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Aligning Logits Generatively for Principled Black-Box Knowledge Distillation [49.43567344782207]
Black-Box Knowledge Distillation (B2KD) is a formulated problem for cloud-to-edge model compression with invisible data and models hosted on the server.
We formalize a two-step workflow consisting of deprivatization and distillation.
We propose a new method Mapping-Emulation KD (MEKD) that distills a black-box cumbersome model into a lightweight one.
arXiv Detail & Related papers (2022-05-21T02:38:16Z)
- Self-Distillation from the Last Mini-Batch for Consistency Regularization [14.388479145440636]
We propose an efficient and reliable self-distillation framework, named Self-Distillation from Last Mini-Batch (DLB).
Our proposed mechanism guides the training stability and consistency, resulting in robustness to label noise.
Experimental results on three classification benchmarks illustrate that our approach can consistently outperform state-of-the-art self-distillation approaches.
arXiv Detail & Related papers (2022-03-30T09:50:24Z)
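The DLB entry above distills the model from its own predictions made one iteration earlier. A rough sketch of that idea follows, assuming the previous batch's samples are simply revisited and matched against their cached soft labels; the paper's batch scheduling and loss weighting may differ.

```python
# Rough sketch of self-distillation from the last mini-batch: soft predictions
# cached at iteration t-1 act as distillation targets at iteration t.
import torch
import torch.nn.functional as F

def dlb_step(model, optimizer, batch, prev_x, prev_soft, T=3.0, lam=1.0):
    x, y = batch
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    if prev_x is not None:
        # Re-forward last iteration's samples and match their cached soft labels.
        cur = F.log_softmax(model(prev_x) / T, dim=1)
        loss = loss + lam * F.kl_div(cur, prev_soft, reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Cache this batch's softened predictions as targets for the next iteration.
    return loss.item(), x.detach(), F.softmax(logits.detach() / T, dim=1)
```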
- Up to 100x Faster Data-free Knowledge Distillation [52.666615987503995]
We introduce FastDFKD, which accelerates data-free knowledge distillation (DFKD) by orders of magnitude.
Unlike prior methods that optimize a set of data independently, we propose to learn a meta-synthesizer that seeks common features.
FastDFKD achieves data synthesis within only a few steps, significantly enhancing the efficiency of data-free training.
arXiv Detail & Related papers (2021-12-12T14:56:58Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
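KDIGA, per the entry above, aligns input gradients between teacher and student. Below is a hedged sketch of such an alignment penalty; the exact distance used on the gradients and its weighting against the usual KD terms are assumptions.

```python
# Hypothetical input-gradient alignment penalty: push the student's gradient
# with respect to the input towards the teacher's, so locally robust behaviour
# can transfer along with the predictions.
import torch
import torch.nn.functional as F

def input_gradient_alignment(student, teacher, x, y):
    x = x.detach().clone().requires_grad_(True)
    s_grad = torch.autograd.grad(
        F.cross_entropy(student(x), y), x, create_graph=True
    )[0]                                      # differentiable w.r.t. student params
    t_grad = torch.autograd.grad(
        F.cross_entropy(teacher(x), y), x
    )[0].detach()                             # teacher gradient is a fixed target
    # Squared L2 distance between input gradients, averaged over the batch.
    return (s_grad - t_grad).pow(2).flatten(1).sum(dim=1).mean()
```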
- Confidence Conditioned Knowledge Distillation [8.09591217280048]
A confidence conditioned knowledge distillation (CCKD) scheme for transferring the knowledge from a teacher model to a student model is proposed.
CCKD leverages the confidence assigned by the teacher model to the correct class to devise sample-specific loss functions and targets.
Empirical evaluations on several benchmark datasets show that CCKD methods achieve at least as much generalization performance levels as other state-of-the-art methods.
arXiv Detail & Related papers (2021-07-06T00:33:25Z)
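One plausible reading of the CCKD entry above is to weight each sample's loss by the probability the teacher assigns to the true class. The sketch below implements that simple weighting; the paper's actual sample-specific targets may be more elaborate.

```python
# Hypothetical confidence-conditioned loss: samples the teacher is confident
# about lean on the teacher's soft targets, while low-confidence samples lean
# on the hard labels instead.
import torch
import torch.nn.functional as F

def cckd_loss(student_logits, teacher_logits, labels, T=4.0):
    with torch.no_grad():
        conf = F.softmax(teacher_logits, dim=1).gather(
            1, labels.unsqueeze(1)
        ).squeeze(1)                          # teacher's probability of the true class
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="none",
    ).sum(dim=1) * (T * T)                    # per-sample distillation term
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    return (conf * kd + (1.0 - conf) * ce).mean()
```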
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
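The observation above suggests the KD loss keeps extracting signal from freshly augmented views long after plain cross-entropy saturates. A minimal sketch of that pairing follows; `augment` is any stochastic transform (random crop, flip, etc.) and is assumed here, as is the form of `kd_loss_fn`.

```python
# Minimal sketch: run KD on a freshly augmented view each iteration so the
# teacher's soft targets keep contributing new information.
import torch

def kd_on_augmented_view(student, teacher, x, y, augment, kd_loss_fn):
    x_aug = augment(x)                 # a new random view of the batch
    with torch.no_grad():
        t_logits = teacher(x_aug)      # teacher labels the same view
    s_logits = student(x_aug)
    return kd_loss_fn(s_logits, t_logits, y)
```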