Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free
Continual Learning
- URL: http://arxiv.org/abs/2308.09544v3
- Date: Sat, 4 Nov 2023 14:06:24 GMT
- Title: Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free
Continual Learning
- Authors: Filip Szatkowski, Mateusz Pyla, Marcin Przewięźlikowski,
Sebastian Cygert, Bartłomiej Twardowski, Tomasz Trzciński
- Abstract summary: We investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy.
KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks.
Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training.
- Score: 14.379472108242235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we investigate exemplar-free class incremental learning (CIL)
with knowledge distillation (KD) as a regularization strategy, aiming to
prevent forgetting. KD-based methods are successfully used in CIL, but they
often struggle to regularize the model without access to exemplars of the
training data from previous tasks. Our analysis reveals that this issue
originates from substantial representation shifts in the teacher network when
dealing with out-of-distribution data. This causes large errors in the KD loss
component, leading to performance degradation in CIL models. Inspired by recent
test-time adaptation methods, we introduce Teacher Adaptation (TA), a method
that concurrently updates the teacher and the main models during incremental
training. Our method seamlessly integrates with KD-based CIL approaches and
allows for consistent enhancement of their performance across multiple
exemplar-free CIL benchmarks. The source code for our method is available at
https://github.com/fszatkowski/cl-teacher-adaptation.
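The effect described in the abstract can be sketched in a toy setting: when new-task data is out-of-distribution for a frozen teacher, its stale normalization statistics inflate the KD loss, and refreshing those statistics on the incoming batch (the Teacher Adaptation idea) removes that spurious error. The model below is a hypothetical minimal stand-in, not the paper's actual implementation (see the linked repository for that).

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T - (z / T).max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=1).mean())

class TinyNet:
    """Linear classifier with batch-norm-style running statistics."""
    def __init__(self, dim, classes, rng):
        self.W = rng.normal(scale=0.1, size=(dim, classes))
        self.mean = np.zeros(dim)  # running mean
        self.var = np.ones(dim)    # running variance

    def adapt_stats(self, x, momentum=0.1):
        # Test-time-adaptation-style update of normalization statistics.
        self.mean = (1 - momentum) * self.mean + momentum * x.mean(axis=0)
        self.var = (1 - momentum) * self.var + momentum * x.var(axis=0)

    def logits(self, x):
        x_norm = (x - self.mean) / np.sqrt(self.var + 1e-5)
        return x_norm @ self.W

rng = np.random.default_rng(0)
teacher = TinyNet(8, 4, rng)
student = TinyNet(8, 4, rng)
student.W = teacher.W.copy()  # a new task starts from the previous model

# New-task data comes from a shifted distribution,
# i.e. out-of-distribution for the frozen teacher.
x_new = rng.normal(loc=3.0, size=(64, 8))

# The student trains on the new task, so its statistics drift;
# a frozen teacher's do not, inflating the KD loss.
student.adapt_stats(x_new)
kd_frozen = kd_loss(student.logits(x_new), teacher.logits(x_new))

# Teacher Adaptation: update the teacher on the same incoming batch.
teacher.adapt_stats(x_new)
kd_adapted = kd_loss(student.logits(x_new), teacher.logits(x_new))
# kd_adapted < kd_frozen: the spurious KD error from the shift shrinks
```

In this sketch the teacher and student share weights, so the entire KD error comes from the statistics mismatch; adapting the teacher on the same batch eliminates it, which is the regularization failure mode the paper attributes to representation shift.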
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Relational Representation Distillation [6.24302896438145]
We introduce Relational Representation Distillation (RRD) to explore and reinforce relationships between teacher and student models.
Inspired by self-supervised learning principles, it uses a relaxed contrastive loss that focuses on similarity rather than exact replication.
Our approach demonstrates superior performance on CIFAR-100 and ImageNet ILSVRC-2012 and sometimes even outperforms the teacher network when combined with KD.
arXiv Detail & Related papers (2024-07-16T14:56:13Z) - Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - On-Policy Distillation of Language Models: Learning from Self-Generated
Mistakes [44.97759066341107]
Generalized Knowledge Distillation (GKD) trains the student on its self-generated output sequences by leveraging feedback from the teacher.
We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks.
arXiv Detail & Related papers (2023-06-23T17:56:26Z) - ALM-KD: Knowledge Distillation with noisy labels via adaptive loss
mixing [25.49637460661711]
Knowledge distillation is a technique where the outputs of a pretrained model are used for training a student model in a supervised setting.
We tackle this problem via the use of an adaptive loss mixing scheme during KD.
We demonstrate performance gains obtained using our approach in the standard KD setting as well as in multi-teacher and self-distillation settings.
arXiv Detail & Related papers (2022-02-07T14:53:22Z) - Confidence Conditioned Knowledge Distillation [8.09591217280048]
We propose a confidence-conditioned knowledge distillation (CCKD) scheme for transferring knowledge from a teacher model to a student model.
CCKD leverages the confidence the teacher model assigns to the correct class to devise sample-specific loss functions and targets.
Empirical evaluations on several benchmark datasets show that CCKD methods achieve at least as much generalization performance levels as other state-of-the-art methods.
arXiv Detail & Related papers (2021-07-06T00:33:25Z) - Always Be Dreaming: A New Approach for Data-Free Class-Incremental
Learning [73.24988226158497]
We consider the high-impact problem of Data-Free Class-Incremental Learning (DFCIL).
We propose a novel incremental distillation strategy for DFCIL, contributing a modified cross-entropy training and importance-weighted feature distillation.
Our method results in up to a 25.1% increase in final task accuracy (absolute difference) compared to SOTA DFCIL methods for common class-incremental benchmarks.
arXiv Detail & Related papers (2021-06-17T17:56:08Z) - Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all information) and is not responsible for any consequences of its use.