RdimKD: Generic Distillation Paradigm by Dimensionality Reduction
- URL: http://arxiv.org/abs/2312.08700v1
- Date: Thu, 14 Dec 2023 07:34:08 GMT
- Title: RdimKD: Generic Distillation Paradigm by Dimensionality Reduction
- Authors: Yi Guo, Yiqian He, Xiaoyang Li, Haotong Qin, Van Tung Pham, Yang
Zhang, Shouda Liu
- Abstract summary: Knowledge Distillation (KD) emerges as one of the most promising compression technologies to run advanced deep neural networks on resource-limited devices.
In this work, we propose an abstract and general paradigm for the KD task, referred to as DIMensionality Reduction KD (RdimKD).
RdimKD relies solely on dimensionality reduction, with a very minor modification to the naive L2 loss.
- Score: 16.977144350795488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) emerges as one of the most promising compression
technologies to run advanced deep neural networks on resource-limited devices.
To train a small network (student) under the guidance of a large network
(teacher), the intuitive method is to regularize the student's feature maps or
logits using the teacher's information. However, existing methods either
over-restrict the student by forcing it to learn all of the teacher's
information, which can lead to bad local minima, or rely on elaborate modules
to process and align features, which are complex and lack generality. In this
work, we propose an abstract and general paradigm for the KD task, referred to
as DIMensionality Reduction KD (RdimKD), which relies solely on dimensionality
reduction with a very minor modification to the naive L2 loss. RdimKD simply
uses a projection matrix, optimized during training, to project both the
teacher's and the student's feature maps onto a low-dimensional subspace. In
this simple way, the student not only receives valuable information from the
teacher but also retains enough flexibility to accommodate its limited
capacity. Our extensive empirical findings indicate the effectiveness of RdimKD
across various learning tasks and diverse network architectures.
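Based only on the abstract above, the following is a minimal, hypothetical PyTorch sketch of the idea: project the teacher's and the student's feature maps into a shared low-dimensional subspace with learnable projections and match them with an L2 (MSE) loss. The class name RdimKDLoss, the feature shapes, and the use of plain MSE in place of the paper's "very minor modification to the naive L2 loss" are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a dimensionality-reduction KD loss (not the official RdimKD code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RdimKDLoss(nn.Module):
    def __init__(self, teacher_dim: int, student_dim: int, low_dim: int):
        super().__init__()
        # Learnable projections into a shared low-dimensional subspace;
        # the abstract describes a projection matrix optimized during training.
        self.proj_t = nn.Linear(teacher_dim, low_dim, bias=False)
        self.proj_s = nn.Linear(student_dim, low_dim, bias=False)

    def forward(self, feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
        # feat_s: (batch, N, student_dim), feat_t: (batch, N, teacher_dim),
        # e.g. flattened spatial positions or token features (assumed layout).
        z_s = self.proj_s(feat_s)
        z_t = self.proj_t(feat_t.detach())  # teacher features are not back-propagated through
        # Plain MSE stands in for the "minor modification to the naive L2 loss".
        return F.mse_loss(z_s, z_t)


# Usage sketch: add this projected-feature loss to the student's task loss.
if __name__ == "__main__":
    kd = RdimKDLoss(teacher_dim=1024, student_dim=256, low_dim=64)
    feat_s = torch.randn(8, 196, 256)   # student feature map, flattened spatially
    feat_t = torch.randn(8, 196, 1024)  # teacher feature map, flattened spatially
    loss_kd = kd(feat_s, feat_t)
    print(loss_kd.item())
```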
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z) - CES-KD: Curriculum-based Expert Selection for Guided Knowledge
Distillation [4.182345120164705]
This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD).
CES-KD is built upon the hypothesis that a student network should be guided gradually using a stratified teaching curriculum.
Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty in classifying the image.
arXiv Detail & Related papers (2022-09-15T21:02:57Z) - Knowledge Distillation with Representative Teacher Keys Based on
Attention Mechanism for Image Classification Model Compression [1.503974529275767]
Knowledge distillation (KD) has been recognized as one of the effective methods of model compression for reducing model parameters.
Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK).
Our proposed RTK can effectively improve the classification accuracy of the state-of-the-art attention-based KD method.
arXiv Detail & Related papers (2022-06-26T05:08:50Z) - AutoDistil: Few-shot Task-agnostic Neural Architecture Search for
Distilling Large Language Models [121.22644352431199]
We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model.
Current works train a single SuperLM consisting of millions of subnetworks with weight-sharing.
Experiments on GLUE benchmark against state-of-the-art KD and NAS methods demonstrate AutoDistil to outperform leading compression techniques.
arXiv Detail & Related papers (2022-01-29T06:13:04Z) - Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of any student model that tries to imitate it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z) - Boosting Light-Weight Depth Estimation Via Knowledge Distillation [21.93879961636064]
We propose a lightweight network that can accurately estimate depth maps using minimal computing resources.
We achieve this by designing a compact model architecture that maximally reduces model complexity.
Our method achieves comparable performance to state-of-the-art methods while using only 1% of their parameters.
arXiv Detail & Related papers (2021-05-13T08:42:42Z) - Orderly Dual-Teacher Knowledge Distillation for Lightweight Human Pose
Estimation [1.0323063834827415]
We propose an orderly dual-teacher knowledge distillation (ODKD) framework, which consists of two teachers with different capabilities.
Combining the two teachers, an orderly learning strategy is proposed to promote the student's absorption of knowledge.
Our proposed ODKD can improve the performance of different lightweight models by a large margin, and HRNet-W16 equipped with ODKD achieves state-of-the-art performance for lightweight human pose estimation.
arXiv Detail & Related papers (2021-04-21T08:50:36Z) - Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z) - Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least a 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)