Related papers: Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures

URL: http://arxiv.org/abs/2502.06189v1
Date: Mon, 10 Feb 2025 06:41:20 GMT
Title: Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures
Authors: Yaoxin Yang, Peng Ye, Weihao Lin, Kangcong Li, Yan Wen, Jia Hao, Tao Chen
Abstract summary: Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) improves student model performance, with gains of up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet.
Score: 6.231548250160585
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Heterogeneous distillation is an effective way to transfer knowledge from cross-architecture teacher models to student models. However, existing heterogeneous distillation methods do not take full advantage of the dark knowledge hidden in the teacher's output, limiting their performance. To this end, we propose a novel framework named Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) to unleash the potential of relational distillation in heterogeneous distillation. Concretely, we first introduce Decoupled Finegrained Relation Alignment (DFRA) at both the logit and feature levels to balance the trade-off between distilled dark knowledge and the confidence in the correct category of the heterogeneous teacher model. Then, a Multi-Scale Dynamic Fusion (MSDF) module is applied to dynamically fuse the projected logits of multiscale features at different stages in the student model, further improving the performance of our method at the feature level. We verify our method on four architectures (CNNs, Transformers, MLPs and Mambas) and two datasets (CIFAR-100 and Tiny-ImageNet). Compared with the best available method, our MLDR-KD improves student model performance with gains of up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet, showing robustness and generality in heterogeneous distillation. Code will be released soon.
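
Illustrative note: the "dark knowledge" mentioned in the abstract refers to the probability mass a teacher assigns to non-target classes. The sketch below shows the generic temperature-scaled logit distillation loss commonly used as a baseline. It is not the MLDR-KD method; the abstract does not specify DFRA or MSDF in enough detail to sketch them. The function names, the temperature value, and the example logits are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures flatten the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Generic logit distillation loss: T^2 * KL(teacher || student).

    Both distributions are softened with the same temperature, so the
    student is also pushed to match the teacher's non-target probabilities.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return temperature ** 2 * kl


# Hypothetical logits for a 4-class problem (class 0 is the correct one).
teacher = [8.0, 2.0, 1.0, -1.0]
student = [5.0, 0.5, 2.5, 0.0]

print("non-target mass at T=1:", 1 - softmax(teacher, 1.0)[0])
print("non-target mass at T=4:", 1 - softmax(teacher, 4.0)[0])
print("distillation loss:", kd_loss(student, teacher))
```

Raising the temperature exposes more of the teacher's non-target mass, which is the signal that the abstract argues existing heterogeneous methods underuse.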

Related papers

Quantification of Large Language Model Distillation [22.680566179355335]
We propose a framework to evaluate and quantify model distillation. Our method addresses two key aspects: (1) Identifying identity cognition contradictions to assess discrepancies in how models perceive and represent identity-related information, and (2) Analyzing multi-granularity response similarities across models to measure the extent of homogenization.
arXiv Detail & Related papers (2025-01-22T03:57:52Z)
TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant [52.0297393822012]
We introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students. Within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions. Our proposed method is evaluated across some homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, spatial KDs.
arXiv Detail & Related papers (2024-10-16T08:02:49Z)
AMD: Automatic Multi-step Distillation of Large-scale Vision Models [39.70559487432038]
We present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance.
arXiv Detail & Related papers (2024-07-05T01:35:42Z)
One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme. Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family. We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z)
EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model. We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
Aligning Logits Generatively for Principled Black-Box Knowledge Distillation [49.43567344782207]
Black-Box Knowledge Distillation (B2KD) is a formulated problem for cloud-to-edge model compression with invisible data and models hosted on the server. We formalize a two-step workflow consisting of deprivatization and distillation. We propose a new method Mapping-Emulation KD (MEKD) that distills a black-box cumbersome model into a lightweight one.
arXiv Detail & Related papers (2022-05-21T02:38:16Z)
ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders. Our method 1) introduces a self on-the-fly distillation method that can effectively distill late interaction (i.e., ColBERT) to vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
arXiv Detail & Related papers (2022-05-18T18:05:13Z)
Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks. Although network performance is boosted, transformers often require more computational resources. We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models. We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network. We also use a Diversification (CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.