Cross-Architecture Distillation Made Simple with Redundancy Suppression
- URL: http://arxiv.org/abs/2507.21844v1
- Date: Tue, 29 Jul 2025 14:21:40 GMT
- Title: Cross-Architecture Distillation Made Simple with Redundancy Suppression
- Authors: Weijia Zhang, Yuehao Liu, Wu Ran, Chao Ma
- Abstract summary: We describe a simple method for cross-architecture knowledge distillation, where the knowledge transfer is cast into a redundant information suppression formulation. We propose to extract the architecture-agnostic knowledge in heterogeneous representations by reducing the redundant architecture-exclusive information. Our method is devoid of the architecture-specific designs and complex operations in the pioneering method of OFA.
- Score: 8.844066299737845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a simple method for cross-architecture knowledge distillation, where the knowledge transfer is cast into a redundant information suppression formulation. Existing methods introduce sophisticated modules, architecture-tailored designs, and excessive parameters, which impair their efficiency and applicability. We propose to extract the architecture-agnostic knowledge in heterogeneous representations by reducing the redundant architecture-exclusive information. To this end, we present a simple redundancy suppression distillation (RSD) loss, which comprises cross-architecture invariance maximisation and feature decorrelation objectives. To prevent the student from entirely losing its architecture-specific capabilities, we further design a lightweight module that decouples the RSD objective from the student's internal representations. Our method is devoid of the architecture-specific designs and complex operations in the pioneering method of OFA. It outperforms OFA on the CIFAR-100 and ImageNet-1k benchmarks with only a fraction of its parameter overhead, which highlights its potential as a simple and strong baseline for the cross-architecture distillation community.
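The abstract describes the RSD loss as a combination of cross-architecture invariance maximisation and feature decorrelation, decoupled from the student's internal representations through a lightweight module. The sketch below is a hedged, assumption-laden illustration in PyTorch of how such a loss could be structured; it is not the authors' implementation. The projection heads, the batch normalisation of embeddings, the cross-correlation formulation, and the weighting term `lambda_decorr` are all hypothetical choices made only to make the two objectives concrete.

```python
# Minimal sketch of a redundancy-suppression style distillation loss.
# All names (RSDLossSketch, projector dims, lambda_decorr) are illustrative
# assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class RSDLossSketch(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int,
                 embed_dim: int = 256, lambda_decorr: float = 5e-3):
        super().__init__()
        # Lightweight projection heads that decouple the objective from the
        # student's internal representation (our reading of the abstract's
        # "lightweight module"; the actual design may differ).
        self.student_proj = nn.Linear(student_dim, embed_dim)
        self.teacher_proj = nn.Linear(teacher_dim, embed_dim)
        self.lambda_decorr = lambda_decorr

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # f_student: (batch, student_dim), f_teacher: (batch, teacher_dim)
        zs = self.student_proj(f_student)
        zt = self.teacher_proj(f_teacher.detach())  # no gradient into the teacher

        # Standardise each feature dimension across the batch.
        zs = (zs - zs.mean(0)) / (zs.std(0) + 1e-6)
        zt = (zt - zt.mean(0)) / (zt.std(0) + 1e-6)

        # Cross-correlation matrix between student and teacher embeddings.
        n = zs.size(0)
        c = (zs.T @ zt) / n  # (embed_dim, embed_dim)

        # Invariance maximisation: diagonal entries pushed towards 1,
        # i.e. agreement on architecture-agnostic knowledge.
        invariance = (torch.diagonal(c) - 1).pow(2).sum()

        # Redundancy suppression: off-diagonal correlations pushed towards 0,
        # i.e. architecture-exclusive, redundant information is reduced.
        off_diag = c - torch.diag(torch.diagonal(c))
        decorrelation = off_diag.pow(2).sum()

        return invariance + self.lambda_decorr * decorrelation


# Hypothetical usage with mismatched feature dimensions (e.g. CNN student,
# Transformer teacher); values are placeholders.
loss_fn = RSDLossSketch(student_dim=512, teacher_dim=768)
loss = loss_fn(torch.randn(32, 512), torch.randn(32, 768))
```

Under this reading, the diagonal term rewards cross-architecture agreement between projected teacher and student features, while the off-diagonal term suppresses redundant correlations; how the paper actually balances and attaches these objectives may differ.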
Related papers
- Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation [53.16213723669751]
Large-scale models (LSMs) can be an effective framework for semantic representation and understanding. However, their direct deployment is often hindered by high computational complexity and resource requirements. This paper proposes a novel knowledge distillation based semantic communication framework.
arXiv Detail & Related papers (2025-08-04T07:47:18Z) - Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models [3.287942619833188]
We systematically evaluate the transferability of knowledge distillation from a Transformer teacher model to eight subquadratic student architectures. Our study investigates which subquadratic model can most effectively approximate the teacher model's learned representations through knowledge distillation.
arXiv Detail & Related papers (2025-04-19T17:49:52Z) - Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently via just one model has become increasingly significant. Our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations. To fuse the degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z) - Relation Extraction with Instance-Adapted Predicate Descriptions [9.021267901894912]
Relation extraction plays a major role in downstream applications such as knowledge discovery and question answering. In this paper, we revisit fine-tuning such smaller models using a novel dual-encoder architecture with a joint contrastive and cross-entropy loss. Our approach achieved F1 score improvements ranging from 1% to 2% over state-of-the-art methods with a simple but elegant formulation.
arXiv Detail & Related papers (2025-03-22T15:36:41Z) - A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal [68.0573194557999]
Single Image Reflection Removal (SIRR) is a canonical blind source separation problem. We propose a novel Deep Exclusion unfolding Network (DExNet) for SIRR. DExNet is constructed by unfolding and parameterizing a simple iterative Sparse and Auxiliary Feature Update (i-SAFU) algorithm.
arXiv Detail & Related papers (2025-03-03T07:54:27Z) - An Architectural Approach to Enhance Deep Long-Tailed Learning [5.214135587370722]
We introduce Long-Tailed Differential Architecture Search (LTDAS). We conduct extensive experiments to explore architectural components that demonstrate better performance on long-tailed data. This ensures that the architecture obtained through our search process incorporates superior components.
arXiv Detail & Related papers (2024-11-09T07:19:56Z) - One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z) - Heterogeneous Continual Learning [88.53038822561197]
We propose a novel framework to tackle the continual learning (CL) problem with changing network architectures.
We build on top of the distillation family of techniques and modify it to a new setting where a weaker model takes the role of a teacher.
We also propose Quick Deep Inversion (QDI) to recover prior task visual features to support knowledge transfer.
arXiv Detail & Related papers (2023-06-14T15:54:42Z) - $\alpha$ DARTS Once More: Enhancing Differentiable Architecture Search by Masked Image Modeling [25.75814720792934]
Differentiable architecture search (DARTS) has been a mainstream direction in automatic machine learning.
We propose to additionally inject semantic information by formulating a patch recovery approach.
Our method surpasses all previous DARTS variants and achieves state-of-the-art results on CIFAR-10, CIFAR-100, and ImageNet.
arXiv Detail & Related papers (2022-11-18T09:07:19Z) - Pixel Distillation: A New Knowledge Distillation Scheme for Low-Resolution Image Recognition [124.80263629921498]
We propose Pixel Distillation that extends knowledge distillation into the input level while simultaneously breaking architecture constraints.
Such a scheme can achieve flexible cost control for deployment, as it allows the system to adjust both network architecture and image quality according to the overall requirement of resources.
arXiv Detail & Related papers (2021-12-17T14:31:40Z) - Multi-Stage Progressive Image Restoration [167.6852235432918]
We propose a novel synergistic design that can optimally balance these competing goals.
Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs.
The resulting tightly interlinked multi-stage architecture, named MPRNet, delivers strong performance gains on ten datasets.
arXiv Detail & Related papers (2021-02-04T18:57:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.