Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures
- URL: http://arxiv.org/abs/2405.18524v1
- Date: Tue, 28 May 2024 18:44:42 GMT
- Title: Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures
- Authors: Hongjun Wu, Li Xiao, Xingkuo Zhang, Yining Miao
- Abstract summary: We propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation.
Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models.
We show that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100.
- Score: 4.119589507611071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation is commonly employed to compress neural networks, reducing inference costs and memory footprint. In the scenario of homogeneous architectures, feature-based methods have been widely validated for their effectiveness. However, in scenarios where the teacher and student models are of heterogeneous architectures, the inherent differences in feature representation significantly degrade the performance of these methods. Recent studies have highlighted that low-frequency components constitute the majority of image features. Motivated by this, we propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation between heterogeneous architectures. Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models, aligning them in a compact space to overcome architectural disparities. Moreover, leveraging the intrinsic pairing characteristic of the teacher-student framework, we design an innovative sample-level contrastive learning framework that adeptly restructures the constraints of within-sample feature similarity and between-sample feature divergence into a contrastive learning task. This strategy enables the student model to capitalize on intra-sample feature congruence while simultaneously enhancing the discrimination of features among disparate samples. Consequently, our LFCC framework accurately captures the commonalities in feature representation across heterogeneous architectures. Extensive evaluations and empirical analyses across three architectures (CNNs, Transformers, and MLPs) demonstrate that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100. All code will be publicly available.
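To make the two ideas concrete, here is a minimal PyTorch sketch of the core loss, assuming average pooling as the low-pass filter and per-model linear projectors (`proj_t`, `proj_s`) into the shared compact space; the filter choice, scales, and temperature are illustrative guesses, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def low_pass(feat, k):
    # Approximate low-pass filtering with average pooling at scale k
    # (a simple stand-in; the paper's exact filter design may differ).
    return F.avg_pool2d(feat, k, stride=1, padding=k // 2)

def lfcc_loss(f_t, f_s, proj_t, proj_s, scales=(3, 5, 7), tau=0.07):
    """Contrastive alignment of multi-scale low-frequency components.

    f_t: teacher features (B, Ct, H, W); f_s: student features (B, Cs, H, W).
    proj_t / proj_s: per-model projectors into a shared compact space.
    Positive pairs: teacher/student features of the same sample;
    negatives: all other samples in the batch (InfoNCE form).
    """
    loss = 0.0
    for k in scales:
        zt = F.normalize(proj_t(low_pass(f_t, k).mean(dim=(2, 3))), dim=1)
        zs = F.normalize(proj_s(low_pass(f_s, k).mean(dim=(2, 3))), dim=1)
        logits = zs @ zt.t() / tau  # (B, B) similarity matrix
        labels = torch.arange(zs.size(0), device=zs.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(scales)
```

The cross-entropy over the batch similarity matrix simultaneously pulls same-sample teacher-student pairs together and pushes different samples apart, which is the standard InfoNCE form of the within-sample similarity / between-sample divergence constraints described in the abstract.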
Related papers
- Systematic Characterization of Minimal Deep Learning Architectures: A Unified Analysis of Convergence, Pruning, and Quantization [6.49583548940407]
Deep learning networks excel at classification, yet identifying minimal architectures that reliably solve a task remains challenging.
We present a computational methodology for exploring and analyzing the relationships among convergence, pruning, and quantization.
Our initial results show that, despite architectural diversity, performance is largely invariant and learning dynamics consistently exhibit three regimes: unstable, learning, and overfitting.
arXiv Detail & Related papers (2026-01-25T20:31:10Z)
- Implicit Neural Representation-Based Continuous Single Image Super Resolution: An Empirical Study [50.15623093332659]
Implicit neural representation (INR) has become the standard approach for arbitrary-scale image super-resolution (ASSR).
We compare existing techniques across diverse settings and present aggregated performance results on multiple image quality metrics.
We examine a new loss function that penalizes intensity variations while preserving edges, textures, and finer details during training.
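The summary does not spell out the paper's exact loss; a common way to penalize intensity variations while sparing edges is an edge-weighted smoothness term, sketched below (the weighting scheme and `alpha` are assumptions).

```python
import torch

def edge_preserving_smoothness(pred, target, alpha=10.0):
    """Penalize intensity variations in the prediction, down-weighted where
    the target itself has strong gradients (edges, textures, fine detail).

    pred, target: (B, C, H, W) tensors in [0, 1].
    """
    # First-order differences of the prediction.
    dx_p = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dy_p = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    # Edge strength of the target controls how much variation is allowed.
    dx_t = (target[..., :, 1:] - target[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_t = (target[..., 1:, :] - target[..., :-1, :]).abs().mean(1, keepdim=True)
    wx, wy = torch.exp(-alpha * dx_t), torch.exp(-alpha * dy_t)
    return (wx * dx_p).mean() + (wy * dy_p).mean()
```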
arXiv Detail & Related papers (2026-01-25T07:09:20Z)
- CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval [54.15776146365823]
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text.
We propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components.
arXiv Detail & Related papers (2026-01-07T09:21:38Z)
- Deep Clustering with Associative Memories [28.46602918457589]
We propose a novel loss function utilizing energy-based dynamics via Associative Memories to formulate a new deep clustering method, DCAM.
Our experiments showcase the advantage of DCAM, producing improved clustering quality for various architecture choices.
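As a rough illustration of energy-based dynamics via Associative Memories, the log-sum-exp energy of modern Hopfield networks can serve as a soft clustering objective over learnable prototypes; this is a loose sketch in the spirit of the summary, not DCAM's actual loss.

```python
import torch

def associative_memory_energy(x, prototypes, beta=1.0):
    """Log-sum-exp energy of modern Hopfield networks: low when a point x
    lies near one of the stored patterns (here, cluster prototypes).

    x: (B, D) embeddings; prototypes: (K, D) learnable cluster memories.
    Minimizing the mean energy over a dataset yields a soft clustering
    objective.
    """
    sims = beta * x @ prototypes.t()  # (B, K) similarity to each memory
    return (-torch.logsumexp(sims, dim=1)).mean()
```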
arXiv Detail & Related papers (2026-01-02T19:21:06Z)
- Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution [76.66229730098759]
In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models.
We propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution.
We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert.
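A minimal sketch of treating each LoRA rank as an independently gated expert; here the router is conditioned on the input itself rather than a dedicated degradation embedding, which is a simplifying assumption, and the module name is hypothetical.

```python
import torch
import torch.nn as nn

class MixtureOfRanksLinear(nn.Module):
    """LoRA-style adapter where each of the r ranks acts as an expert,
    gated per-sample by a routing vector. A sketch of the idea, not the
    paper's implementation.
    """
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base  # frozen pre-trained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.router = nn.Linear(base.in_features, r)  # one gate per rank

    def forward(self, x):
        gate = torch.sigmoid(self.router(x))  # (..., r) per-rank gates
        delta = (x @ self.A.t()) * gate       # gate each rank's contribution
        return self.base(x) + delta @ self.B.t()
```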
arXiv Detail & Related papers (2025-11-20T04:11:44Z)
- UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations [5.382357091398666]
Unified Heterogeneous Knowledge Distillation (UHKD) is proposed as a framework that leverages intermediate features in the frequency domain for cross-architecture transfer.
Experiments on CIFAR-100 and ImageNet-1K demonstrate gains of 5.59% and 0.83% over the latest method.
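The general recipe of comparing intermediate features in the frequency domain can be sketched as an FFT-magnitude matching loss; the transform, adapter, and distance below are illustrative choices, not UHKD's exact modules.

```python
import torch
import torch.nn.functional as F

def frequency_feature_loss(f_t, f_s, adapter):
    """Align intermediate features in the frequency domain.

    f_t: teacher feature map (B, Ct, H, W); f_s: student map (B, Cs, H, W).
    adapter: module mapping student channels to teacher channels
    (hypothetical helper for heterogeneous pairs).
    """
    f_s = adapter(f_s)
    if f_s.shape[-2:] != f_t.shape[-2:]:
        f_s = F.interpolate(f_s, size=f_t.shape[-2:], mode="bilinear",
                            align_corners=False)
    # Compare magnitude spectra, which discard spatial phase mismatch.
    mag_t = torch.fft.rfft2(f_t, norm="ortho").abs()
    mag_s = torch.fft.rfft2(f_s, norm="ortho").abs()
    return F.l1_loss(mag_s, mag_t)
```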
arXiv Detail & Related papers (2025-10-28T06:41:43Z)
- Manifold-aware Representation Learning for Degradation-agnostic Image Restoration [135.90908995927194]
Image Restoration (IR) aims to recover high-quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low-light conditions.
We present MIRAGE, a unified framework for all-in-one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches.
This modular decomposition significantly improves generalization and efficiency across diverse degradations.
arXiv Detail & Related papers (2025-05-24T12:52:10Z)
- Learning and Evaluating Hierarchical Feature Representations [3.770103075126785]
We propose a novel framework, Hierarchical Composition of Orthogonal Subspaces (Hier-COS).
Hier-COS learns to map deep feature embeddings into a vector space that is, by design, consistent with the structure of a given taxonomy tree.
We demonstrate that Hier-COS achieves state-of-the-art hierarchical performance across all the datasets while simultaneously beating top-1 accuracy in all but one case.
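One toy way to build taxonomy-consistent targets from orthogonal directions: assign each taxonomy node an orthonormal vector and define a class target as the normalized sum along its root-to-leaf path. Hier-COS learns this mapping rather than fixing it; the construction below, including `taxonomy_targets` and the `paths` input, is hypothetical.

```python
import torch
import torch.nn.functional as F

def taxonomy_targets(paths, dim):
    """Build one unit target vector per class: each taxonomy node gets an
    orthonormal direction, and a class target is the normalized sum of the
    directions on its root-to-leaf path.

    paths: list of node-id tuples, one per class. Assumes dim >= #nodes.
    """
    num_nodes = max(n for p in paths for n in p) + 1
    basis = torch.linalg.qr(torch.randn(dim, num_nodes)).Q.t()  # (nodes, dim)
    targets = torch.stack([basis[list(p)].sum(0) for p in paths])
    return F.normalize(targets, dim=1)

# Usage with embeddings z (B, dim) and labels y (B,):
#   T = taxonomy_targets(paths, dim)
#   loss = 1 - F.cosine_similarity(z, T[y]).mean()
```

Classes sharing ancestors then share target components, so embedding distances reflect taxonomy distances by construction.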
arXiv Detail & Related papers (2025-03-10T20:59:41Z)
- Shortcut Learning Susceptibility in Vision Classifiers [3.004632712148892]
Shortcut learning occurs when machine learning models exploit spurious correlations in data instead of capturing meaningful features.
This phenomenon is prevalent across various machine learning applications, including vision, natural language processing, and speech recognition.
We systematically evaluate vision classifiers by introducing deliberate shortcuts into the dataset that are positionally correlated with class labels.
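A plausible version of such a positional shortcut: stamp a bright patch whose location is a deterministic function of the class label, so position alone suffices to classify. The patch size and placement rule below are assumptions, not necessarily the paper's design.

```python
import torch

def add_positional_shortcut(images, labels, patch=4):
    """Inject a synthetic shortcut: a bright patch whose grid position is
    determined by the class label.

    images: (B, C, H, W) in [0, 1]; labels: (B,) integer class ids.
    """
    images = images.clone()
    B, C, H, W = images.shape
    cols, rows = W // patch, H // patch
    for i in range(B):
        k = int(labels[i]) % (cols * rows)     # label -> grid cell
        y, x = (k // cols) * patch, (k % cols) * patch
        images[i, :, y:y + patch, x:x + patch] = 1.0
    return images
```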
arXiv Detail & Related papers (2025-02-13T10:25:52Z)
- Contrastive Representation Distillation via Multi-Scale Feature Decoupling [0.49157446832511503]
Knowledge distillation is a technique aimed at enhancing the performance of a small student network without increasing its parameter size.
We propose MSDCRD, a contrastive representation distillation approach that explicitly performs multi-scale decoupling within the feature space.
Our method not only achieves superior performance with homogeneous models but also enables efficient feature knowledge transfer across a variety of heterogeneous teacher-student pairs.
arXiv Detail & Related papers (2025-02-09T10:03:18Z)
- Feature-based One-For-All: A Universal Framework for Heterogeneous Knowledge Distillation [28.722795943076306]
Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model.
We introduce a feature-based one-for-all (FOFA) KD framework to enable feature distillation across diverse architectures.
Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model's learning process.
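A rough sketch of a prompt tuning block with student feedback: learnable prompt tokens, shifted by a projection of pooled student features, are prepended to the teacher's token sequence. The module name and conditioning scheme are hypothetical.

```python
import torch
import torch.nn as nn

class FeedbackPrompts(nn.Module):
    """Learnable prompts injected into a (frozen) teacher block and
    conditioned on student features, so teacher representations can
    adapt toward the student's learning process.
    """
    def __init__(self, num_prompts, dim, student_dim):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))
        self.cond = nn.Linear(student_dim, dim)  # student-feedback path

    def forward(self, teacher_tokens, student_feat):
        # teacher_tokens: (B, N, dim); student_feat: (B, student_dim)
        p = self.prompts + self.cond(student_feat).unsqueeze(1)  # (B, P, dim)
        return torch.cat([p, teacher_tokens], dim=1)
```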
arXiv Detail & Related papers (2025-01-15T15:56:06Z)
- TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant [52.0297393822012]
We introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students.
Within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions.
Our proposed method is evaluated across homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs.
arXiv Detail & Related papers (2024-10-16T08:02:49Z)
- On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier [20.17288970927518]
We study the similarity of representations between the hidden layers of individual transformers.
We propose an aligned training approach to enhance the similarity between internal representations.
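One simple form of such aligned training: apply a single shared classifier head to several hidden layers and average the losses, which pushes internal representations toward mutual compatibility. The sketch assumes pooled per-layer features and is one plausible reading of the summary.

```python
import torch.nn as nn

def multi_exit_loss(hidden_states, shared_head, labels,
                    criterion=nn.CrossEntropyLoss()):
    """Aligned training with a single classifier over multiple exits.

    hidden_states: list of (B, D) pooled features from different layers.
    shared_head: one classifier reused at every exit.
    """
    losses = [criterion(shared_head(h), labels) for h in hidden_states]
    return sum(losses) / len(losses)
```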
arXiv Detail & Related papers (2024-06-20T16:41:09Z)
- One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z)
- Enhancing Representations through Heterogeneous Self-Supervised Learning [61.40674648939691]
We propose Heterogeneous Self-Supervised Learning (HSSL), which forces a base model to learn from an auxiliary head whose architecture differs from that of the base model.
HSSL endows the base model with new representational characteristics without structural changes.
HSSL is compatible with various self-supervised methods and achieves superior performance on a range of downstream tasks.
arXiv Detail & Related papers (2023-10-08T10:44:05Z)
- Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift [14.641747166801133]
Multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved remarkable success in learning representations that are robust to distribution shift.
We identify two mechanisms behind MMCL's robustness: intra-class contrasting and inter-class feature sharing.
We theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions.
arXiv Detail & Related papers (2023-10-08T02:25:52Z)
- Ultra Sharp: Study of Single Image Super Resolution using Residual Dense Network [0.15229257192293202]
Single Image Super Resolution (SISR) has long been an interesting, ill-posed problem in computer vision.
Traditional super-resolution imaging approaches involve reconstruction- and learning-based methods.
This study examines the Residual Dense Network architecture proposed by Zhang et al.
arXiv Detail & Related papers (2023-04-21T10:32:24Z)
- Demystify Transformers & Convolutions in Modern Image Deep Networks [82.32018252867277]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs.
arXiv Detail & Related papers (2022-11-10T18:59:43Z)
- Spatio-Temporal Representation Factorization for Video-based Person Re-Identification [55.01276167336187]
We propose a Spatio-Temporal Representation Factorization (STRF) module for re-ID.
STRF is a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID.
We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results.
arXiv Detail & Related papers (2021-07-25T19:29:37Z)
- Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition [59.29808528182607]
Cross-resolution face recognition (CRFR) is important in intelligent surveillance and biometric forensics.
Existing shallow learning-based and deep learning-based methods focus on mapping the HR-LR face pairs into a joint feature space.
In this study, we aim to fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR.
arXiv Detail & Related papers (2021-03-25T14:03:42Z)
- Out-of-distribution Generalization via Partial Feature Decorrelation [72.96261704851683]
We present a novel Partial Feature Decorrelation Learning (PFDL) algorithm, which jointly optimizes a feature decomposition network and the target image classification model.
The experiments on real-world datasets demonstrate that our method can improve the backbone model's accuracy on OOD image classification datasets.
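As a stand-in for the learned decomposition, a partial decorrelation penalty can drive the covariance of a chosen feature subset toward diagonal; the fixed split below is a simplification of PFDL's jointly optimized decomposition network.

```python
import torch

def partial_decorrelation_penalty(feats, num_decorrelated):
    """Penalize correlations among a subset of feature dimensions by
    pushing off-diagonal covariance entries toward zero.

    feats: (B, D) batch of features; num_decorrelated: size of the
    (here, fixed) subset to decorrelate.
    """
    z = feats[:, :num_decorrelated]
    z = z - z.mean(0, keepdim=True)            # center the batch
    cov = (z.t() @ z) / (z.size(0) - 1)        # sample covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum()
```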
arXiv Detail & Related papers (2020-07-30T05:48:48Z)