Contrastive Representation Distillation via Multi-Scale Feature Decoupling
- URL: http://arxiv.org/abs/2502.05835v2
- Date: Thu, 05 Jun 2025 03:15:28 GMT
- Title: Contrastive Representation Distillation via Multi-Scale Feature Decoupling
- Authors: Cuipeng Wang, Tieyuan Chen, Haipeng Wang
- Abstract summary: Knowledge distillation is a technique aimed at enhancing the performance of a small student network without increasing its parameter size. We propose MSDCRD, a contrastive representation distillation approach that explicitly performs multi-scale decoupling within the feature space. Our method not only achieves superior performance in homogeneous models but also enables efficient feature knowledge transfer across a variety of heterogeneous teacher-student pairs.
- Score: 0.49157446832511503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is a technique aimed at enhancing the performance of a small student network, without increasing its parameter count, by transferring knowledge from a large, pre-trained teacher network. In the feature space, different local regions within an individual global feature map often encode distinct yet interdependent semantic information. Previous methods, however, mainly focus on transferring global feature knowledge and neglect to decouple the interdependent local regions within an individual global feature, which often results in suboptimal performance. To address this limitation, we propose MSDCRD, a novel contrastive representation distillation approach that explicitly performs multi-scale decoupling within the feature space. MSDCRD employs multi-scale sliding-window pooling to effectively capture representations at various granularities; in conjunction with sample categorization, this facilitates efficient multi-scale feature decoupling and, when integrated with a novel and effective contrastive loss function, forms the core of MSDCRD. Feature representations differ significantly across network architectures, and this divergence becomes more pronounced in heterogeneous models, rendering feature distillation particularly challenging. Despite this, our method not only achieves superior performance with homogeneous models but also enables efficient feature knowledge transfer across a variety of heterogeneous teacher-student pairs, highlighting its strong generalizability. Moreover, its plug-and-play, parameter-free nature enables flexible integration with different visual tasks. Extensive experiments on different visual benchmarks consistently confirm the superiority of our method in enhancing the performance of student models.
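For readers who think in code, the sketch below illustrates the mechanism the abstract describes: pool teacher and student feature maps at several granularities, then apply an InfoNCE-style contrastive loss in which the matching teacher region of the same sample is the positive and the other samples in the batch are negatives. This is a hedged approximation, not the authors' implementation: adaptive average pooling stands in for their sliding-window pooling, the sample-categorization step and their exact loss are omitted, and the grid sizes, projection heads, and temperature are illustrative assumptions.

```python
# Hedged PyTorch sketch of multi-scale contrastive feature distillation.
# NOT the authors' code: grid sizes, projections, and temperature are
# illustrative, and adaptive pooling approximates sliding-window pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_scale_pool(feat, grid_sizes=(1, 2, 4)):
    """Pool a feature map [B, C, H, W] into region descriptors at several
    granularities; returns a list of [B, s*s, C] tensors, one per scale."""
    regions = []
    for s in grid_sizes:
        pooled = F.adaptive_avg_pool2d(feat, s)              # [B, C, s, s]
        regions.append(pooled.flatten(2).transpose(1, 2))    # [B, s*s, C]
    return regions

def msd_contrastive_loss(f_s, f_t, proj_s, proj_t,
                         grid_sizes=(1, 2, 4), temperature=0.07):
    """f_s/f_t: student/teacher feature maps; proj_s/proj_t: linear heads
    mapping each network's channels to a shared embedding dimension."""
    loss = 0.0
    for r_s, r_t in zip(multi_scale_pool(f_s, grid_sizes),
                        multi_scale_pool(f_t.detach(), grid_sizes)):
        z_s = F.normalize(proj_s(r_s), dim=-1)               # [B, R, D]
        z_t = F.normalize(proj_t(r_t), dim=-1)               # [B, R, D]
        B, R, _ = z_s.shape
        # For region r, compare student sample b against all teacher samples;
        # the teacher region of the same sample (index b) is the positive.
        logits = torch.einsum('brd,krd->rbk', z_s, z_t) / temperature
        labels = torch.arange(B, device=z_s.device).expand(R, B)
        loss = loss + F.cross_entropy(logits.reshape(R * B, B),
                                      labels.reshape(R * B))
    return loss / len(grid_sizes)

# Example usage with random tensors and a ResNet-like channel mismatch.
f_s, f_t = torch.randn(8, 256, 14, 14), torch.randn(8, 512, 14, 14)
proj_s, proj_t = nn.Linear(256, 128), nn.Linear(512, 128)
print(msd_contrastive_loss(f_s, f_t, proj_s, proj_t))
```

Because the loss touches only pooled features and two small projection heads, a sketch like this can be bolted onto any teacher-student pair, which is consistent with the plug-and-play claim above.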
Related papers
- MSConv: Multiplicative and Subtractive Convolution for Face Recognition [7.230136103375249]
We propose an efficient convolution module called MSConv (Multiplicative and Subtractive Convolution). Specifically, we employ multi-scale mixed convolution to capture both local and broader contextual information from face images. Experimental results demonstrate that by integrating both salient and differential features, MSConv outperforms models that only focus on salient features.
arXiv Detail & Related papers (2025-03-08T12:18:29Z)
- Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention [59.19580789952102]
This paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. MUCA also utilizes a Cross-Teacher-Student attention mechanism to guide the student network in constructing more discriminative feature representations.
arXiv Detail & Related papers (2025-01-18T11:57:20Z)
- MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution [14.265237560766268]
We introduce Multi-Range Attention Transformer (MAT) for image super-resolution (SR) tasks. MAT facilitates both multi-range attention (MA) and sparse multi-range attention (SMA), enabling efficient capture of both regional and sparse global features. We also introduce the MSConvStar module, which augments the model's ability for multi-range representation learning.
arXiv Detail & Related papers (2024-11-26T08:30:31Z)
- UDON: Universal Dynamic Online distillatioN for generic image representations [5.487134463783365]
Universal image representations are critical in enabling real-world fine-grained and instance-level recognition applications. Existing methods fail to capture important domain-specific knowledge, while ignoring differences in data distribution across different domains. We introduce a new learning technique, dubbed UDON (Universal Dynamic Online DistillatioN). UDON employs multi-teacher distillation, where each teacher is specialized in one domain, to transfer detailed domain-specific knowledge into the student universal embedding.
arXiv Detail & Related papers (2024-06-12T15:36:30Z)
- Multi-Task Multi-Scale Contrastive Knowledge Distillation for Efficient Medical Image Segmentation [0.0]
This thesis aims to investigate the feasibility of knowledge transfer between neural networks for medical image segmentation tasks.
In the context of medical imaging, where the data volumes are often limited, leveraging knowledge from a larger pre-trained network could be useful.
arXiv Detail & Related papers (2024-06-05T12:06:04Z)
- HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization [69.33162366130887]
Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features.
We introduce a novel method designed to supplement the model with domain-level and task-specific characteristics.
This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting generalization.
arXiv Detail & Related papers (2024-01-18T04:23:21Z)
- A Novel Cross-Perturbation for Single Domain Generalization [54.612933105967606]
Single domain generalization aims to enhance the ability of the model to generalize to unknown domains when trained on a single source domain.
The limited diversity in the training data hampers the learning of domain-invariant features, resulting in compromised generalization performance.
We propose CPerb, a simple yet effective cross-perturbation method to enhance the diversity of the training data.
arXiv Detail & Related papers (2023-08-02T03:16:12Z)
- On effects of Knowledge Distillation on Transfer Learning [0.0]
We propose a machine learning architecture we call TL+KD that combines knowledge distillation with transfer learning.
We show that, by using guidance and knowledge from a larger teacher network during fine-tuning, the student network can achieve better validation performance, such as higher accuracy.
arXiv Detail & Related papers (2022-10-18T08:11:52Z)
- Learning Knowledge Representation with Meta Knowledge Distillation for Single Image Super-Resolution [82.89021683451432]
We propose a model-agnostic meta knowledge distillation method under the teacher-student architecture for the single image super-resolution task.
Experiments conducted on various single image super-resolution datasets demonstrate that our proposed method outperforms existing distillation methods based on predefined knowledge representations.
arXiv Detail & Related papers (2022-07-18T02:41:04Z)
- Style Interleaved Learning for Generalizable Person Re-identification [69.03539634477637]
We propose a novel style interleaved learning (IL) framework for DG ReID training.
Unlike conventional learning strategies, IL incorporates two forward propagations and one backward propagation for each iteration.
We show that our model consistently outperforms state-of-the-art methods on large-scale benchmarks for DG ReID.
arXiv Detail & Related papers (2022-07-07T07:41:32Z)
- Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation [22.87106703794863]
The heavy inference cost of deep ensembles motivates distilling knowledge from the ensemble teacher into a smaller student network.
We propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers.
We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student.
arXiv Detail & Related papers (2022-06-30T06:23:03Z)
- Multi-scale and Cross-scale Contrastive Learning for Semantic Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z)
- Consistency and Diversity induced Human Motion Segmentation [231.36289425663702]
We propose a novel Consistency and Diversity induced human Motion Segmentation (CDMS) algorithm.
Our model factorizes the source and target data into distinct multi-layer feature spaces.
A multi-mutual learning strategy is carried out to reduce the domain gap between the source and target data.
arXiv Detail & Related papers (2022-02-10T06:23:56Z)
- Video Salient Object Detection via Adaptive Local-Global Refinement [7.723369608197167]
Video salient object detection (VSOD) is an important task in many vision applications.
We propose an adaptive local-global refinement framework for VSOD.
We show that our weighting methodology can further exploit the feature correlations, thus driving the network to learn more discriminative feature representation.
arXiv Detail & Related papers (2021-04-29T14:14:11Z)
- Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the factor of connection paths across levels between teacher and student networks and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed, as illustrated in the sketch below.
Our final design, a nested and compact framework, requires negligible overhead and outperforms other methods on a variety of tasks.
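The cross-stage idea can be pictured with a small, hedged sketch: each student stage is supervised not only by the teacher stage at the same depth but also by earlier teacher stages. The 1x1 adapters and plain L2 loss below are simplifications of our own; the paper's actual fusion and loss modules differ.

```python
# Hedged illustration of cross-stage connection paths in PyTorch: student
# stage i is aligned with every teacher stage j <= i. Adapters and MSE are
# our simplifications, not the paper's fusion/loss modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossStageDistiller(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # one 1x1 conv per (student stage i, teacher stage j) pair with j <= i
        self.adapters = nn.ModuleDict({
            f"{i}_{j}": nn.Conv2d(s_c, teacher_channels[j], kernel_size=1)
            for i, s_c in enumerate(student_channels)
            for j in range(i + 1)
        })

    def forward(self, student_feats, teacher_feats):
        loss = 0.0
        for i, f_s in enumerate(student_feats):
            for j in range(i + 1):                  # cross-stage paths j <= i
                f_t = teacher_feats[j].detach()
                f = self.adapters[f"{i}_{j}"](f_s)
                f = F.interpolate(f, size=f_t.shape[-2:], mode="bilinear",
                                  align_corners=False)
                loss = loss + F.mse_loss(f, f_t)
        return loss

# Example: three stages with typical ResNet-style channel widths.
distiller = CrossStageDistiller([64, 128, 256], [256, 512, 1024])
s_feats = [torch.randn(2, c, s, s) for c, s in zip([64, 128, 256], [56, 28, 14])]
t_feats = [torch.randn(2, c, s, s) for c, s in zip([256, 512, 1024], [56, 28, 14])]
print(distiller(s_feats, t_feats))
```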
arXiv Detail & Related papers (2021-04-19T04:36:24Z)
- Students are the Best Teacher: Exit-Ensemble Distillation with Multi-Exits [25.140055086630838]
This paper proposes a novel knowledge distillation-based learning method to improve the classification performance of convolutional neural networks (CNNs).
Unlike the conventional notion of distillation where teachers only teach students, we show that students can also help other students and even the teacher to learn better.
arXiv Detail & Related papers (2021-04-01T07:10:36Z)
- Incremental Embedding Learning via Zero-Shot Translation [65.94349068508863]
Current state-of-the-art incremental learning methods tackle the catastrophic forgetting problem in traditional classification networks.
We propose a novel class-incremental method for embedding networks, named the zero-shot translation class-incremental method (ZSTCI).
In addition, ZSTCI can easily be combined with existing regularization-based incremental learning methods to further improve performance of embedding networks.
arXiv Detail & Related papers (2020-12-31T08:21:37Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
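As a concrete, hedged illustration of the primal side mentioned above, the snippet below matches a set of student local features to a set of teacher local features with entropic-regularized optimal transport via Sinkhorn iterations. The cosine cost, regularization strength, and iteration count are our own choices, and the contrastive dual-form objective used for global transfer is not shown.

```python
# Hedged sketch of primal-form optimal-transport matching between local
# features (illustrative hyperparameters; not the paper's implementation).
import torch
import torch.nn.functional as F

def sinkhorn_cost(x_s, x_t, eps=0.1, n_iters=50):
    """x_s: [N, D] student local features, x_t: [M, D] teacher local
    features. Returns an approximate transport cost under a cosine cost,
    computed with entropic-regularized Sinkhorn iterations."""
    x_s = F.normalize(x_s, dim=-1)
    x_t = F.normalize(x_t, dim=-1)
    C = 1.0 - x_s @ x_t.t()                      # [N, M] cosine cost matrix
    K = torch.exp(-C / eps)                      # Gibbs kernel
    a = torch.full((x_s.size(0),), 1.0 / x_s.size(0), device=x_s.device)
    b = torch.full((x_t.size(0),), 1.0 / x_t.size(0), device=x_t.device)
    u = torch.ones_like(a)
    for _ in range(n_iters):                     # Sinkhorn scaling updates
        v = b / (K.t() @ u)
        u = a / (K @ v)
    T = u.unsqueeze(1) * K * v.unsqueeze(0)      # transport plan [N, M]
    return (T * C).sum()

# Example: 49 student vs. 49 teacher local descriptors in a shared 128-d
# space (project them to a common dimension first if channels differ).
print(sinkhorn_cost(torch.randn(49, 128), torch.randn(49, 128)))
```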
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation (IAKD) scheme to leverage an interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
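A hedged sketch of how such similarity between self-supervision signals could be transferred: both networks embed an original batch and an augmented view, and the student is trained to match the teacher's softened distribution of view-to-batch similarities. The cosine similarity, temperature, and KL objective below are our assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of transferring self-supervision similarity signals from
# teacher to student (illustrative choices; not the released code).
import torch
import torch.nn.functional as F

def self_supervised_similarity_kd(z_s, z_s_aug, z_t, z_t_aug, T=4.0):
    """z_*: [B, D] embeddings of the original batch; z_*_aug: [B, D]
    embeddings of augmented views. The student mimics the teacher's
    softened distribution of view-to-batch similarities."""
    def sim_logits(z, z_aug):
        z = F.normalize(z, dim=-1)
        z_aug = F.normalize(z_aug, dim=-1)
        return z_aug @ z.t()                     # [B, B] similarity matrix
    p_t = F.softmax(sim_logits(z_t, z_t_aug) / T, dim=-1)
    log_p_s = F.log_softmax(sim_logits(z_s, z_s_aug) / T, dim=-1)
    # KL(teacher || student), scaled by T^2 as is common in distillation
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * (T ** 2)

# Example with random embeddings (batch of 16, 128-d):
B, D = 16, 128
print(self_supervised_similarity_kd(torch.randn(B, D), torch.randn(B, D),
                                    torch.randn(B, D), torch.randn(B, D)))
```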
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- Global Context-Aware Progressive Aggregation Network for Salient Object Detection [117.943116761278]
We propose a novel network named GCPANet to integrate low-level appearance features, high-level semantic features, and global context features.
We show that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-02T04:26:10Z)