Related papers: Self-Distilled Self-Supervised Representation Learning

Self-Distilled Self-Supervised Representation Learning

URL: http://arxiv.org/abs/2111.12958v1
Date: Thu, 25 Nov 2021 07:52:36 GMT
Title: Self-Distilled Self-Supervised Representation Learning
Authors: Jiho Jang, Seonhoon Kim, Kiyoon Yoo, Jangho Kim, Nojun Kwak
Abstract summary: State-of-the-art frameworks in self-supervised learning have recently shown that fully utilizing transformer-based models can lead to performance boost. In our work, we further exploit this by allowing the intermediate representations to learn from the final layers via the contrastive loss. Our method, Self-Distilled Self-Supervised Learning (SDSSL), outperforms competitive baselines (SimCLR, BYOL and MoCo v3) using ViT on various tasks and datasets.
Score: 35.60243157730165
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: State-of-the-art frameworks in self-supervised learning have recently shown that fully utilizing transformer-based models can lead to performance boost compared to conventional CNN models. Thriving to maximize the mutual information of two views of an image, existing works apply a contrastive loss to the final representations. In our work, we further exploit this by allowing the intermediate representations to learn from the final layers via the contrastive loss, which is maximizing the upper bound of the original goal and the mutual information between two layers. Our method, Self-Distilled Self-Supervised Learning (SDSSL), outperforms competitive baselines (SimCLR, BYOL and MoCo v3) using ViT on various tasks and datasets. In the linear evaluation and k-NN protocol, SDSSL not only leads to superior performance in the final layers, but also in most of the lower layers. Furthermore, positive and negative alignments are used to explain how representations are formed more effectively. Code will be available.

Related papers

Dual Guidance Semi-Supervised Action Detection [71.45023660211145]
We present a semi-supervised approach for spatial-temporal action localization.<n>We introduce a dual guidance network to select better pseudo-bounding boxes.<n>Our framework achieves superior results compared to extended image-based semi-supervised baselines.
arXiv Detail & Related papers (2025-07-28T18:08:36Z)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.<n>MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
Semi-KAN: KAN Provides an Effective Representation for Semi-Supervised Learning in Medical Image Segmentation [2.717521115234258]
Semi-supervised medical image segmentation (SSMIS) offers a viable alternative to CNNs and ViTs. Inspired by Kolmogorov-Arnold Networks (KANs), we propose Semi-KAN. KANs exhibit superior representation learning capabilities with fewer parameters. We show that Semi-KAN surpasses baseline networks, utilizing fewer KAN layers and lower computational cost.
arXiv Detail & Related papers (2025-03-19T08:27:41Z)
Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples. For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge. We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. We investigate the differences in CLIP performance among various neural architectures. We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks. We propose a single-stage and standalone method, MOCA, which unifies both desired properties. We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [37.48292304239107]
We present a transformer-based end-to-end ZSL method named DUET. We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images. We find that DUET can often achieve state-of-the-art performance, its components are effective and its predictions are interpretable.
arXiv Detail & Related papers (2022-07-04T11:12:12Z)
Virtual embeddings and self-consistency for self-supervised learning [43.086696088061416]
TriMix is a novel concept for self-supervised learning that generates virtual embeddings through linear data. We validate TriMix on eight benchmark datasets with an improvement of 2.71% and 0.41% better than the second-best models for both data types.
arXiv Detail & Related papers (2022-06-13T10:20:28Z)
Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complimentary sources of supervision: pseudo-labels and raw images. MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval. Our COTS achieves the highest performance among all two-stream methods and comparable performance with 10,800X faster in inference. Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-ofthe-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
Imposing Consistency for Optical Flow Estimation [73.53204596544472]
Imposing consistency through proxy tasks has been shown to enhance data-driven learning. This paper introduces novel and effective consistency strategies for optical flow estimation.
arXiv Detail & Related papers (2022-04-14T22:58:30Z)
Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity. We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies. Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
Distilling Visual Priors from Self-Supervised Learning [24.79633121345066]
Convolutional Neural Networks (CNNs) are prone to overfit small training datasets. We present a novel two-phase pipeline that leverages self-supervised learning and knowledge distillation to improve the generalization ability of CNN models for image classification under the data-deficient setting.
arXiv Detail & Related papers (2020-08-01T13:07:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.