Patch-level Representation Learning for Self-supervised Vision Transformers
- URL: http://arxiv.org/abs/2206.07990v2
- Date: Fri, 17 Jun 2022 01:35:03 GMT
- Title: Patch-level Representation Learning for Self-supervised Vision Transformers
- Authors: Sukmin Yun, Hankook Lee, Jaehyung Kim, Jinwoo Shin
- Abstract summary: Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
- Score: 68.8862419248863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent self-supervised learning (SSL) methods have shown impressive results
in learning visual representations from unlabeled images. This paper aims to
improve their performance further by utilizing the architectural advantages of
the underlying neural network, as the current state-of-the-art visual pretext
tasks for SSL do not enjoy the benefit, i.e., they are architecture-agnostic.
In particular, we focus on Vision Transformers (ViTs), which have gained much
attention recently as a better architectural choice, often outperforming
convolutional networks for various visual tasks. The unique characteristic of
ViT is that it takes a sequence of disjoint patches from an image and processes
patch-level representations internally. Inspired by this, we design a simple
yet effective visual pretext task, coined SelfPatch, for learning better
patch-level representations. To be specific, we enforce invariance against each
patch and its neighbors, i.e., each patch treats similar neighboring patches as
positive samples. Consequently, training ViTs with SelfPatch learns more
semantically meaningful relations among patches (without using human-annotated
labels), which can be beneficial, in particular, to downstream tasks of a dense
prediction type. Despite its simplicity, we demonstrate that it can
significantly improve the performance of existing SSL methods for various
visual tasks, including object detection and semantic segmentation.
Specifically, SelfPatch significantly improves the recent self-supervised ViT,
DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance
segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
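To make the pretext task concrete, here is a minimal PyTorch-style sketch of the core idea: for every patch token, the k most similar adjacent patches (measured on a detached teacher branch) are treated as positives, and the student patch is pulled toward them with a cosine objective. The grid-neighborhood helper, the tensor shapes, the plain cosine loss, and the function names (`neighbor_indices`, `selfpatch_loss`) are illustrative assumptions; the paper itself builds SelfPatch on top of DINO with a momentum teacher, an aggregation module, and a projection-head objective.

```python
import torch
import torch.nn.functional as F

def neighbor_indices(h, w):
    """Indices of the 8-connected grid neighbors for each of the h*w patches.
    Border patches fall back to their own index for missing neighbors (a simplification)."""
    idx = torch.arange(h * w).view(h, w)
    cols = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = idx.roll(shifts=(-dy, -dx), dims=(0, 1))  # shifted[y, x] = idx[y+dy, x+dx] (wrapping)
            valid = torch.ones(h, w, dtype=torch.bool)
            if dy == 1:  valid[-1, :] = False   # mask positions where the roll wrapped around
            if dy == -1: valid[0, :] = False
            if dx == 1:  valid[:, -1] = False
            if dx == -1: valid[:, 0] = False
            cols.append(torch.where(valid, shifted, idx).flatten())
    return torch.stack(cols, dim=1)  # (h*w, 8)

def selfpatch_loss(student_patches, teacher_patches, h, w, k=4):
    """student_patches, teacher_patches: (B, h*w, D) patch tokens from the two branches.
    Each student patch is pulled toward the k adjacent teacher patches most similar to
    its own teacher-side representation (the positives)."""
    B, N, D = student_patches.shape
    nbr = neighbor_indices(h, w).to(student_patches.device)       # (N, 8)
    s = F.normalize(student_patches, dim=-1)
    t = F.normalize(teacher_patches.detach(), dim=-1)             # no gradient through the teacher
    t_nbr = t[:, nbr]                                             # (B, N, 8, D) neighbor features
    sim = torch.einsum('bnd,bnkd->bnk', t, t_nbr)                 # teacher-side similarity to each neighbor
    top = sim.topk(k, dim=-1).indices                             # (B, N, k) most similar neighbors
    pos = torch.gather(t_nbr, 2, top.unsqueeze(-1).expand(-1, -1, -1, D))
    return -torch.einsum('bnd,bnkd->bnk', s, pos).mean()          # maximize cosine similarity to positives
```

In the paper, this patch-level objective is added on top of an image-level SSL loss such as DINO's, which is why it composes with existing self-supervised methods.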
Related papers
- Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers [5.359378066251386]
Self-supervised learning with vision transformers (ViTs) has proven effective for representation learning.
Existing ViT-based SSL architectures do not fully exploit the ViT backbone.
We introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively.
arXiv Detail & Related papers (2024-06-18T06:36:44Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data hunger of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have recently shown great promise for many vision tasks due to their insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potential of vision transformers (ViTs) for dense visual prediction.
Our motivation is that by learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to achieve higher performance than its CNN state-of-the-art counterparts.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm in which second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification (a rough sketch of this pooling idea follows the list below).
We develop a lightweight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z)
- Unsupervised Pretraining for Object Detection by Patch Reidentification [72.75287435882798]
Unsupervised representation learning achieves promising performance in pre-training representations for object detectors.
This work proposes a simple yet effective representation learning method for object detection, named patch re-identification (Re-ID).
Our method significantly outperforms its counterparts on COCO in all settings, such as different training iterations and data percentages.
arXiv Detail & Related papers (2021-03-08T15:13:59Z)
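For the So-ViT entry above, here is a rough sketch of what combining second-order pooling of visual tokens with the class-token head could look like. The reduction layer, the plain covariance statistic (So-ViT's exact cross-covariance formulation may differ), the reduced dimension, and the additive fusion of the two heads are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SecondOrderHead(nn.Module):
    """Illustrative head: covariance (second-order) pooling of patch tokens,
    combined additively with the usual class-token classifier."""
    def __init__(self, dim, num_classes, reduced=64):
        super().__init__()
        self.reduce = nn.Linear(dim, reduced)               # shrink token dim before pooling
        self.cls_head = nn.Linear(dim, num_classes)         # standard class-token head
        self.so_head = nn.Linear(reduced * reduced, num_classes)

    def forward(self, cls_token, patch_tokens):
        # patch_tokens: (B, N, dim), cls_token: (B, dim)
        z = self.reduce(patch_tokens)                       # (B, N, r)
        z = z - z.mean(dim=1, keepdim=True)                 # center tokens
        cov = torch.einsum('bnr,bns->brs', z, z) / z.size(1)  # (B, r, r) second-order statistic
        so_logits = self.so_head(cov.flatten(1))            # classify from the flattened covariance
        return self.cls_head(cls_token) + so_logits         # combine both predictions
```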