Siamese Image Modeling for Self-Supervised Vision Representation
Learning
- URL: http://arxiv.org/abs/2206.01204v1
- Date: Thu, 2 Jun 2022 17:59:58 GMT
- Title: Siamese Image Modeling for Self-Supervised Vision Representation
Learning
- Authors: Chenxin Tao, Xizhou Zhu, Gao Huang, Yu Qiao, Xiaogang Wang, Jifeng Dai
- Abstract summary: Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks.
Two mainstream SSL frameworks have been proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM).
We propose Siamese Image Modeling (SIM), which predicts the dense representations of an augmented view.
- Score: 73.78790119050056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) has delivered superior performance on a
variety of downstream vision tasks. Two mainstream SSL frameworks have been
proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM).
ID pulls together the representations of different views of the same image
while avoiding feature collapse. It does well on linear probing but lags in
detection performance. MIM, on the other hand, reconstructs the original
content from a masked image. It excels at dense prediction but performs poorly
on linear probing. This gap arises because each framework neglects one of two
representation requirements: semantic alignment or spatial sensitivity.
Specifically, we observe that (1) semantic alignment demands that semantically
similar views be projected into nearby representations, which can be achieved
by contrasting different views under strong augmentations; and (2) spatial
sensitivity requires modeling the local structure within an image, so
predicting dense representations from a masked image is beneficial because it
models the conditional distribution of image content. Driven by these
analyses, we propose Siamese Image Modeling (SIM), which predicts the dense
representations of an augmented view based on another masked view of the same
image with different augmentations. Our method uses a Siamese network with two
branches. The online branch encodes the first view and predicts the second
view's representations according to the relative positions between the two
views. The target branch produces the targets by encoding the second view. In
this way, we match the linear probing performance of ID and the dense
prediction performance of MIM. We also demonstrate that decent linear probing
results can be obtained without a global loss. Code will be released.
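To make the procedure described in the abstract concrete, here is a minimal
PyTorch-style sketch of one SIM training step. The encoder/decoder module
names, the negative-cosine dense loss, and the EMA momentum value are
illustrative assumptions inferred from the abstract, not the paper's exact
implementation.

```python
import torch
import torch.nn.functional as F


def sim_training_step(online_enc, online_dec, target_enc,
                      view1, view2, mask, rel_pos_embed, momentum=0.996):
    # Online branch: encode the masked first view. `online_enc` is assumed
    # to drop (or replace) the patches selected by `mask`.
    z1 = online_enc(view1, mask)                       # (B, N_visible, D)

    # Predict the second view's dense features; the decoder is conditioned
    # on embeddings of the relative positions between the two crops, which
    # is the mechanism the abstract describes.
    pred = online_dec(z1, rel_pos_embed)               # (B, N, D)

    # Target branch: encode the full (unmasked) second view, no gradients.
    with torch.no_grad():
        target = target_enc(view2)                     # (B, N, D)

    # Dense loss: per-patch negative cosine similarity between prediction
    # and target (an assumption; the exact dense objective may differ).
    loss = -F.cosine_similarity(pred, target, dim=-1).mean()

    # EMA update of the target encoder, standard for Siamese SSL methods
    # (in practice this is usually done after the optimizer step).
    with torch.no_grad():
        for p_o, p_t in zip(online_enc.parameters(),
                            target_enc.parameters()):
            p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

    return loss
```

The key detail carried over from the abstract is that the decoder is
conditioned on the relative positions between the two views, which lets the
online branch predict features at the second view's spatial locations rather
than reconstruct its own input.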
Related papers
- SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map of equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z) - Self-supervised Cross-view Representation Reconstruction for Change
Captioning [113.08380679787247]
Change captioning aims to describe the difference between a pair of similar images.
Its key challenge is learning a stable difference representation under pseudo changes caused by shifts in viewpoint.
We propose a self-supervised cross-view representation reconstruction network.
arXiv Detail & Related papers (2023-09-28T09:28:50Z) - CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View
Completion [20.121597331207276]
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
arXiv Detail & Related papers (2022-10-19T16:50:36Z) - Dense Semantic Contrast for Self-Supervised Visual Representation
Learning [12.636783522731392]
We present Dense Semantic Contrast (DSC) for modeling semantic category decision boundaries at a dense level.
We propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning.
Experimental results show that our DSC model outperforms state-of-the-art methods when transferring to downstream dense prediction tasks.
arXiv Detail & Related papers (2021-09-16T07:04:05Z) - Seed the Views: Hierarchical Semantic Alignment for Contrastive
Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to cross-samples and multi-level representations.
Our method, termed CsMl, integrates multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z) - Unsupervised Learning of Dense Visual Representations [14.329781842154281]
We propose View-Agnostic Dense Representation (VADeR) for unsupervised learning of dense representations.
VADeR learns pixelwise representations by forcing local features to remain constant over different viewing conditions (a minimal sketch of this kind of pixel-level objective appears after this list).
Our method outperforms ImageNet supervised pretraining in multiple dense prediction tasks.
arXiv Detail & Related papers (2020-11-11T01:28:11Z) - Self-Supervised Ranking for Representation Learning [108.38993212650577]
We present a new framework for self-supervised representation learning by formulating it as a ranking problem in an image retrieval context.
We train a representation encoder by maximizing average precision (AP) for ranking, where random views of an image are considered positively related.
In principle, by using a ranking criterion, we eliminate reliance on object-centric curated datasets.
arXiv Detail & Related papers (2020-10-14T17:24:56Z) - Multi-Margin based Decorrelation Learning for Heterogeneous Face
Recognition [90.26023388850771]
This paper presents a deep neural network approach to extract decorrelation representations in a hyperspherical space for cross-domain face images.
The proposed framework can be divided into two components: heterogeneous representation network and decorrelation representation learning.
Experimental results on two challenging heterogeneous face databases show that our approach achieves superior performance on both verification and recognition tasks.
arXiv Detail & Related papers (2020-05-25T07:01:12Z)
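As noted in the VADeR entry above, the following is a minimal sketch of a
pixel-level contrastive (InfoNCE) objective in the spirit of forcing local
features to remain constant across viewing conditions. The function name, the
correspondence map, and the temperature are illustrative assumptions, not
VADeR's exact formulation.

```python
import torch
import torch.nn.functional as F


def dense_infonce(feat1, feat2, correspondence, temperature=0.1):
    # feat1, feat2: dense feature maps (B, D, H, W) from two augmented views.
    # correspondence: long tensor (B, H*W); for each location in view 1, the
    # index of its matching location in view 2, recovered from the known
    # augmentation geometry (an assumption of this sketch).
    B, D, H, W = feat1.shape
    q = F.normalize(feat1.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, D)
    k = F.normalize(feat2.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, D)

    # Pairwise similarities between every location in view 1 and view 2.
    logits = torch.einsum('bqd,bkd->bqk', q, k) / temperature  # (B, HW, HW)

    # Each query pixel's positive is its geometric correspondence; all other
    # locations in view 2 act as negatives.
    return F.cross_entropy(logits.flatten(0, 1), correspondence.flatten())
```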