Position Labels for Self-Supervised Vision Transformer
- URL: http://arxiv.org/abs/2206.04981v1
- Date: Fri, 10 Jun 2022 10:29:20 GMT
- Title: Position Labels for Self-Supervised Vision Transformer
- Authors: Zhemin Zhang, Xun Gong, Jinyi Wu
- Abstract summary: Position encoding is important for vision transformer (ViT) to capture the spatial structure of the input image.
We propose two position labels dedicated to 2D images: absolute position and relative position.
Our position labels can be easily plugged into transformers and combined with various current ViT variants.
- Score: 1.3406858660972554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Position encoding is important for vision transformer (ViT) to capture the spatial structure of the input image, and its general efficacy in ViT has been demonstrated. In our work we propose to train ViT to recognize the 2D position encoding of patches of the input image; this apparently simple task actually yields a meaningful self-supervisory signal. Based on previous work on ViT position encoding, we propose two position labels dedicated to 2D images: absolute position and relative position. Our position labels can be easily plugged into transformers and combined with various current ViT variants. They can work in two ways: 1. As an auxiliary training target for vanilla ViT models (e.g., ViT-B and Swin-B) to improve model performance. 2. Combined with self-supervised ViT methods (e.g., MAE) to provide a more powerful self-supervised signal for semantic feature learning. Experiments demonstrate that, solely due to the proposed self-supervised methods, Swin-B and ViT-B obtain improvements of 1.9% and 5.6% top-1 accuracy on Mini-ImageNet, respectively.
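To make the auxiliary-target use concrete, here is a minimal sketch of attaching an absolute-position prediction head to a ViT backbone. All module names and the loss weight are hypothetical, the backbone is assumed to return per-patch tokens of shape (B, N, D), and only the absolute position label is shown; this illustrates the idea rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTWithPositionLabels(nn.Module):
    """Hypothetical wrapper: a ViT backbone plus a head that must recognize
    each patch's absolute 2D position (flattened to an index)."""
    def __init__(self, backbone, embed_dim, num_patches, num_classes):
        super().__init__()
        self.backbone = backbone                           # any encoder returning (B, N, D) patch tokens
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.pos_head = nn.Linear(embed_dim, num_patches)  # predicts each patch's absolute index

    def forward(self, images, labels=None, pos_weight=0.1):
        tokens = self.backbone(images)                     # (B, N, D)
        B, N, _ = tokens.shape
        pos_logits = self.pos_head(tokens)                 # (B, N, num_patches)
        # absolute position label: the token coming from patch i should be classified as index i
        pos_target = torch.arange(N, device=tokens.device).expand(B, N)
        pos_loss = F.cross_entropy(pos_logits.reshape(B * N, -1), pos_target.reshape(-1))
        if labels is None:                                 # purely self-supervised use (e.g., alongside MAE)
            return pos_loss
        cls_logits = self.cls_head(tokens.mean(dim=1))
        cls_loss = F.cross_entropy(cls_logits, labels)
        return cls_loss + pos_weight * pos_loss            # auxiliary-target use for supervised training
```

The relative-position variant would instead predict offsets between pairs of patch tokens; the weighting between the two losses is a free design choice not specified in the abstract.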
Related papers
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
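As a rough picture of the second stage (the dimensions, single-layer design, and MSE objective below are assumptions for illustration, not DVT's actual configuration):

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureDenoiser(nn.Module):
    """Toy stand-in for the lightweight denoiser: one transformer layer plus a
    projection that maps raw ViT features toward clean features."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, raw_features):                  # raw ViT outputs, shape (B, N, dim)
        return self.proj(self.block(raw_features))

def denoising_loss(denoiser, raw_features, clean_estimates):
    # regress toward the per-image clean-feature estimates derived in the first stage
    return F.mse_loss(denoiser(raw_features), clean_estimates)
```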
arXiv Detail & Related papers (2024-01-05T18:59:52Z) - Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [7.58745191859815]
Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks.
We propose explicitly adding a Gaussian attention bias that guides the positional embedding to have the corresponding pattern from the beginning of training.
The results show that the proposed method not only helps ViTs understand images but also boosts their performance on various datasets.
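A rough sketch of how such a bias could look, under the assumption that it is added directly to the attention logits and controlled by a single width sigma (both assumptions, not details taken from the paper):

```python
import torch

def gaussian_attention_bias(h, w, sigma=2.0):
    """Additive attention bias that decays with the squared 2D distance between
    patches, i.e. the log of an unnormalized Gaussian over the patch grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2) grid coordinates
    sq_dist = (coords[:, None, :] - coords[None, :, :]).pow(2).sum(-1)   # (N, N) squared distances
    return -sq_dist / (2 * sigma ** 2)

# Hypothetical use inside self-attention for a 14x14 patch grid:
#   scores = q @ k.transpose(-2, -1) / scale + gaussian_attention_bias(14, 14)
```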
arXiv Detail & Related papers (2023-05-08T14:12:25Z) - ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised Medical Image Representations [3.6284577335311554]
Vision transformer-based autoencoder (ViT-AE) is a self-supervised learning technique that employs a patch-masking strategy to learn a meaningful latent space.
We propose two new loss functions to enhance the representation during training.
We extensively evaluate ViT-AE++ on both natural images and medical images, demonstrating consistent improvement over vanilla ViT-AE.
arXiv Detail & Related papers (2023-01-18T09:25:21Z) - Semi-supervised Vision Transformers at Scale [93.0621675558895]
We study semi-supervised learning (SSL) for vision transformers (ViT).
We propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning.
Our proposed method, dubbed Semi-ViT, achieves performance comparable to or better than its CNN counterparts in the semi-supervised classification setting.
arXiv Detail & Related papers (2022-08-11T08:11:54Z) - HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9x speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z) - Vision Transformer Adapter for Dense Predictions [57.590511173416445]
Vision Transformer (ViT) achieves inferior performance on dense prediction tasks due to the lack of image prior information.
We propose a Vision Transformer Adapter (ViT-Adapter) that can remedy the defects of ViT and achieve performance comparable to vision-specific models.
We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation.
arXiv Detail & Related papers (2022-05-17T17:59:11Z) - Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% while also achieving 2.01x higher throughput.
arXiv Detail & Related papers (2022-03-08T02:57:49Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised training provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
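For context, the core of DINO's objective can be sketched in a few lines; the temperatures and EMA momentum below are typical values chosen for illustration, and multi-crop handling is omitted:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher distribution and the
    student distribution (self-distillation with no labels, simplified)."""
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # the teacher's parameters track an exponential moving average of the student's
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```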
arXiv Detail & Related papers (2021-04-29T12:28:51Z)