RePre: Improving Self-Supervised Vision Transformer with Reconstructive
Pre-training
- URL: http://arxiv.org/abs/2201.06857v2
- Date: Wed, 19 Jan 2022 03:26:39 GMT
- Title: RePre: Improving Self-Supervised Vision Transformer with Reconstructive
Pre-training
- Authors: Luya Wang, Feng Liang, Yangguang Li, Honggang Zhang, Wanli Ouyang,
Jing Shao
- Abstract summary: This paper incorporates local feature learning into self-supervised vision transformers via Reconstructive Pre-training (RePre)
Our RePre extends contrastive frameworks by adding a branch for reconstructing raw image pixels in parallel with the existing contrastive objective.
- Score: 80.44284270879028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, self-supervised vision transformers have attracted unprecedented
attention for their impressive representation learning ability. However, the
dominant method, contrastive learning, mainly relies on an instance
discrimination pretext task, which learns a global understanding of the image.
This paper incorporates local feature learning into self-supervised vision
transformers via Reconstructive Pre-training (RePre). Our RePre extends
contrastive frameworks by adding a branch for reconstructing raw image pixels
in parallel with the existing contrastive objective. RePre is equipped with a
lightweight convolution-based decoder that fuses the multi-hierarchy features
from the transformer encoder. The multi-hierarchy features provide rich
supervision ranging from low- to high-level semantic information, which is crucial for our
RePre. Our RePre brings decent improvements on various contrastive frameworks
with different vision transformer architectures. Transfer performance in
downstream tasks outperforms supervised pre-training and state-of-the-art
(SOTA) self-supervised counterparts.
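The design described in the abstract lends itself to a short sketch. The following PyTorch-style snippet is a minimal illustration, not the authors' implementation: the module names, channel sizes, loss weighting, and exact fusion scheme are assumptions; only the overall layout (a contrastive branch plus a lightweight convolutional decoder that fuses multi-level encoder features to reconstruct raw pixels) follows the abstract.

```python
# Minimal, illustrative sketch of the RePre idea (not the paper's code).
# Assumptions: the encoder returns feature maps from several transformer stages,
# the decoder is a small convolutional stack, and the two losses are simply summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConvDecoder(nn.Module):
    """Fuses multi-hierarchy encoder features and reconstructs raw image pixels."""
    def __init__(self, feat_dims, img_channels=3, hidden=64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, hidden, kernel_size=1) for d in feat_dims)
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden * len(feat_dims), hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, img_channels, kernel_size=3, padding=1),
        )

    def forward(self, feats, out_size):
        # feats: list of (B, C_i, H_i, W_i) maps from different encoder stages
        ups = [F.interpolate(p(f), size=out_size, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.fuse(torch.cat(ups, dim=1))

def repre_step(view1, view2, encoder, proj_head, decoder, contrastive_loss):
    """One hypothetical training step: contrastive objective + pixel reconstruction."""
    feats1, feats2 = encoder(view1), encoder(view2)        # multi-stage features (assumption)
    z1 = proj_head(feats1[-1].mean(dim=(-2, -1)))          # pooled global embedding
    z2 = proj_head(feats2[-1].mean(dim=(-2, -1)))
    loss_con = contrastive_loss(z1, z2)                    # existing contrastive branch
    recon = decoder(feats1, out_size=view1.shape[-2:])     # parallel reconstruction branch
    loss_rec = F.l1_loss(recon, view1)                     # target: raw image pixels
    return loss_con + loss_rec
```

The reconstruction target, loss form, and decoder depth used in the paper may differ; the sketch only conveys the two-branch layout.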
Related papers
- How Powerful Potential of Attention on Image Restoration? [97.9777639562205]
We conduct an empirical study to explore the potential of attention mechanisms without an FFN.
We propose Continuous Scaling Attention (CSAttn), a method that computes attention continuously in three stages without an FFN.
Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect the model performance.
arXiv Detail & Related papers (2024-03-15T14:23:12Z)
- Boosting Image Restoration via Priors from Pre-trained Models [54.83907596825985]
We learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF.
PTG-RM effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.
arXiv Detail & Related papers (2024-03-11T15:11:57Z)
- Segmentation Guided Sparse Transformer for Under-Display Camera Image Restoration [91.65248635837145]
Under-Display Camera (UDC) is an emerging technology that achieves a full-screen display by hiding the camera under the display panel.
In this paper, we observe that when using the Vision Transformer for UDC degraded image restoration, the global attention samples a large amount of redundant information and noise.
We propose a Segmentation Guided Sparse Transformer method (SGSFormer) for restoring high-quality images from UDC-degraded images.
arXiv Detail & Related papers (2024-03-09T13:11:59Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- Boosting vision transformers for image retrieval [11.441395750267052]
Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection.
However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks.
We propose a number of improvements that make transformers outperform the state of the art for the first time.
arXiv Detail & Related papers (2022-10-21T12:17:12Z)
- Visual Prompt Tuning for Generative Transfer Learning [26.895321693202284]
We present a recipe for learning vision transformers by generative knowledge transfer.
We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens fed to autoregressive or non-autoregressive transformers.
To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens, called prompts, to the image token sequence.
arXiv Detail & Related papers (2022-10-03T14:56:05Z)
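Since prompt tuning, as summarized in the entry above, amounts to prepending learnable tokens to the image token sequence, a generic sketch is easy to give. The snippet below is illustrative only and not taken from that paper; the frozen backbone, prompt count, and embedding size are assumptions.

```python
# Generic prompt-tuning sketch (illustrative; not the paper's implementation).
import torch
import torch.nn as nn

class PromptedTransformer(nn.Module):
    """Prepends learnable prompt tokens to the image token sequence of a frozen backbone."""
    def __init__(self, backbone, num_prompts=10, dim=768):
        super().__init__()
        self.backbone = backbone              # pre-trained transformer, kept frozen (assumption)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim) visual tokens; the prompts are the only new parameters
        prompts = self.prompts.expand(image_tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, image_tokens], dim=1))
```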
- Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning [0.0]
We study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations.
Our results show that all methods are effective in learning useful representations and avoiding representational collapse.
The encoder pretrained with the temporal order verification task shows the best results across all experiments.
arXiv Detail & Related papers (2022-09-22T10:18:59Z)
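For context, temporal order verification is a common self-supervised pretext task in which a model judges whether a short clip's frames appear in their original order. The sketch below follows that generic formulation rather than the paper's exact setup; the per-frame encoder, clip length, and pooling choice are assumptions.

```python
# Generic temporal order verification sketch (illustrative; not the paper's setup).
import torch
import torch.nn as nn

def make_order_batch(frames):
    """frames: (B, T, C, H, W). Shuffle roughly half the clips; label 1 = original order."""
    B, T = frames.shape[:2]
    labels = (torch.rand(B) < 0.5).long()
    clips = frames.clone()
    for i in range(B):
        if labels[i] == 0:                        # destroy the temporal order of this clip
            clips[i] = frames[i, torch.randperm(T)]  # identity permutation is possible but rare
    return clips, labels

class OrderVerifier(nn.Module):
    """Encodes each frame, concatenates the embeddings in order, and classifies ordered vs. shuffled."""
    def __init__(self, encoder, embed_dim=384, num_frames=4):
        super().__init__()
        self.encoder = encoder                    # per-frame encoder, e.g. a pretrained ViT
        self.classifier = nn.Linear(embed_dim * num_frames, 2)

    def forward(self, clips):
        B, T = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1))  # (B*T, embed_dim) frame embeddings
        feats = feats.view(B, -1)                  # concatenation preserves the temporal order
        return self.classifier(feats)

# usage: clips, labels = make_order_batch(frames)
#        loss = nn.functional.cross_entropy(OrderVerifier(encoder)(clips), labels)
```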
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Improve Vision Transformers Training by Suppressing Over-smoothing [28.171262066145612]
Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks.
arXiv Detail & Related papers (2021-04-26T17:43:04Z)