RePre: Improving Self-Supervised Vision Transformer with Reconstructive
Pre-training
- URL: http://arxiv.org/abs/2201.06857v2
- Date: Wed, 19 Jan 2022 03:26:39 GMT
- Title: RePre: Improving Self-Supervised Vision Transformer with Reconstructive
Pre-training
- Authors: Luya Wang, Feng Liang, Yangguang Li, Honggang Zhang, Wanli Ouyang,
Jing Shao
- Abstract summary: This paper incorporates local feature learning into self-supervised vision transformers via Reconstructive Pre-training (RePre).
Our RePre extends contrastive frameworks by adding a branch for reconstructing raw image pixels in parallel with the existing contrastive objective.
- Score: 80.44284270879028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, self-supervised vision transformers have attracted unprecedented
attention for their impressive representation learning ability. However, the
dominant method, contrastive learning, mainly relies on an instance
discrimination pretext task, which learns a global understanding of the image.
This paper incorporates local feature learning into self-supervised vision
transformers via Reconstructive Pre-training (RePre). Our RePre extends
contrastive frameworks by adding a branch for reconstructing raw image pixels
in parallel with the existing contrastive objective. RePre is equipped with a
lightweight convolution-based decoder that fuses the multi-hierarchy features
from the transformer encoder. The multi-hierarchy features provide rich
supervision ranging from low-level to high-level semantic information, which is
crucial for our RePre. RePre brings decent improvements on various contrastive
frameworks with different vision transformer architectures. Transfer performance
on downstream tasks outperforms that of supervised pre-training and
state-of-the-art (SOTA) self-supervised counterparts.
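The objective described above can be read as a two-branch loss: the usual contrastive loss on global features, plus a pixel reconstruction loss computed by a lightweight convolutional decoder that fuses features from several transformer stages. The PyTorch sketch below is illustrative only; the module names, the assumption of a plain ViT where all stages share one token grid, the L1 reconstruction loss, and the weight `w_rec` are assumptions, not the authors' released implementation.

```python
# Illustrative sketch of a RePre-style objective: contrastive loss on global
# features + pixel reconstruction from a lightweight convolutional decoder that
# fuses multi-hierarchy encoder features. Names, the L1 loss, and `w_rec` are
# assumptions for illustration; this is not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightDecoder(nn.Module):
    """Fuses token features from several transformer stages and predicts RGB pixels.

    Assumes a plain ViT where every stage keeps the same (side x side) token grid.
    """
    def __init__(self, dims, img_size=224, patch=16):
        super().__init__()
        self.img_size, self.patch = img_size, patch
        self.proj = nn.ModuleList([nn.Conv2d(d, 64, kernel_size=1) for d in dims])
        self.fuse = nn.Sequential(
            nn.Conv2d(64 * len(dims), 64, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(64, 3 * patch * patch, kernel_size=1),  # per-patch RGB values
        )

    def forward(self, feats):
        # feats: list of (B, N, C_i) token features from different encoder stages
        side = self.img_size // self.patch
        maps = []
        for f, proj in zip(feats, self.proj):
            b, n, c = f.shape
            f = f.transpose(1, 2).reshape(b, c, side, side)   # tokens -> feature map
            maps.append(proj(f))
        x = self.fuse(torch.cat(maps, dim=1))                  # (B, 3*p*p, side, side)
        return F.pixel_shuffle(x, self.patch)                  # (B, 3, H, W)

def repre_loss(global_q, global_k, multi_level_feats, images,
               decoder, contrastive_loss_fn, w_rec=1.0):
    """Contrastive loss + pixel reconstruction loss (RePre-style, illustrative)."""
    loss_con = contrastive_loss_fn(global_q, global_k)
    recon = decoder(multi_level_feats)
    loss_rec = F.l1_loss(recon, images)
    return loss_con + w_rec * loss_rec
```

In use, `repre_loss` would be called with the query/key global embeddings produced by whatever contrastive framework is being extended, the list of intermediate token features from the encoder, and the raw input batch.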
Related papers
- Universal Approximation of Visual Autoregressive Transformers [28.909655919558706]
We extend our analysis to include Visual Autoregressive (VAR) transformers.
VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine "next-scale prediction" framework.
Our results provide important design principles for effective and computationally efficient VAR Transformer strategies.
arXiv Detail & Related papers (2025-02-10T05:36:30Z)
- Varformer: Adapting VAR's Generative Prior for Image Restoration [6.0648320320309885]
VAR, a novel image generative paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach.
We formulate the multi-scale latent representations within VAR as the restoration prior, advancing our carefully designed VarFormer framework.
arXiv Detail & Related papers (2024-12-30T16:32:55Z)
- How Powerful Potential of Attention on Image Restoration? [97.9777639562205]
We conduct an empirical study to explore the potential of attention mechanisms without using the feed-forward network (FFN).
We propose Continuous Scaling Attention (CSAttn), a method that computes attention continuously in three stages without an FFN.
Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect the model performance.
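As a rough illustration of the "attention without FFN" setting studied here, the block below keeps only multi-head self-attention with a residual connection and normalization and drops the MLP/FFN sub-layer. This is a generic sketch, not the paper's CSAttn; its three-stage continuous scaling is not reproduced.

```python
# Generic attention-only transformer block (no FFN), illustrating the idea of
# studying attention without the MLP sub-layer. NOT the paper's CSAttn.
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # Pre-norm self-attention with a residual connection; the usual FFN is omitted.
        h = self.norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + h

tokens = torch.randn(2, 196, 256)   # (batch, patch tokens, dim)
out = AttentionOnlyBlock()(tokens)
print(out.shape)                    # torch.Size([2, 196, 256])
```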
arXiv Detail & Related papers (2024-03-15T14:23:12Z)
- Boosting Image Restoration via Priors from Pre-trained Models [54.83907596825985]
We learn an additional lightweight module, the Pre-Train-Guided Refinement Module (PTG-RM), to refine the restoration results of a target restoration network with OSF.
PTG-RM effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.
arXiv Detail & Related papers (2024-03-11T15:11:57Z)
- Segmentation Guided Sparse Transformer for Under-Display Camera Image Restoration [91.65248635837145]
Under-Display Camera (UDC) is an emerging technology that achieves a full-screen display by hiding the camera under the display panel.
In this paper, we observe that when using the Vision Transformer for UDC degraded image restoration, the global attention samples a large amount of redundant information and noise.
We propose the Segmentation Guided Sparse Transformer (SGSFormer) for restoring high-quality images from UDC-degraded inputs.
arXiv Detail & Related papers (2024-03-09T13:11:59Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach that exploits pre-trained vision-language models (e.g., CLIP) and enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- Visual Prompt Tuning for Generative Transfer Learning [26.895321693202284]
We present a recipe for learning vision transformers by generative knowledge transfer.
We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens fed to autoregressive or non-autoregressive transformers.
To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens, called prompts, to the image token sequence.
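The prompt-tuning step described here amounts to prepending a small set of learnable token embeddings to the image token sequence and training only those embeddings while the backbone stays frozen. The sketch below is a generic illustration of that mechanism under those assumptions, not the paper's exact generative-transfer recipe.

```python
# Generic sketch of visual prompt tuning: prepend learnable "prompt" tokens to
# the image token sequence and train only those tokens (backbone frozen).
# Illustrative only; not the paper's exact recipe.
import torch
import torch.nn as nn

class PromptedTokens(nn.Module):
    def __init__(self, num_prompts=10, dim=768):
        super().__init__()
        # Learnable prompt embeddings, shared across the batch.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim) visual tokens from a frozen tokenizer/encoder
        b = image_tokens.size(0)
        return torch.cat([self.prompts.expand(b, -1, -1), image_tokens], dim=1)

tokens = torch.randn(4, 196, 768)       # (batch, visual tokens, dim)
prompted = PromptedTokens()(tokens)     # (4, 206, 768): prompts prepended
print(prompted.shape)
```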
arXiv Detail & Related papers (2022-10-03T14:56:05Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
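One concrete instance of the near-equivalence mentioned above: applying a fully-connected layer independently to each patch token is the same operation as a 1x1 convolution over the token grid. The small check below illustrates this; it is an illustrative example, not code from the paper.

```python
# Check that a per-token fully-connected layer equals a 1x1 convolution over
# the patch-token grid (one instance of the equivalences LIT builds on).
# Illustrative only; not code from the paper.
import torch
import torch.nn as nn

dim_in, dim_out, side = 64, 128, 14
fc = nn.Linear(dim_in, dim_out)
conv = nn.Conv2d(dim_in, dim_out, kernel_size=1)

# Share weights so the two parameterizations are numerically identical.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(dim_out, dim_in, 1, 1))
    conv.bias.copy_(fc.bias)

tokens = torch.randn(2, side * side, dim_in)                    # (B, N, C)
grid = tokens.transpose(1, 2).reshape(2, dim_in, side, side)    # (B, C, H, W)

out_fc = fc(tokens)                                             # per-token FC
out_conv = conv(grid).flatten(2).transpose(1, 2)                # 1x1 conv, reshaped back
print(torch.allclose(out_fc, out_conv, atol=1e-5))              # True
```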
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Improve Vision Transformers Training by Suppressing Over-smoothing [28.171262066145612]
Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks.
arXiv Detail & Related papers (2021-04-26T17:43:04Z)