PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
- URL: http://arxiv.org/abs/2111.12710v1
- Date: Wed, 24 Nov 2021 18:59:58 GMT
- Title: PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
- Authors: Xiaoyi Dong and Jianmin Bao and Ting Zhang and Dongdong Chen and
Weiming Zhang and Lu Yuan and Dong Chen and Fang Wen and Nenghai Yu
- Abstract summary: This paper explores a better codebook for BERT pre-training of vision transformers.
By contrast, the discrete tokens in the NLP field are naturally highly semantic.
We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
- Score: 102.7922200135147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores a better codebook for BERT pre-training of vision
transformers. The recent work BEiT successfully transfers BERT pre-training
from NLP to the vision field. It directly adopts a simple discrete VAE as the
visual tokenizer, but has not considered the semantic level of the resulting
visual tokens. By contrast, the discrete tokens in the NLP field are naturally
highly semantic. This difference motivates us to learn a perceptual codebook.
We find one simple yet effective idea that works surprisingly well: enforcing
perceptual similarity during dVAE training. We demonstrate that the visual tokens
generated by the proposed perceptual codebook do exhibit better semantic
meanings, and subsequently help pre-training achieve superior transfer
performance in various downstream tasks. For example, we achieve 84.5% Top-1
accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive
method BEiT by +1.3 with the same pre-training epochs. It also improves object
detection and segmentation on COCO val by +1.3 box AP and +1.0 mask AP, and
semantic segmentation on ADE20K by +1.0 mIoU. The code and models will be
available at \url{https://github.com/microsoft/PeCo}.
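The central idea above, enforcing perceptual similarity during dVAE training, can be illustrated with a short loss sketch. The code below is a minimal, hypothetical illustration rather than the authors' implementation: it adds a feature-space distance between the original and reconstructed images to the usual pixel-level reconstruction loss of a discrete VAE. The `dvae` interface, the choice of a pretrained VGG16 as the feature extractor, and the weight `lambda_perc` are all assumptions made for illustration.
```python
# Minimal sketch (not the PeCo implementation): adding a perceptual term to a
# dVAE reconstruction objective, as described in the abstract above.
# Assumptions: `dvae` exposes encode()/decode() with a differentiable
# (e.g. Gumbel-softmax) tokenization; VGG16 features stand in for the
# perceptual feature extractor; `lambda_perc` is a hypothetical loss weight.
import torch.nn.functional as F
import torchvision.models as tvm

# Frozen network used only to measure similarity in deep feature space.
vgg_features = tvm.vgg16(weights=tvm.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_dvae_loss(dvae, images, lambda_perc=1.0):
    """Pixel reconstruction loss plus a feature-level (perceptual) loss."""
    codes = dvae.encode(images)   # discrete visual tokens (relaxed during training)
    recon = dvae.decode(codes)    # reconstructed images, same shape as `images`
    pixel_loss = F.mse_loss(recon, images)
    # Enforce perceptual similarity: compare deep features, not only pixels.
    perceptual_loss = F.mse_loss(vgg_features(recon), vgg_features(images))
    return pixel_loss + lambda_perc * perceptual_loss
```
The abstract does not specify which feature extractor is used, so the VGG16 here should be read as a placeholder for whatever network supplies the perceptual features.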
Related papers
- Rejuvenating image-GPT as Strong Visual Representation Learners [28.77567067712619]
This paper enhances image-GPT, one of the pioneering works that introduce autoregressive pretraining to predict the next pixels.
We shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content.
Experiments showcase that D-iGPT excels as a strong learner of visual representations.
arXiv Detail & Related papers (2023-12-04T18:59:20Z)
- AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+ [44.856035786948915]
We propose an Adversarial Positional Embedding (AdPE) approach to pretrain vision transformers.
AdPE distorts the local visual structures by perturbing the position encodings.
Experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE.
arXiv Detail & Related papers (2023-03-14T02:42:01Z)
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers).
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- Bootstrapped Masked Autoencoders for Vision BERT Pretraining [142.5285802605117]
BootMAE improves the original masked autoencoders (MAE) with two core designs.
1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information in BERT pretraining.
arXiv Detail & Related papers (2022-07-14T17:59:58Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular approach to self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs the MIM proxy task with eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- BEiT: BERT Pre-Training of Image Transformers [43.704968112586876]
We introduce a self-supervised vision representation model, BEiT, which stands for Bidirectional Encoder representation from Image Transformers.
Specifically, each image has two views in our pre-training: image patches and visual tokens.
We first "tokenize" the original image into visual tokens, then randomly mask some image patches and feed them into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image patches (a minimal sketch of this objective appears after this list).
arXiv Detail & Related papers (2021-06-15T16:02:37Z)
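To make the BEiT-style pre-training objective described in the last entry concrete, here is a minimal, hypothetical sketch (not the BEiT implementation) of masked image modeling: the image is first tokenized into discrete visual tokens, a subset of patch positions is masked, and the backbone Transformer is trained with cross-entropy to recover the tokens at the masked positions. The `tokenizer` and `vit` interfaces and the 40% masking ratio are assumptions made for illustration.
```python
# Minimal sketch of a BEiT-style masked image modeling loss (not the official code).
# Assumptions: `tokenizer(images)` returns discrete token ids per patch, shape (B, N);
# `vit(images, mask)` embeds the patches, replaces masked positions with a [MASK]
# embedding, and returns logits over the visual vocabulary, shape (B, N, vocab_size);
# `mask_ratio` is illustrative.
import torch
import torch.nn.functional as F

def masked_image_modeling_loss(vit, tokenizer, images, mask_ratio=0.4):
    with torch.no_grad():
        target_tokens = tokenizer(images)                 # (B, N) visual token ids
    B, N = target_tokens.shape
    # Randomly choose the patch positions to corrupt.
    mask = torch.rand(B, N, device=images.device) < mask_ratio   # (B, N) bool
    logits = vit(images, mask)                            # (B, N, vocab_size)
    # Recover the original visual tokens only at the masked positions.
    return F.cross_entropy(logits[mask], target_tokens[mask])
```
BEiT reportedly uses blockwise rather than uniform random masking; the random mask above is a simplification of that sampling step.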