CAE v2: Context Autoencoder with CLIP Target
- URL: http://arxiv.org/abs/2211.09799v1
- Date: Thu, 17 Nov 2022 18:58:33 GMT
- Title: CAE v2: Context Autoencoder with CLIP Target
- Authors: Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi
Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding,
Jingdong Wang
- Abstract summary: Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
- Score: 63.61868058214267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked image modeling (MIM) learns visual representation by masking and
reconstructing image patches. Applying the reconstruction supervision on the
CLIP representation has been proven effective for MIM. However, it is still
under-explored how CLIP supervision in MIM influences performance. To
investigate strategies for refining the CLIP-targeted MIM, we study two
critical elements in MIM, i.e., the supervision position and the mask ratio,
and reveal two interesting perspectives, relying on our developed simple
pipeline, context autoencoder with CLIP target (CAE v2). Firstly, we observe
that the supervision on visible patches achieves remarkable performance, even
better than that on masked patches, which is the standard format in existing
MIM methods. Secondly, the optimal mask ratio positively
correlates to the model size. That is to say, the smaller the model, the lower
the mask ratio needs to be. Driven by these two discoveries, our simple and
concise approach CAE v2 achieves superior performance on a series of downstream
tasks. For example, a vanilla ViT-Large model achieves 81.7% and 86.7% top-1
accuracy on linear probing and fine-tuning on ImageNet-1K, and 55.9% mIoU on
semantic segmentation on ADE20K, with pre-training for 300 epochs. We hope
our findings can serve as helpful guidelines for pre-training in the MIM area,
especially for small-scale models.
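As a rough illustration of the two findings above, the following is a minimal sketch of a CAE v2-style training step: reconstruction supervision is applied to the CLIP features of the visible patches, and the mask ratio is chosen as a function of model size. The stand-in modules, the MSE objective, and the exact size-to-ratio mapping are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a CAE v2-style training step: the student predicts CLIP
# features, and the loss is applied on *visible* patches (the first finding
# above). All modules, shapes, and the size-to-ratio mapping are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCHES, DIM, CLIP_DIM = 196, 768, 512

class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder operating on patch embeddings."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.blocks = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.blocks(x)

student = TinyEncoder()
head = nn.Linear(DIM, CLIP_DIM)          # projects student features into CLIP space
clip_teacher = nn.Linear(DIM, CLIP_DIM)  # placeholder for a frozen CLIP visual encoder
for p in clip_teacher.parameters():
    p.requires_grad_(False)

def mask_ratio_for(num_params: float) -> float:
    """Second finding (paraphrased): the smaller the model, the lower the mask
    ratio should be. The exact breakpoints here are assumptions."""
    if num_params < 30e6:
        return 0.25
    if num_params < 100e6:
        return 0.50
    return 0.75

def train_step(patch_tokens: torch.Tensor, mask_ratio: float) -> torch.Tensor:
    B, N, _ = patch_tokens.shape
    num_masked = int(N * mask_ratio)
    ids = torch.rand(B, N).argsort(dim=1)                # random patch permutation
    visible_idx = ids[:, num_masked:]                    # keep the visible patches
    gather = visible_idx.unsqueeze(-1).expand(-1, -1, DIM)
    visible = torch.gather(patch_tokens, 1, gather)

    pred = head(student(visible))                        # predictions on visible patches
    with torch.no_grad():
        target = clip_teacher(torch.gather(patch_tokens, 1, gather))
    return F.mse_loss(pred, target)                      # supervision on visible patches

loss = train_step(torch.randn(2, PATCHES, DIM), mask_ratio_for(86e6))  # ViT-Base-sized
loss.backward()
```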
Related papers
- Adapting LLaMA Decoder to Vision Transformer [65.47663195233802]
This work examines whether decoder-only Transformers such as LLaMA can be adapted to the computer vision field.
We first "LLaMAfy" a standard ViT step-by-step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue.
We develop a soft mask strategy that gradually introduces a causal mask into the self-attention at the onset of training to ease optimization.
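A hedged sketch of one way such a soft mask could be scheduled: an additive attention bias interpolates from fully bidirectional to effectively causal over a warm-up period. The linear schedule and additive-bias form are assumptions; the paper only describes the strategy qualitatively.

```python
# Hedged sketch: blend a bidirectional attention pattern into a causal one
# over the first training steps. The linear schedule and the additive-bias
# formulation are assumptions, not the paper's exact recipe.
import torch

def soft_causal_bias(seq_len: int, step: int, warmup_steps: int) -> torch.Tensor:
    """Additive attention bias: all zeros at step 0 (bidirectional), large
    negative values above the diagonal once step >= warmup_steps (causal)."""
    alpha = min(step / max(warmup_steps, 1), 1.0)                  # 0 -> 1 over warm-up
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    bias = torch.zeros(seq_len, seq_len)
    bias[future] = -1e4 * alpha                                    # penalize "future" tokens
    return bias

# usage: add the bias to the attention logits before the softmax
logits = torch.randn(8, 197, 197)                                  # (heads, tokens, tokens)
attn = (logits + soft_causal_bias(197, step=500, warmup_steps=2000)).softmax(dim=-1)
```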
arXiv Detail & Related papers (2024-04-10T06:30:08Z)
- MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations [16.885965702357314]
MIM-Refiner is a contrastive learning boost for pre-trained MIM models.
It refines the features of pre-trained MIM models from subpar to state-of-the-art, off-the-shelf quality.
arXiv Detail & Related papers (2024-02-15T16:46:16Z)
- RevColV2: Exploring Disentangled Representations in Masked Image Modeling [12.876864261893909]
Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance.
Existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning.
We propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning.
arXiv Detail & Related papers (2023-09-02T18:41:27Z)
- Improving Pixel-based MIM by Reducing Wasted Modeling Capability [77.99468514275185]
We propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction.
To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures.
Our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
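As a rough illustration of multi-level feature fusion in an isotropic (ViT-style) stack, here is a hedged sketch: features from a few shallow blocks are combined with the final features by a learned weighted sum before the pixel decoder. The fusion rule, layer choices, and module names are assumptions, not the paper's design.

```python
# Hedged sketch of multi-level feature fusion: shallow-layer features are
# fused with the final features before pixel reconstruction. The learned
# weighted sum and the chosen layers are assumptions for illustration.
import torch
import torch.nn as nn

class FusionMIM(nn.Module):
    def __init__(self, dim=768, depth=12, fuse_layers=(2, 4)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )
        self.fuse_layers = set(fuse_layers)
        # one learnable weight per fused source (shallow layers + final layer)
        self.fuse_weights = nn.Parameter(torch.ones(len(fuse_layers) + 1))
        self.pixel_decoder = nn.Linear(dim, 16 * 16 * 3)   # patch -> RGB pixel values

    def forward(self, tokens):
        feats, x = [], tokens
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.fuse_layers:
                feats.append(x)                            # keep low-level features
        feats.append(x)                                    # final (high-level) features
        w = self.fuse_weights.softmax(dim=0)
        fused = sum(wi * f for wi, f in zip(w, feats))     # learned weighted fusion
        return self.pixel_decoder(fused)                   # per-patch pixel prediction

pixels = FusionMIM()(torch.randn(2, 196, 768))             # -> (2, 196, 768) pixel targets
```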
arXiv Detail & Related papers (2023-08-01T03:44:56Z)
- Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders [17.564722905991776]
We present a pipeline of Image to Vector (Img2Vec) for masked image modeling (MIM) with deep features.
Img2Vec is a simple yet effective framework tailored to deep-feature MIM learning, achieving strong overall performance on representative vision tasks.
arXiv Detail & Related papers (2023-04-25T03:01:37Z)
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
- Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations [130.05189514598996]
Masked Image Modeling (MIM) and Contrastive Learning (CL) demonstrate that self-supervision is powerful for learning good representations.
In this paper, we make the empirical observation that a naive joint optimization of CL and MIM losses leads to conflicting gradient directions.
Based on these observations, we find that MIM and CL are suited to the lower and higher layers, respectively.
We propose to combine them in a surprisingly simple "sequential cascade" fashion: early layers are first trained under an MIM loss, on top of which later layers continue to be trained under a CL loss.
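A hedged sketch of that cascade: lower blocks are first optimized with an MIM-style reconstruction loss, then frozen while the upper blocks are optimized with a contrastive (InfoNCE-style) loss. The toy layers, the split point, and both loss placeholders are assumptions for illustration, not the paper's training recipe.

```python
# Hedged sketch of the "sequential cascade": lower blocks are first trained
# with an MIM-style reconstruction loss, then frozen while the upper blocks
# are trained with a contrastive (InfoNCE-style) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, depth, split = 256, 8, 4
blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
lower, upper = blocks[:split], blocks[split:]

def run(layers, x):
    for blk in layers:
        x = torch.relu(blk(x))
    return x

x = torch.randn(32, dim)

# Stage 1: train only the lower layers with a reconstruction (MIM-style) loss.
opt1 = torch.optim.AdamW(lower.parameters(), lr=1e-3)
loss_mim = F.mse_loss(run(lower, x), x)        # placeholder reconstruction target
loss_mim.backward()
opt1.step()

# Stage 2: freeze the lower layers; train the upper layers with a CL loss.
for p in lower.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.AdamW(upper.parameters(), lr=1e-3)
z1 = run(upper, run(lower, x))
z2 = run(upper, run(lower, x + 0.1 * torch.randn_like(x)))   # crude second "view"
sim = F.normalize(z1, dim=-1) @ F.normalize(z2, dim=-1).T / 0.2
loss_cl = F.cross_entropy(sim, torch.arange(len(x)))         # InfoNCE-style objective
loss_cl.backward()
opt2.step()
```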
arXiv Detail & Related papers (2023-02-27T20:52:10Z)
- TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
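A minimal sketch of the general recipe, under our own assumptions: a small student is trained to match the features of a large, frozen MIM-pre-trained teacher through a projection layer. The teacher/student stand-ins, the feature target, and the smooth-L1 objective are illustrative, not necessarily the paper's configuration.

```python
# Hedged sketch of distilling a large MIM-pre-trained teacher into a small
# student via feature matching. The teacher/student stand-ins, the projection,
# and the smooth-L1 objective are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768)).eval()
student = nn.Sequential(nn.Linear(768, 384), nn.GELU(), nn.Linear(384, 384))
proj = nn.Linear(384, 768)                      # align student width to teacher width
for p in teacher.parameters():                  # the teacher stays frozen
    p.requires_grad_(False)

tokens = torch.randn(2, 196, 768)               # one batch of patch tokens
with torch.no_grad():
    t_feat = teacher(tokens)                    # distillation target
s_feat = proj(student(tokens))
loss = F.smooth_l1_loss(s_feat, t_feat)         # feature-matching distillation loss
loss.backward()
```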
arXiv Detail & Related papers (2023-01-03T18:59:54Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
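A rough sketch of masked prediction against a visual tokenizer, as described above: each patch is assigned a discrete code by a frozen tokenizer, masked patches are replaced by a mask token, and the model is trained with cross-entropy to predict the codes at the masked positions. The nearest-neighbour codebook below is only a stand-in for BEiT v2's semantic-rich tokenizer, and the shapes and mask ratio are assumptions.

```python
# Hedged sketch: predict the discrete "visual tokens" of masked patches. The
# nearest-neighbour codebook is a stand-in tokenizer, not BEiT v2's
# semantic-rich tokenizer; shapes and the 40% mask ratio are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, N = 8192, 768, 196
codebook = nn.Embedding(VOCAB, DIM)             # stand-in tokenizer codebook
encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU())
to_logits = nn.Linear(DIM, VOCAB)
mask_token = nn.Parameter(torch.zeros(DIM))

patches = torch.randn(2, N, DIM)
with torch.no_grad():                           # tokenize: nearest codebook entry per patch
    dists = torch.cdist(patches.reshape(-1, DIM), codebook.weight)   # (2*N, VOCAB)
    token_ids = dists.argmin(dim=-1).reshape(2, N)                   # discrete targets

mask = torch.rand(2, N) < 0.4                   # mask ~40% of the patches
inputs = torch.where(mask.unsqueeze(-1), mask_token.expand_as(patches), patches)
logits = to_logits(encoder(inputs))             # (2, N, VOCAB)
loss = F.cross_entropy(logits[mask], token_ids[mask])   # predict masked tokens only
loss.backward()
```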
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- Revealing the Dark Secrets of Masked Image Modeling [25.221516344869805]
Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear.
In this paper, we compare MIM with long-dominant supervised pre-trained models from two perspectives: visualizations and experiments.
We find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers.
arXiv Detail & Related papers (2022-05-26T17:59:49Z)