Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders
- URL: http://arxiv.org/abs/2304.12535v1
- Date: Tue, 25 Apr 2023 03:01:37 GMT
- Title: Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders
- Authors: Heng Pan, Chenyang Liu, Wenxiao Wang, Li Yuan, Hongfa Wang, Zhifeng
Li, Wei Liu
- Abstract summary: We present a pipeline of Image to Vector (Img2Vec) for masked image modeling (MIM) with deep features.
Img2Vec is a simple yet effective framework tailored to deep-feature MIM learning, delivering strong overall performance on representative vision tasks.
- Score: 17.564722905991776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a pipeline of Image to Vector (Img2Vec) for masked image modeling
(MIM) with deep features. To study which type of deep features is appropriate
for MIM as a learning target, we propose a simple MIM framework with a series of
well-trained self-supervised models to convert an Image to a feature Vector as
the learning target of MIM, where the feature extractor is also known as a
teacher model. Surprisingly, we empirically find that an MIM model benefits
more from image features generated by some lighter models (e.g., ResNet-50,
26M) than from those by a cumbersome teacher like Transformer-based models
(e.g., ViT-Large, 307M). To analyze this remarkable phenomenon, we devise a
novel attribute, token diversity, to evaluate the characteristics of generated
features from different models. Token diversity measures the feature
dissimilarity among different tokens. Through extensive experiments and
visualizations, we hypothesize that, beyond the common understanding that a
large model can improve MIM, high token diversity in the teacher model is also
crucial. Based on this observation, Img2Vec adopts a teacher model with high
token diversity to generate image features. Img2Vec pre-trained on unlabeled
ImageNet data with ViT-B yields 85.1% top-1 accuracy after fine-tuning.
Moreover, we scale Img2Vec up to larger models, ViT-L and ViT-H, and obtain
86.7% and 87.5% accuracy, respectively. It also achieves state-of-the-art
results on other downstream tasks, e.g., 51.8% mAP on COCO and 50.7% mIoU on
ADE20K. Img2Vec is a simple yet effective framework tailored to deep-feature
MIM learning, delivering strong overall performance on representative vision
tasks.
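The abstract defines token diversity only at a high level (feature dissimilarity among the tokens a teacher produces) and describes MIM training against teacher features; the exact formulas are not given here. The sketch below is a rough, hypothetical illustration under those assumptions: it uses mean pairwise cosine dissimilarity as one plausible token-diversity measure and a smooth-L1 masked feature-regression loss, and the function names and loss choice are not from the paper.

```python
# Hypothetical sketch (not the paper's code): token diversity as mean pairwise
# cosine dissimilarity, and a masked feature-regression loss against a teacher.
import torch
import torch.nn.functional as F

def token_diversity(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (N, D) token features produced by a teacher for one image."""
    t = F.normalize(tokens, dim=-1)             # unit-normalize each token
    sim = t @ t.T                               # (N, N) pairwise cosine similarity
    n = tokens.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()
    mean_sim = off_diag / (n * (n - 1))         # average over distinct token pairs
    return 1.0 - mean_sim                       # higher value => more diverse tokens

def masked_feature_loss(student_pred: torch.Tensor,
                        teacher_feat: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Regress teacher token features at masked positions only.

    student_pred, teacher_feat: (N, D); mask: (N,) boolean, True = masked patch.
    Smooth-L1 is an assumed choice, not necessarily the paper's loss.
    """
    return F.smooth_l1_loss(student_pred[mask], teacher_feat[mask])
```

Under this reading, a teacher whose tokens are near-duplicates scores close to 0, while a teacher producing highly varied token features scores close to 1, which is consistent with the abstract's claim that high token diversity in the teacher is the useful property.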
Related papers
- Fine-tuning a Multiple Instance Learning Feature Extractor with Masked
Context Modelling and Knowledge Distillation [0.21756081703275998]
We propose to improve downstream MIL classification by fine-tuning the feature extractor model using Masked Context Modelling with Knowledge Distillation.
A single epoch of the proposed task suffices to increase the downstream performance of the feature extractor when used in a MIL scenario, while being considerably smaller and requiring only a fraction of its compute.
arXiv Detail & Related papers (2024-03-08T14:04:30Z) - Heuristic Vision Pre-Training with Self-Supervised and Supervised
Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z) - Heterogeneous Generative Knowledge Distillation with Masked Image
Modeling [33.95780732124864]
Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models.
We develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion.
Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models.
arXiv Detail & Related papers (2023-09-18T08:30:55Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z) - Masked Video Distillation: Rethinking Masked Feature Modeling for
Self-supervised Video Representation Learning [123.63301596019522]
Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
arXiv Detail & Related papers (2022-12-08T18:59:59Z) - CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z) - Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z) - BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)