Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training
- URL: http://arxiv.org/abs/2306.07346v1
- Date: Mon, 12 Jun 2023 18:12:19 GMT
- Title: Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training
- Authors: Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi,
Andrea Pilzer, Rita Cucchiara
- Abstract summary: We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
- Score: 59.923672191632065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of self-supervised pre-training has emerged as a promising approach
to enhance the performance of visual tasks such as image classification. In
this context, recent approaches have employed the Masked Image Modeling
paradigm, which pre-trains a backbone by reconstructing visual tokens
associated with randomly masked image patches. This masking approach, however,
introduces noise into the input data during pre-training, leading to
discrepancies that can impair performance during the fine-tuning phase.
Furthermore, input masking neglects the dependencies between corrupted patches,
increasing the inconsistencies observed in downstream fine-tuning tasks. To
overcome these issues, we propose a new self-supervised pre-training approach,
named Masked and Permuted Vision Transformer (MaPeT), that employs
autoregressive and permuted predictions to capture intra-patch dependencies. In
addition, MaPeT employs auxiliary positional information to reduce the
disparity between the pre-training and fine-tuning phases. In our experiments,
we employ a fair setting to ensure reliable and meaningful comparisons and
conduct investigations on multiple visual tokenizers, including our proposed
$k$-CLIP, which directly employs discretized CLIP features. Our results
demonstrate that MaPeT achieves competitive performance on ImageNet, compared
to baselines and competitors under the same model setting. Source code and
trained models are publicly available at: https://github.com/aimagelab/MaPeT.
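As a concrete illustration of the permuted-prediction objective, here is a minimal sketch that samples a random factorization order over patch tokens and derives the corresponding attention mask; the XLNet-style construction and all names are our assumptions, not the authors' implementation:

```python
# Minimal sketch: permutation-based attention mask for patch tokens.
import torch

def permuted_attention_mask(num_patches: int, num_predicted: int):
    """Sample a random factorization order over patches; token i may
    attend to token j only if j precedes i in the sampled order. The
    last `num_predicted` positions of the order are prediction targets."""
    order = torch.randperm(num_patches)            # random factorization order
    rank = torch.empty(num_patches, dtype=torch.long)
    rank[order] = torch.arange(num_patches)        # rank[p] = position of patch p
    mask = rank.unsqueeze(1) > rank.unsqueeze(0)   # mask[i, j]: i may attend to j
    return mask, order[-num_predicted:]

mask, targets = permuted_attention_mask(num_patches=196, num_predicted=98)
```

Likewise, a $k$-CLIP-style tokenizer can be read as nearest-centroid quantization of CLIP patch features; the codebook size and the stand-in tensors below are hypothetical:

```python
# Sketch: discretize CLIP patch features against a k-means codebook.
import torch

def kclip_tokens(patch_features: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """patch_features: (N, D) CLIP features; centroids: (K, D) fitted k-means
    centroids. Returns (N,) nearest-centroid indices used as visual tokens."""
    return torch.cdist(patch_features, centroids).argmin(dim=1)

feats = torch.randn(196, 512)      # stand-in for CLIP ViT patch features
codebook = torch.randn(8192, 512)  # stand-in for fitted centroids
tokens = kclip_tokens(feats, codebook)
```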
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
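A rough sketch of patch-level mutual attention between the two sets, under our reading of the summary (learned projections and parameters are omitted):

```python
# Sketch: symmetric cross-attention between support and query patch tokens.
import torch
import torch.nn.functional as F

def mutual_attention(support: torch.Tensor, query: torch.Tensor):
    """support: (S, D) patch tokens; query: (Q, D) patch tokens. Each set
    is re-expressed as an attention-weighted mixture of the other."""
    scale = support.size(-1) ** -0.5
    s2q = F.softmax(support @ query.T * scale, dim=-1)  # (S, Q)
    q2s = F.softmax(query @ support.T * scale, dim=-1)  # (Q, S)
    return s2q @ query, q2s @ support

sup_ctx, qry_ctx = mutual_attention(torch.randn(196, 384), torch.randn(196, 384))
```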
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE).
A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods.
We propose a simple yet effective way to boost the adversarial robustness of MAE.
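One plausible form of a test-time frequency-domain prompt, sketched under our assumptions (the paper's exact parameterization may differ): transform the input with an FFT, add a learnable perturbation to the spectrum, and invert.

```python
# Sketch: learnable additive prompt applied in the frequency domain.
import torch

class FrequencyPrompt(torch.nn.Module):
    def __init__(self, channels: int = 3, size: int = 224):
        super().__init__()
        # Real-valued prompt over the half-spectrum produced by rfft2.
        self.prompt = torch.nn.Parameter(torch.zeros(channels, size, size // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.rfft2(x)                      # complex spectrum
        return torch.fft.irfft2(spec + self.prompt, s=x.shape[-2:])

prompted = FrequencyPrompt()(torch.randn(2, 3, 224, 224))  # same shape as input
```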
arXiv Detail & Related papers (2023-08-20T16:27:17Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
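The summary names Gumbel-Softmax as the device that lets gradients flow from the image-modeling loss into the mask generator; below is a minimal sketch of such differentiable mask sampling, with the generator network itself left as a placeholder:

```python
# Sketch: differentiable patch-mask sampling via a Gumbel relaxation with a
# straight-through estimator; exact top-k masking is kept in the forward pass.
import torch

def sample_mask(logits: torch.Tensor, mask_ratio: float = 0.75, tau: float = 1.0):
    """logits: (B, N) per-patch masking scores from a generator network.
    Returns a (B, N) mask (1 = masked) that is hard in the forward pass
    but differentiable through the relaxed probabilities."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    soft = torch.sigmoid((logits + gumbel) / tau)        # relaxed Bernoulli
    k = round(logits.size(1) * mask_ratio)
    hard = torch.zeros_like(soft).scatter(1, soft.topk(k, dim=1).indices, 1.0)
    return hard + (soft - soft.detach())                 # straight-through trick

mask = sample_mask(torch.randn(8, 196))   # stand-in generator outputs
```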
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Masked Images Are Counterfactual Samples for Robust Fine-tuning [77.82348472169335]
Fine-tuning deep learning models can lead to a trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness.
We propose a novel fine-tuning method, which uses masked images as counterfactual samples that help improve the robustness of the fine-tuning model.
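As a hedged illustration, a masked counterfactual batch can be built by deleting a fraction of patches so the model cannot rely on the removed content; the random masking policy below is a simplification of whatever selection the paper actually uses:

```python
# Sketch: zero out a random subset of patches to form counterfactual inputs.
import torch

def mask_patches(images: torch.Tensor, patch: int = 16, ratio: float = 0.5):
    """images: (B, C, H, W). Returns a copy with ~`ratio` of patches zeroed."""
    b, _, h, w = images.shape
    keep = torch.rand(b, h // patch, w // patch) > ratio
    keep = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return images * keep.unsqueeze(1)          # broadcast mask over channels

counterfactuals = mask_patches(torch.randn(4, 3, 224, 224))
```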
arXiv Detail & Related papers (2023-03-06T11:51:28Z)
- Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision.
We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency.
EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
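The summary does not spell out the mechanism; one speculative reading of "self-consistency" is to reconstruct the same image under two different masks and penalize disagreement on patches hidden in both views. The sketch below encodes that assumption, not EMAE's actual objective:

```python
# Speculative sketch: consistency loss between two masked-view reconstructions.
import torch
import torch.nn.functional as F

def self_consistency(pred_a, pred_b, mask_a, mask_b):
    """pred_*: (B, N, D) per-patch reconstructions from two masked views;
    mask_*: (B, N) bool, True where the patch was masked in that view."""
    both = mask_a & mask_b                      # patches hidden in both views
    if not both.any():
        return pred_a.sum() * 0.0               # zero loss, graph preserved
    return F.mse_loss(pred_a[both], pred_b[both])

loss = self_consistency(torch.randn(2, 196, 768), torch.randn(2, 196, 768),
                        torch.rand(2, 196) > 0.25, torch.rand(2, 196) > 0.25)
```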
arXiv Detail & Related papers (2023-02-28T09:21:12Z)
- Exploring the Coordination of Frequency and Attention in Masked Image Modeling [28.418445136155512]
Masked image modeling (MIM) has dominated self-supervised learning in computer vision.
We propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches.
FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works.
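The attention-driven half of the strategy can be sketched as ranking patches by the attention they receive from the [CLS] token and discarding the rest; the frequency-domain component is omitted and all names below are our simplifications:

```python
# Sketch: keep the most-attended patches, throw away the remainder.
import torch

def select_patches(cls_attn: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """cls_attn: (B, N) attention from [CLS] to each patch (e.g. head-averaged).
    Returns (B, K) indices of the patches retained for training."""
    k = int(cls_attn.size(1) * keep_ratio)
    return cls_attn.topk(k, dim=1).indices

kept = select_patches(torch.rand(8, 196).softmax(dim=1))  # stand-in attention
```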
arXiv Detail & Related papers (2022-11-28T14:38:19Z)
- Masked Image Modeling with Denoising Contrast [30.31920660487222]
Masked image modeling currently dominates self-supervised visual representation learning, with state-of-the-art performance on vision Transformers.
We introduce a new pre-training method, ConMIM, to produce simple intra-image inter-patch contrastive constraints.
ConMIM-pretrained vision Transformers with various scales achieve promising results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks.
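A minimal sketch of an intra-image inter-patch contrastive constraint, under our assumptions about the loss form (temperature, normalization, and the target view are illustrative):

```python
# Sketch: InfoNCE over patches of a single image; the prediction for each
# masked patch should match that patch in a clean view and contrast against
# the image's other patches.
import torch
import torch.nn.functional as F

def patch_infonce(pred: torch.Tensor, target: torch.Tensor, tau: float = 0.1):
    """pred, target: (N, D) L2-normalized patch features of one image.
    Positives sit on the diagonal; other patches serve as negatives."""
    logits = pred @ target.T / tau              # (N, N) similarity matrix
    return F.cross_entropy(logits, torch.arange(pred.size(0)))

pred = F.normalize(torch.randn(196, 256), dim=-1)
target = F.normalize(torch.randn(196, 256), dim=-1)
loss = patch_infonce(pred, target)
```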
arXiv Detail & Related papers (2022-05-19T15:22:29Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular approach to self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
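One way to read "multi-choice" objectives is to replace the single hard token id per masked patch with a soft distribution over the visual codebook; the softened-tokenizer construction below is our assumption, not mc-BEiT's exact recipe:

```python
# Sketch: MIM with soft (multi-choice) targets over a visual codebook.
import torch
import torch.nn.functional as F

def multi_choice_loss(pred_logits: torch.Tensor, tokenizer_logits: torch.Tensor,
                      tau: float = 4.0) -> torch.Tensor:
    """pred_logits, tokenizer_logits: (M, V) over a V-entry codebook for M
    masked patches. Softened tokenizer outputs act as soft labels."""
    soft_targets = F.softmax(tokenizer_logits / tau, dim=-1)
    return F.cross_entropy(pred_logits, soft_targets)  # CE accepts probabilities

loss = multi_choice_loss(torch.randn(98, 8192), torch.randn(98, 8192))
```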
arXiv Detail & Related papers (2022-03-29T09:08:18Z)