OAMixer: Object-aware Mixing Layer for Vision Transformers
- URL: http://arxiv.org/abs/2212.06595v1
- Date: Tue, 13 Dec 2022 14:14:48 GMT
- Title: OAMixer: Object-aware Mixing Layer for Vision Transformers
- Authors: Hyunwoo Kang, Sangwoo Mo, Jinwoo Shin
- Abstract summary: We propose OAMixer, which calibrates the patch mixing layers of patch-based models based on the object labels.
By learning an object-centric representation, we demonstrate that OAMixer improves the classification accuracy and background robustness of various patch-based models.
- Score: 73.10651373341933
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown
impressive results on various visual recognition tasks, serving as alternatives to classic
convolutional networks. While the initial patch-based models (ViTs) treated all
patches equally, recent studies reveal that incorporating inductive bias like
spatiality benefits the representations. However, most prior works solely
focused on the location of patches, overlooking the scene structure of images.
Thus, we aim to further guide the interaction of patches using the object
information. Specifically, we propose OAMixer (object-aware mixing layer),
which calibrates the patch mixing layers of patch-based models based on the
object labels. Here, we obtain the object labels in unsupervised or
weakly-supervised manners, i.e., no additional human annotation cost is
necessary. Using the object labels, OAMixer computes a reweighting mask with a
learnable scale parameter that intensifies the interaction of patches
containing similar objects and applies the mask to the patch mixing layers. By
learning an object-centric representation, we demonstrate that OAMixer improves
the classification accuracy and background robustness of various patch-based
models, including ViTs, MLP-Mixers, and ConvMixers. Moreover, we show that
OAMixer enhances various downstream tasks, including large-scale
classification, self-supervised learning, and multi-object recognition,
verifying the generic applicability of OAMixer.
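Below is a minimal sketch of the reweighting idea described in the abstract: an object-label similarity mask with a learnable scale is multiplied into the patch-mixing weights of a self-attention layer. The class name, the exponential mask form, and the soft object-label inputs are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ObjectAwareAttention(nn.Module):
    """Illustrative self-attention layer whose patch-mixing weights are
    reweighted by an object-label similarity mask (OAMixer-style idea).
    The exact mask form and names are assumptions, not the paper's code."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable scale controlling how strongly same-object patches interact.
        self.scale_param = nn.Parameter(torch.zeros(1))

    def forward(self, x, obj_labels):
        # x: (B, N, dim) patch tokens; obj_labels: (B, N, K) soft object assignments
        B, N, D = x.shape
        h = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, h, D // h).transpose(1, 2)
        k = k.view(B, N, h, D // h).transpose(1, 2)
        v = v.view(B, N, h, D // h).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / (D // h) ** 0.5
        attn = attn.softmax(dim=-1)                      # (B, h, N, N)

        # Reweighting mask: larger when two patches share similar object labels.
        sim = obj_labels @ obj_labels.transpose(1, 2)    # (B, N, N), in [0, 1]
        mask = torch.exp(self.scale_param * sim)         # learnable intensity
        attn = attn * mask.unsqueeze(1)
        attn = attn / attn.sum(dim=-1, keepdim=True)     # renormalize rows

        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```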
Related papers
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
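As a rough illustration of permuted autoregressive prediction in general (not MaPeT's actual formulation), the sketch below builds an attention mask from a randomly sampled prediction order, so each token may only attend to tokens predicted earlier in that order.

```python
import torch


def permuted_causal_mask(num_tokens: int, generator=None):
    """Generic sketch of permutation-based autoregressive masking.
    Returns a boolean mask where mask[i, j] is True if token i may attend
    to token j, i.e., j precedes i in a randomly sampled prediction order."""
    order = torch.randperm(num_tokens, generator=generator)  # prediction order
    rank = torch.empty(num_tokens, dtype=torch.long)
    rank[order] = torch.arange(num_tokens)   # position of each token in the order
    # Token i can attend to token j iff j is predicted strictly earlier.
    mask = rank.unsqueeze(1) > rank.unsqueeze(0)
    return mask, order


# Example: a permuted prediction mask over 4 patch tokens
mask, order = permuted_causal_mask(4)
```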
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer [17.012278767127967]
We propose MaskMix and Progressive Attention Labeling (PAL) in image and label space, respectively.
From the perspective of image space, we design MaskMix, which mixes two images based on a patch-like grid mask.
From the perspective of label space, we design PAL, which utilizes a progressive factor to dynamically re-weight the attention weights of the mixed attention label.
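A rough sketch of the grid-mask mixing idea: two images are combined patch-by-patch on a patch-aligned grid and the labels are mixed by the fraction of patches taken from each image. The function name, the uniform random cell selection, and the fixed mix ratio are assumptions, not MixPro's exact recipe.

```python
import torch


def maskmix(x_a, x_b, y_a, y_b, patch_size=16, mix_ratio=0.5):
    """Sketch of grid-mask mixing: combine two images on a patch-aligned grid
    and mix the labels by the fraction of area taken from each image."""
    B, C, H, W = x_a.shape
    gh, gw = H // patch_size, W // patch_size
    # Randomly choose which grid cells come from image b.
    cell_mask = (torch.rand(B, 1, gh, gw, device=x_a.device) < mix_ratio).float()
    # Upsample the cell mask to pixel resolution (patch-aligned blocks).
    pixel_mask = cell_mask.repeat_interleave(patch_size, dim=2) \
                          .repeat_interleave(patch_size, dim=3)
    x_mix = pixel_mask * x_b + (1.0 - pixel_mask) * x_a
    # Label weight = actual fraction of area taken from image b.
    lam = cell_mask.mean(dim=(1, 2, 3)).view(B, 1)
    y_mix = lam * y_b + (1.0 - lam) * y_a
    return x_mix, y_mix
```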
arXiv Detail & Related papers (2023-04-24T12:38:09Z)
- Use the Detection Transformer as a Data Augmenter [13.15197086963704]
DeMix builds on CutMix, a simple yet highly effective data augmentation technique.
CutMix improves model performance by cutting and pasting a patch from one image onto another, yielding a new image.
DeMix instead selects a semantically rich patch, located by a pre-trained DETR.
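For reference, plain CutMix as described above can be sketched as follows; in DeMix the pasted box would be chosen from a pre-trained DETR detection rather than at random, a step omitted here.

```python
import torch


def cutmix(x_a, x_b, y_a, y_b, box):
    """Plain CutMix: paste a rectangular patch from one image onto another
    and mix the labels by the pasted area. In DeMix the box would come from
    a DETR detection (a semantically rich region) instead of a random crop."""
    x1, y1, x2, y2 = box                      # pixel coordinates of the pasted region
    _, _, H, W = x_a.shape
    x_mix = x_a.clone()
    x_mix[:, :, y1:y2, x1:x2] = x_b[:, :, y1:y2, x1:x2]
    lam = ((y2 - y1) * (x2 - x1)) / (H * W)   # fraction of area taken from image b
    y_mix = lam * y_b + (1.0 - lam) * y_a
    return x_mix, y_mix
```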
arXiv Detail & Related papers (2023-04-10T12:50:17Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
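A minimal sketch of differentiable per-patch mask sampling with Gumbel-Softmax; how AutoMAE couples this generator with the adversarial and masked-image-modeling objectives, and how a fixed masking ratio is enforced, is not shown here.

```python
import torch
import torch.nn.functional as F


def sample_patch_mask(mask_logits, tau=1.0):
    """Sketch of differentiable per-patch mask sampling with Gumbel-Softmax,
    in the spirit of a learned mask generator for masked image modeling."""
    # mask_logits: (B, N, 2) per-patch logits for (keep, mask).
    soft = F.gumbel_softmax(mask_logits, tau=tau, hard=True, dim=-1)  # straight-through
    mask = soft[..., 1]   # 1.0 where the patch is masked; gradients flow to the logits
    return mask


# Example: logits from a small mask-generator network over 196 patches
logits = torch.randn(8, 196, 2, requires_grad=True)
mask = sample_patch_mask(logits)              # (8, 196), differentiable
```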
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- SMMix: Self-Motivated Image Mixing for Vision Transformers [65.809376136455]
CutMix is a vital augmentation strategy that determines the performance and generalization ability of vision transformers (ViTs). However, the mixed images and their mixed labels can be inconsistent.
Existing CutMix variants tackle this problem by generating more consistent mixed images or more precise mixed labels.
We propose an efficient and effective Self-Motivated image Mixing method (SMMix) which motivates both image and label enhancement by the model under training itself.
arXiv Detail & Related papers (2022-12-26T00:19:39Z)
- Patches Are All You Need? [96.88889685873106]
Vision Transformer (ViT) models may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
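For context, the patch embedding referred to above is typically a strided convolution that maps each non-overlapping image region to a single token; the sketch below shows the standard form, with illustrative dimensions.

```python
import torch
import torch.nn as nn

# Standard patch-embedding stem: non-overlapping image regions are mapped to
# single token features with a strided convolution. This component is shared
# by ViTs and purely convolutional patch-based models such as ConvMixer.
patch_size, embed_dim = 16, 768
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)
tokens = patch_embed(x)                        # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)     # (1, 196, 768) sequence of patch tokens
```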
arXiv Detail & Related papers (2022-01-24T16:42:56Z)
- TransMix: Attend to Mix for Vision Transformers [26.775918851867246]
We propose TransMix, which mixes labels based on the attention maps of Vision Transformers.
The confidence of the label will be larger if the corresponding input image is weighted higher by the attention map.
TransMix consistently improves various ViT-based models across scales on ImageNet classification.
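A hedged sketch of the attention-based label weighting idea: the pasted image's label weight is taken as the share of the class token's attention that falls on the pasted patches, rather than the pasted area. Which layer and heads supply the attention map is left unspecified here.

```python
import torch


def transmix_lambda(cls_attn, box_mask):
    """Sketch of attention-based label weighting in the spirit of TransMix:
    after CutMix, the pasted image's label weight is the share of the class
    token's attention that falls on the pasted patches."""
    # cls_attn: (B, N) class-token attention over N patch tokens (rows sum to 1)
    # box_mask: (B, N) 1.0 for patches that came from the pasted image
    lam = (cls_attn * box_mask).sum(dim=1) / cls_attn.sum(dim=1)
    return lam    # (B,) label weight for the pasted image's label
```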
arXiv Detail & Related papers (2021-11-18T17:59:42Z)
- SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data [124.95585891086894]
The proposed method is called Semantically Proportional Mixing (SnapMix).
It exploits class activation map (CAM) to lessen the label noise in augmenting fine-grained data.
Our method consistently outperforms existing mixing-based approaches.
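A hedged sketch of CAM-based semantic label weighting in this spirit: each image's label weight reflects how much of its class activation map the mixed image retains, instead of the raw pixel area. The CAM computation and the box sampling are omitted, and the function name is an assumption.

```python
import torch


def snapmix_label_weights(cam_a, cam_b, box_a, box_b):
    """Sketch of CAM-based semantic label weighting: each image's label weight
    is the share of its class activation map that the mixed image contains."""
    def region_ratio(cam, box):
        x1, y1, x2, y2 = box
        return cam[:, y1:y2, x1:x2].sum(dim=(1, 2)) / cam.sum(dim=(1, 2)).clamp(min=1e-8)

    # cam_a / cam_b: (B, H, W) non-negative class activation maps
    rho_a = 1.0 - region_ratio(cam_a, box_a)   # semantic share of image a that remains
    rho_b = region_ratio(cam_b, box_b)         # semantic share of image b that is pasted in
    return rho_a, rho_b                        # per-image label weights (need not sum to 1)
```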
arXiv Detail & Related papers (2020-12-09T03:37:30Z)