MixPro: Data Augmentation with MaskMix and Progressive Attention
Labeling for Vision Transformer
- URL: http://arxiv.org/abs/2304.12043v2
- Date: Mon, 7 Aug 2023 10:20:59 GMT
- Title: MixPro: Data Augmentation with MaskMix and Progressive Attention
Labeling for Vision Transformer
- Authors: Qihao Zhao and Yangyu Huang and Wei Hu and Fan Zhang and Jun Liu
- Abstract summary: We propose MaskMix and Progressive Attention Labeling (PAL) in image and label space, respectively.
From the perspective of image space, we design MaskMix, which mixes two images based on a patch-like grid mask.
From the perspective of label space, we design PAL, which utilizes a progressive factor to dynamically re-weight the attention weights of the mixed attention label.
- Score: 17.012278767127967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently proposed data augmentation TransMix employs attention labels to
help vision transformers (ViTs) achieve better robustness and performance.
However, TransMix falls short in two respects: 1) its image cropping method
may not be suitable for ViTs; 2) at the early stage of training, the model
produces unreliable attention maps, which TransMix nevertheless uses to
compute mixed attention labels that can mislead the model. To address
the aforementioned issues, we propose MaskMix and Progressive Attention
Labeling (PAL) in image and label space, respectively. In detail, from the
perspective of image space, we design MaskMix, which mixes two images based on
a patch-like grid mask. In particular, the size of each mask patch is
adjustable and is a multiple of the image patch size, which ensures that each
image patch comes from only one image and contains more global content. From the
perspective of label space, we design PAL, which utilizes a progressive factor
to dynamically re-weight the attention weights of the mixed attention label.
Finally, we combine MaskMix and Progressive Attention Labeling as our new data
augmentation method, named MixPro. The experimental results show that our
method improves ViT-based models at various scales on ImageNet
classification (73.8% top-1 accuracy with DeiT-T trained for 300 epochs). After
being pre-trained with MixPro on ImageNet, the ViT-based models also
demonstrate better transferability to semantic segmentation, object detection,
and instance segmentation. Furthermore, compared to TransMix, MixPro also shows
stronger robustness on several benchmarks. The code is available at
https://github.com/fistyee/MixPro.
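To make the two components concrete, the following is a minimal PyTorch-style sketch of a patch-aligned grid-mask mix and the progressive re-weighting of the mixing ratio. The function names (maskmix, pal_lambda) and the linear schedule for the progressive factor are illustrative assumptions, not the API of the linked repository.

```python
import torch

def maskmix(x1, x2, patch_size=16, mask_patch_mult=2, ratio=0.5):
    # MaskMix sketch: mix two images (C, H, W) with a grid mask whose cells
    # are a multiple of the ViT patch size, so each image patch comes from
    # exactly one source image. Assumes H and W are divisible by the cell size.
    _, H, W = x1.shape
    cell = patch_size * mask_patch_mult              # mask-cell edge in pixels
    cells = (torch.rand(H // cell, W // cell) < ratio).float()
    mask = cells.repeat_interleave(cell, 0).repeat_interleave(cell, 1)
    mixed = x1 * (1 - mask) + x2 * mask
    lam_area = mask.mean().item()                    # area-based mixing ratio
    return mixed, lam_area

def pal_lambda(lam_area, lam_attn, progress):
    # PAL sketch: early in training (progress near 0) trust the area-based
    # ratio; as training progresses, shift weight toward the attention-based
    # ratio computed from the model's own attention maps.
    alpha = progress                                 # assumed linear schedule
    return (1 - alpha) * lam_area + alpha * lam_attn
```

The mixed target would then be lam * y2 + (1 - lam) * y1 with lam = pal_lambda(lam_area, lam_attn, epoch / total_epochs), where lam_attn is derived from the class-token attention over the mixed image, as in TransMix.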
Related papers
- SpliceMix: A Cross-scale and Semantic Blending Augmentation Strategy for
Multi-label Image Classification [46.8141860303439]
We introduce a simple but effective augmentation strategy for multi-label image classification, namely SpliceMix.
The "splice" in our method is two-fold: 1) Each mixed image is a splice of several downsampled images in the form of a grid, where the semantics of images attending to mixing are blended without object deficiencies for alleviating co-occurred bias; 2) We splice mixed images and the original mini-batch to form a new SpliceMixed mini-batch, which allows an image with different scales to contribute to training together.
arXiv Detail & Related papers (2023-11-26T05:45:27Z) - Use the Detection Transformer as a Data Augmenter [13.15197086963704]
DeMix builds on CutMix, a simple yet highly effective data augmentation technique.
CutMix improves model performance by cutting and pasting a patch from one image onto another, yielding a new image.
Rather than cutting at random, DeMix selects a semantically rich patch located by a pre-trained DETR.
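For reference, a minimal sketch of the CutMix step that DeMix builds on (DeMix would replace the random box below with one located by a pre-trained DETR, which is omitted here):

```python
import numpy as np
import torch

def cutmix(x, y, alpha=1.0):
    # CutMix sketch: paste a random rectangle from a shuffled copy of the
    # batch over the originals, then re-weight labels by the pasted area.
    B, _, H, W = x.shape
    lam = np.random.beta(alpha, alpha)               # target keep-ratio for x
    perm = torch.randperm(B)
    rh, rw = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    top, bot = np.clip(cy - rh // 2, 0, H), np.clip(cy + rh // 2, 0, H)
    lft, rgt = np.clip(cx - rw // 2, 0, W), np.clip(cx + rw // 2, 0, W)
    x[:, :, top:bot, lft:rgt] = x[perm, :, top:bot, lft:rgt]  # in-place paste
    lam = 1 - ((bot - top) * (rgt - lft)) / (H * W)  # actual kept-area ratio
    # Train with: lam * loss(out, y) + (1 - lam) * loss(out, y[perm])
    return x, y, y[perm], lam
```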
arXiv Detail & Related papers (2023-04-10T12:50:17Z) - SMMix: Self-Motivated Image Mixing for Vision Transformers [65.809376136455]
CutMix is a vital augmentation strategy that largely determines the performance and generalization ability of vision transformers (ViTs).
Existing CutMix variants address the inconsistency between mixed images and their labels by generating more consistent mixed images or more precise mixed labels.
We propose an efficient and effective Self-Motivated image Mixing method (SMMix) which motivates both image and label enhancement by the model under training itself.
arXiv Detail & Related papers (2022-12-26T00:19:39Z) - OAMixer: Object-aware Mixing Layer for Vision Transformers [73.10651373341933]
We propose OAMixer, which calibrates the patch mixing layers of patch-based models based on the object labels.
By learning an object-centric representation, we demonstrate that OAMixer improves the classification accuracy and background robustness of various patch-based models.
arXiv Detail & Related papers (2022-12-13T14:14:48Z) - MagicMix: Semantic Mixing with Diffusion Models [85.43291162563652]
We explore a new task called semantic mixing, aiming at blending two different semantics to create a new concept.
We present MagicMix, a solution based on pre-trained text-conditioned diffusion models.
Our method does not require any spatial mask or re-training, yet is able to synthesize novel objects with high fidelity.
arXiv Detail & Related papers (2022-10-28T11:07:48Z) - Patches Are All You Need? [96.88889685873106]
Vision Transformer (ViT) models may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
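Concretely, a ViT-style patch embedding is commonly implemented as a non-overlapping strided convolution; a minimal sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

# Each non-overlapping 16 x 16 pixel region becomes one input token.
p, dim = 16, 384
patch_embed = nn.Conv2d(3, dim, kernel_size=p, stride=p)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)   # shape (1, 196, 384)
```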
arXiv Detail & Related papers (2022-01-24T16:42:56Z) - TransMix: Attend to Mix for Vision Transformers [26.775918851867246]
We propose TransMix, which mixes labels based on the attention maps of Vision Transformers.
An input's label is given higher confidence if the attention map assigns its image a larger weight.
TransMix consistently improves ViT-based models at various scales on ImageNet classification.
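A minimal sketch of the attention-based label weight (the function name and tensor shapes are assumptions; TransMix derives the weight from the class token's attention over patches):

```python
import torch

def transmix_lambda(attn, box_mask):
    # attn:     (B, num_patches) class-token attention over patches,
    #           averaged over heads
    # box_mask: (B, num_patches) 1 where a patch comes from the pasted image
    attn = attn / attn.sum(dim=1, keepdim=True)   # normalise per image
    lam = (attn * box_mask).sum(dim=1)            # attention mass in the box
    return lam  # mixed label: lam * y_pasted + (1 - lam) * y_original
```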
arXiv Detail & Related papers (2021-11-18T17:59:42Z) - SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained
Data [124.95585891086894]
The proposed method, Semantically Proportional Mixing (SnapMix), exploits class activation maps (CAM) to lessen label noise when augmenting fine-grained data.
SnapMix consistently outperforms existing mixing-based approaches.
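A sketch of the CAM-based label weights (the function name and shapes are assumptions; SnapMix's asymmetric weights need not sum to 1):

```python
import torch

def snapmix_weights(cam_a, cam_b, box_a, box_b):
    # cam_a, cam_b: (H, W) class activation maps for each image's own label
    # box_a, box_b: (H, W) binary masks of the removed / pasted regions
    cam_a = cam_a / cam_a.sum()          # normalise CAM energy to 1
    cam_b = cam_b / cam_b.sum()
    rho_a = 1 - (cam_a * box_a).sum()    # semantic share kept from image a
    rho_b = (cam_b * box_b).sum()        # semantic share brought by image b
    return rho_a, rho_b                  # target: rho_a * y_a + rho_b * y_b
```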
arXiv Detail & Related papers (2020-12-09T03:37:30Z)