TransMix: Attend to Mix for Vision Transformers
- URL: http://arxiv.org/abs/2111.09833v1
- Date: Thu, 18 Nov 2021 17:59:42 GMT
- Title: TransMix: Attend to Mix for Vision Transformers
- Authors: Jie-Neng Chen, Shuyang Sun, Ju He, Philip Torr, Alan Yuille, Song Bai
- Abstract summary: We propose TransMix, which mixes labels based on the attention maps of Vision Transformers.
The confidence of the label will be larger if the corresponding input image is weighted higher by the attention map.
TransMix consistently improves various ViT-based models across scales on ImageNet classification.
- Score: 26.775918851867246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixup-based augmentation has been found to be effective for generalizing
models during training, especially for Vision Transformers (ViTs) since they
can easily overfit. However, previous mixup-based methods implicitly assume that
the targets should be interpolated with the same ratio that is used to
interpolate the inputs. This can lead to a strange phenomenon: because of the
randomness in the augmentation, the mixed image sometimes contains no valid
object at all, yet the label space still responds. To bridge this gap between
the input and label spaces, we propose
TransMix, which mixes labels based on the attention maps of Vision
Transformers. The confidence of the label will be larger if the corresponding
input image is weighted higher by the attention map. TransMix is embarrassingly
simple and can be implemented in just a few lines of code without introducing
any extra parameters and FLOPs to ViT-based models. Experimental results show
that our method consistently improves various ViT-based models across scales on
ImageNet classification. After pre-training with TransMix on ImageNet, the
ViT-based models also demonstrate better transferability to semantic
segmentation, object detection and instance segmentation. TransMix also
proves more robust when evaluated on 4 different benchmarks. Code will
be made publicly available at https://github.com/Beckschen/TransMix.
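The abstract notes that the method can be implemented in a few lines of code without extra parameters or FLOPs. As a rough illustration (not the authors' released implementation), the sketch below contrasts the standard CutMix label ratio, which is tied to the pasted area, with a TransMix-style ratio derived from the attention map; the attention extraction, tensor shapes, and the function names (`cutmix_area_lambda`, `transmix_lambda`, `mix_targets`) are assumptions made here for illustration.

```python
import torch

def cutmix_area_lambda(mask: torch.Tensor) -> torch.Tensor:
    """Standard CutMix label ratio: the fraction of patches pasted from image B.

    mask: binary patch-level mask of shape [B, N], 1 where a patch comes from image B.
    """
    return mask.float().mean(dim=-1)                           # [B]

def transmix_lambda(attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """TransMix-style label ratio: the attention mass falling on the pasted region.

    attn: class-token attention over the N patch tokens, shape [B, N]
          (assumed to be extracted from the ViT during the forward pass).
    mask: binary patch-level mask of shape [B, N], 1 where a patch comes from image B.
    """
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return (attn * mask.float()).sum(dim=-1)                   # [B]

def mix_targets(y_a: torch.Tensor, y_b: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """Mix one-hot targets, assigning the per-sample ratio `lam` to image B's label."""
    lam = lam.unsqueeze(-1)                                    # [B, 1]
    return (1.0 - lam) * y_a + lam * y_b
```

With the area-based ratio, a pasted crop that contains only background still shifts the label toward the second image; with the attention-derived ratio, the same crop receives little label mass because the model attends elsewhere, which is the input-label gap the paper targets.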
Related papers
- Cross-modal Orthogonal High-rank Augmentation for RGB-Event
Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks a specific modality of some tokens to enforce tokens from different modalities to interact proactively.
Experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
arXiv Detail & Related papers (2023-07-09T08:58:47Z) - MixPro: Data Augmentation with MaskMix and Progressive Attention
Labeling for Vision Transformer [17.012278767127967]
We propose MaskMix and Progressive Attention Labeling (PAL) in the image and label space, respectively.
From the perspective of image space, we design MaskMix, which mixes two images based on a patch-like grid mask (see the patch-grid sketch after this list).
From the perspective of label space, we design PAL, which utilizes a progressive factor to dynamically re-weight the attention weights of the mixed attention label.
arXiv Detail & Related papers (2023-04-24T12:38:09Z) - SMMix: Self-Motivated Image Mixing for Vision Transformers [65.809376136455]
CutMix is a vital augmentation strategy that determines the performance and generalization ability of vision transformers (ViTs), but the mixed images it produces can be inconsistent with their mixed labels.
Existing CutMix variants tackle this problem by generating more consistent mixed images or more precise mixed labels.
We propose an efficient and effective Self-Motivated image Mixing method (SMMix) which motivates both image and label enhancement by the model under training itself.
arXiv Detail & Related papers (2022-12-26T00:19:39Z) - OAMixer: Object-aware Mixing Layer for Vision Transformers [73.10651373341933]
We propose OAMixer, which calibrates the patch mixing layers of patch-based models based on the object labels.
By learning an object-centric representation, we demonstrate that OAMixer improves the classification accuracy and background robustness of various patch-based models.
arXiv Detail & Related papers (2022-12-13T14:14:48Z) - TokenMixup: Efficient Attention-guided Token-level Data Augmentation for
Transformers [8.099977107670917]
TokenMixup is an efficient attention-guided token-level data augmentation method.
A variant of TokenMixup mixes tokens within a single instance, thereby enabling multi-scale feature augmentation.
Experiments show that our methods significantly improve the baseline models' performance on CIFAR and ImageNet-1K.
arXiv Detail & Related papers (2022-10-14T06:36:31Z) - Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z) - Patches Are All You Need? [96.88889685873106]
Vision Transformer (ViT) models may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
arXiv Detail & Related papers (2022-01-24T16:42:56Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup into a transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
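As referenced in the MixPro entry above, the following is a rough, hypothetical sketch of mixing two images with a patch-aligned binary grid mask; the grid granularity, mask sampling, and the `patch_grid_mix` name are assumptions for illustration, not the MixPro implementation.

```python
import torch
import torch.nn.functional as F

def patch_grid_mix(img_a: torch.Tensor, img_b: torch.Tensor,
                   patch: int = 16, keep_prob: float = 0.5):
    """Mix two images with a random binary mask defined on a patch grid.

    img_a, img_b: [B, C, H, W] with H and W divisible by `patch`.
    Returns the mixed image and the pixel-area fraction taken from img_b.
    """
    b, _, h, w = img_a.shape
    gh, gw = h // patch, w // patch
    grid = (torch.rand(b, 1, gh, gw, device=img_a.device) < keep_prob).float()
    mask = F.interpolate(grid, scale_factor=patch, mode="nearest")  # upsample to pixels
    mixed = mask * img_b + (1.0 - mask) * img_a
    lam_b = mask.mean(dim=(1, 2, 3))                                # fraction from img_b
    return mixed, lam_b
```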