Exploring and Improving Mobile Level Vision Transformers
- URL: http://arxiv.org/abs/2108.13015v1
- Date: Mon, 30 Aug 2021 06:42:49 GMT
- Title: Exploring and Improving Mobile Level Vision Transformers
- Authors: Pengguang Chen, Yixin Chen, Shu Liu, Mingchang Yang, Jiaya Jia
- Abstract summary: We study the vision transformer structure at the mobile level and find a dramatic performance drop.
We propose a novel irregular patch embedding module and an adaptive patch fusion module to improve the performance.
- Score: 81.7741384218121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the vision transformer structure at the mobile level in this paper
and find a dramatic performance drop. We analyze the reason behind this
phenomenon and propose a novel irregular patch embedding module and an adaptive
patch fusion module to improve the performance. We conjecture that the vision
transformer blocks (which consist of multi-head attention and a feed-forward
network) are better suited to handling high-level information than low-level
features. The irregular patch embedding module extracts patches that contain
rich high-level information with different receptive fields. The transformer
blocks can then obtain the most useful information from these irregular patches.
Finally, the processed patches pass through the adaptive patch fusion module to produce
the final features for the classifier. With our proposed improvements, the
traditional uniform vision transformer structure achieves state-of-the-art
results at the mobile level. We improve the DeiT baseline by more than 9% under
mobile-level settings and surpass other transformer architectures such as Swin
and CoaT by a large margin.
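The abstract describes a three-stage pipeline: an irregular patch embedding that extracts tokens with different receptive fields, standard transformer blocks, and an adaptive fusion of the processed patches before the classifier. The following is a minimal sketch of how such a pipeline could be wired in PyTorch; the module names, the multi-branch convolutional stems, the learned per-token fusion weights, and all hyper-parameters are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the pipeline described in the abstract: irregular
# patch embedding (several receptive fields) -> transformer blocks ->
# adaptive, learned fusion of the processed patches -> classifier head.
import torch
import torch.nn as nn


class IrregularPatchEmbedding(nn.Module):
    """Extracts token sets with different receptive fields via conv stems of
    different depths, then concatenates them into one sequence (assumption)."""

    def __init__(self, in_chans=3, dim=192):
        super().__init__()
        self.branches = nn.ModuleList([
            # Shallow stem: small (16x16) receptive field per token.
            nn.Sequential(nn.Conv2d(in_chans, dim, kernel_size=16, stride=16)),
            # Deeper stem: larger receptive field, higher-level tokens.
            nn.Sequential(
                nn.Conv2d(in_chans, dim // 2, 3, stride=2, padding=1), nn.GELU(),
                nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1), nn.GELU(),
                nn.Conv2d(dim, dim, kernel_size=8, stride=8)),
        ])

    def forward(self, x):
        tokens = [b(x).flatten(2).transpose(1, 2) for b in self.branches]
        return torch.cat(tokens, dim=1)  # (B, N_total, dim)


class AdaptivePatchFusion(nn.Module):
    """Fuses the processed patches with learned per-token weights (assumption)."""

    def __init__(self, dim=192):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens):
        w = self.score(tokens).softmax(dim=1)  # (B, N, 1) attention weights
        return (w * tokens).sum(dim=1)         # (B, dim) pooled feature


class MobileLevelViT(nn.Module):
    """Hypothetical end-to-end model name; not from the paper."""

    def __init__(self, dim=192, depth=4, num_classes=1000):
        super().__init__()
        self.embed = IrregularPatchEmbedding(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.fusion = AdaptivePatchFusion(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):  # x: (B, 3, 224, 224)
        return self.head(self.fusion(self.blocks(self.embed(x))))
```

The sketch tries to capture the two design points stated in the abstract: the embedding, rather than the transformer blocks, is responsible for producing higher-level tokens with varied receptive fields, and the final pooling weighs the processed patches adaptively instead of averaging them uniformly.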
Related papers
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer (a hedged code sketch of this wiring appears after the related-papers list below).
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
The regularization objective TokenProp is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
- SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning.
The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Improve Vision Transformers Training by Suppressing Over-smoothing [28.171262066145612]
Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks.
arXiv Detail & Related papers (2021-04-26T17:43:04Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT) which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
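As referenced in the Skip-Layer Attention entry above, SLA lets queries in one layer attend to keys and values from both the current layer and the preceding one. Below is a hedged sketch of one way that wiring could look in PyTorch; concatenating the two token sets before the key/value projections, the pre-norm block structure, and all dimensions are assumptions for illustration, not the SLA paper's exact formulation.

```python
# Hedged sketch of the skip-layer attention idea: queries come from the
# current layer's tokens, while keys/values are built from the current
# tokens plus the input to the preceding layer (assumed wiring).
import torch
import torch.nn as nn


class SkipLayerAttention(nn.Module):
    def __init__(self, dim=192, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_curr, x_prev=None):
        # Keys/values: current tokens plus previous-layer tokens when available.
        kv = x_curr if x_prev is None else torch.cat([x_curr, x_prev], dim=1)
        out, _ = self.attn(query=x_curr, key=kv, value=kv, need_weights=False)
        return out


class SLABlock(nn.Module):
    """Pre-norm transformer block whose attention also sees the previous
    layer's tokens (assumed block structure)."""

    def __init__(self, dim=192, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = SkipLayerAttention(dim, num_heads)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, x_prev=None):
        x = x + self.attn(self.norm1(x),
                          None if x_prev is None else self.norm1(x_prev))
        return x + self.mlp(self.norm2(x))


# Usage: keep the previous layer's input around while stacking blocks.
blocks = nn.ModuleList([SLABlock() for _ in range(4)])
x = torch.randn(2, 196, 192)  # (batch, tokens, dim)
prev = None
for blk in blocks:
    x, prev = blk(x, prev), x
```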