A Close Look at Spatial Modeling: From Attention to Convolution
- URL: http://arxiv.org/abs/2212.12552v1
- Date: Fri, 23 Dec 2022 19:13:43 GMT
- Title: A Close Look at Spatial Modeling: From Attention to Convolution
- Authors: Xu Ma, Huan Wang, Can Qin, Kunpeng Li, Xingchen Zhao, Jie Fu, Yun Fu
- Abstract summary: Vision Transformers have shown great promise recently for many vision tasks due to their insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
- Score: 70.5571582194057
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers have shown great promise recently for many vision tasks
due to their insightful architecture design and attention mechanism. By
revisiting the self-attention responses in Transformers, we empirically observe
two interesting issues. First, Vision Transformers present a query-irrelevant
behavior at deep layers, where the attention maps exhibit nearly consistent
contexts in the global scope, regardless of the query patch position (and are also
head-irrelevant). Second, the attention maps are intrinsically sparse: a few
tokens dominate the attention weights, and introducing knowledge from ConvNets
largely smooths the attention and enhances the performance. Motivated by the
above observations, we generalize the self-attention formulation to abstract a
query-irrelevant global context directly and further integrate the global
context into convolutions. The resulting model, a Fully Convolutional Vision
Transformer (i.e., FCViT), consists purely of convolutional layers and firmly
inherits the merits of both the attention mechanism and convolutions, including
the dynamic property, weight sharing, and short- and long-range feature modeling.
Experimental results demonstrate the effectiveness of FCViT. With fewer
than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7%
top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still
perform better than the previous state-of-the-art ConvNeXt with even fewer
parameters. FCViT-based models also demonstrate promising transferability to
downstream tasks, like object detection, instance segmentation, and semantic
segmentation. Codes and models are made available at:
https://github.com/ma-xu/FCViT.
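To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a token mixer that replaces query-dependent self-attention with a single query-irrelevant global context and fuses it with a depth-wise convolution. The class name `GlobalContextConvBlock`, the pooling-based context, and the gating fusion are illustrative assumptions for exposition, not the authors' exact FCViT design; see the official repository above for the real implementation.

```python
# Sketch only: query-irrelevant global context fused with convolutions.
# Assumed simplification of the idea in the abstract, not the FCViT code.
import torch
import torch.nn as nn


class GlobalContextConvBlock(nn.Module):
    """Token mixer combining a shared (query-irrelevant) global context
    with a depth-wise convolution for short-range modeling."""

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # Local branch: depth-wise convolution (weight sharing, short-range).
        self.local = nn.Conv2d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)
        # Global branch: one context vector per image, shared by all
        # positions, obtained by pooling instead of per-query attention.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.context_proj = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.Sigmoid(),  # dynamic, input-dependent gating
        )
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        local = self.local(x)
        # Every spatial position sees the same context, mirroring the
        # observed query-irrelevant attention maps at deep layers.
        context = self.context_proj(self.pool(x))  # (B, C, 1, 1)
        # Fuse long-range (global) and short-range (local) information.
        return self.norm(local * context + x)


if __name__ == "__main__":
    block = GlobalContextConvBlock(dim=64)
    feats = torch.randn(2, 64, 56, 56)
    print(block(feats).shape)  # torch.Size([2, 64, 56, 56])
```

The design choice illustrated here is the one the abstract motivates: because the attention maps are nearly identical across queries at deep layers, a single pooled context can stand in for the full attention matrix, leaving convolutions to handle the remaining local modeling.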