HorNet: Efficient High-Order Spatial Interactions with Recursive Gated
Convolutions
- URL: http://arxiv.org/abs/2207.14284v1
- Date: Thu, 28 Jul 2022 17:59:02 GMT
- Title: HorNet: Efficient High-Order Spatial Interactions with Recursive Gated
Convolutions
- Authors: Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser-Nam Lim,
Jiwen Lu
- Abstract summary: We show that input-adaptive, long-range and high-order spatial interactions can be efficiently implemented with a convolution-based framework.
We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) that performs high-order spatial interactions with gated convolutions.
Based on the operation, we construct a new family of generic vision backbones named HorNet.
- Score: 109.33112814212129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in vision Transformers exhibits great success in various
tasks driven by the new spatial modeling mechanism based on dot-product
self-attention. In this paper, we show that the key ingredients behind the
vision Transformers, namely input-adaptive, long-range and high-order spatial
interactions, can also be efficiently implemented with a convolution-based
framework. We present the Recursive Gated Convolution
($\textit{g}^\textit{n}$Conv) that performs high-order spatial interactions
with gated convolutions and recursive designs. The new operation is highly
flexible and customizable, which is compatible with various variants of
convolution and extends the two-order interactions in self-attention to
arbitrary orders without introducing significant extra computation.
$\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve
various vision Transformers and convolution-based models. Based on the
operation, we construct a new family of generic vision backbones named HorNet.
Extensive experiments on ImageNet classification, COCO object detection and
ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and
ConvNeXt by a significant margin with similar overall architecture and training
configurations. HorNet also shows favorable scalability to more training data
and a larger model size. Apart from the effectiveness in visual encoders, we
also show $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders
and consistently improve dense prediction performance with less computation.
Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic
module for visual modeling that effectively combines the merits of both vision
Transformers and CNNs. Code is available at
https://github.com/raoyongming/HorNet
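For illustration, here is a minimal PyTorch sketch of an order-n recursive gated convolution as described in the abstract: a pointwise projection, a single depthwise convolution that yields one gate map per order, and a recursion that alternates 1x1 projections with elementwise gating. The channel-splitting scheme, kernel size, and names below are assumptions made for readability, not necessarily the authors' exact implementation (see the repository above for that).

```python
import torch
import torch.nn as nn

class GnConv(nn.Module):
    """Sketch of an order-n recursive gated convolution (gnConv).

    dim should be divisible by 2 ** (order - 1).
    """
    def __init__(self, dim: int, order: int = 3, kernel_size: int = 7):
        super().__init__()
        # Channel width doubles at each recursion step, e.g. [dim//4, dim//2, dim].
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)
        # One depthwise conv produces all gate maps at once (cheap spatial mixing).
        total = sum(self.dims)
        self.dwconv = nn.Conv2d(total, total, kernel_size,
                                padding=kernel_size // 2, groups=total)
        # 1x1 convs lift the features to the next order's channel width.
        self.pws = nn.ModuleList(
            nn.Conv2d(self.dims[i], self.dims[i + 1], kernel_size=1)
            for i in range(order - 1)
        )
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.proj_in(x)                                    # (B, 2*dim, H, W)
        p, q = torch.split(fused, (self.dims[0], sum(self.dims)), dim=1)
        gates = torch.split(self.dwconv(q), self.dims, dim=1)      # one gate per order
        p = p * gates[0]                                           # first-order interaction
        for i, pw in enumerate(self.pws):                          # raise the order recursively
            p = pw(p) * gates[i + 1]
        return self.proj_out(p)
```

Each additional order adds only a 1x1 projection and an elementwise product, and the total depthwise-convolved width stays below 2*dim, which is why extending the two-order interaction of self-attention to arbitrary orders remains cheap.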
Related papers
- ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical
Image Segmentation [10.727162449071155]
We build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance.
In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction.
arXiv Detail & Related papers (2023-09-09T02:18:17Z) - Dynamic Grained Encoder for Vision Transformers [150.02797954201424]
This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2023-01-10T07:55:29Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% in top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation (a hedged sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-11-22T01:39:45Z) - Lawin Transformer: Improving Semantic Segmentation Transformer with
Multi-Scale Representations via Large Window Attention [16.75003034164463]
Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into semantic segmentation ViT via a window attention mechanism.
Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as the encoder and LawinASPP as the decoder.
arXiv Detail & Related papers (2022-01-05T13:51:20Z) - Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attention is computed only within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z) - Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity (a sketch of this frequency-domain filtering idea appears after this list).
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z) - Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
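As referenced in the Conv2Former entry above, one plausible reading of a convolutional modulation operation is a Hadamard product between a large-kernel depthwise-convolution branch and a pointwise "value" projection, replacing the softmax attention matrix. The block below is a sketch under that assumption; the branch names, kernel size, and activation are guesses, not the paper's verified design.

```python
import torch
import torch.nn as nn

class ConvModulation(nn.Module):
    """Attention-like spatial mixing without softmax: a large-kernel depthwise
    convolution produces weights that modulate a pointwise 'value' projection."""
    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        self.a = nn.Sequential(                      # "weight" branch
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
        )
        self.v = nn.Conv2d(dim, dim, kernel_size=1)  # "value" branch
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.a(x) * self.v(x))      # elementwise modulation
```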
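The Global Filter Networks entry above describes learning long-term spatial dependencies in the frequency domain with log-linear complexity. A minimal sketch of that idea, a 2D FFT followed by an elementwise product with a learnable frequency-domain filter and an inverse FFT, is shown below; the channels-last layout and the initialization scale are assumptions.

```python
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """Mix spatial tokens globally via a learnable filter in the Fourier domain.
    Expects channels-last input of shape (B, H, W, C) with fixed H = h, W = w."""
    def __init__(self, dim: int, h: int = 14, w: int = 14):
        super().__init__()
        # Complex filter stored as (real, imag) pairs; one value per frequency and channel.
        self.weight = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        X = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")      # to the frequency domain
        X = X * torch.view_as_complex(self.weight)            # global, per-frequency filtering
        return torch.fft.irfft2(X, s=x.shape[1:3], dim=(1, 2), norm="ortho")

# Usage: y = GlobalFilter(dim=64)(torch.randn(2, 14, 14, 64))  # y has shape (2, 14, 14, 64)
```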