Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition
- URL: http://arxiv.org/abs/2211.11943v1
- Date: Tue, 22 Nov 2022 01:39:45 GMT
- Title: Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition
- Authors: Qibin Hou, Cheng-Ze Lu, Ming-Ming Cheng, Jiashi Feng
- Abstract summary: This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of the recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation.
- Score: 158.15602882426379
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper does not attempt to design a state-of-the-art method for visual
recognition but investigates a more efficient way to make use of convolutions
to encode spatial features. By comparing the design principles of the recent
convolutional neural networks (ConvNets) and Vision Transformers, we propose to
simplify the self-attention by leveraging a convolutional modulation operation.
We show that such a simple approach can better take advantage of the large
kernels (>=7x7) nested in convolutional layers. We build a family of
hierarchical ConvNets using the proposed convolutional modulation, termed
Conv2Former. Our network is simple and easy to follow. Experiments show that
our Conv2Former outperforms existing popular ConvNets and vision Transformers,
like Swin Transformer and ConvNeXt, in all ImageNet classification, COCO object
detection and ADE20k semantic segmentation.
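Below is a minimal PyTorch-style sketch of the convolutional modulation idea described in the abstract: a large-kernel depthwise convolution produces spatial weights that modulate (via Hadamard product) a linear value projection, replacing the softmax attention map of self-attention. The layer names, the 11x11 kernel size, and the normalization/residual placement are illustrative assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ConvMod(nn.Module):
    """Convolutional modulation: large-kernel depthwise conv weights modulate a value branch."""
    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.a = nn.Conv2d(dim, dim, kernel_size=1)          # produces the modulation branch
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2,
                                groups=dim)                   # large-kernel depthwise conv
        self.v = nn.Conv2d(dim, dim, kernel_size=1)           # value projection
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)        # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, C, H, W)
        shortcut = x
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LayerNorm over channels
        a = self.dwconv(self.a(x))     # spatial modulation weights A
        v = self.v(x)                  # values V
        x = self.proj(a * v)           # Hadamard product replaces the attention matrix
        return x + shortcut            # residual connection


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)
    print(ConvMod(64)(x).shape)        # torch.Size([1, 64, 56, 56])
```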
Related papers
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z)
- Are Large Kernels Better Teachers than Transformers for ConvNets? [82.4742785108714]
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets.
arXiv Detail & Related papers (2023-05-30T21:05:23Z)
- Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs [35.39701561076837]
We propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC.
Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform.
Experimental results show that our ParC operator can effectively enlarge the receptive field of traditional ConvNets.
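The O(n log n) claim rests on the convolution theorem: a circular convolution in the spatial domain equals an element-wise product in the Fourier domain. The NumPy snippet below checks that general fact only; it is not the ParC/Fast-ParC operator itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)   # 1-D signal
k = rng.standard_normal(n)   # circular kernel of the same length

# Direct circular convolution: O(n^2)
direct = np.array([sum(x[(i - j) % n] * k[j] for j in range(n)) for i in range(n)])

# FFT-based circular convolution: O(n log n) by the convolution theorem
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

print(np.allclose(direct, via_fft))  # True
```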
arXiv Detail & Related papers (2022-10-08T13:14:02Z)
- HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions [109.33112814212129]
We show that input-adaptive, long-range and high-order spatial interactions can be efficiently implemented with a convolution-based framework.
We present the Recursive Gated Convolution ($g^n$Conv) that performs high-order spatial interactions with gated convolutions.
Based on this operation, we construct a new family of generic vision backbones named HorNet.
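As a rough illustration of what "high-order" gated interactions mean, the sketch below recursively multiplies a projected feature with depthwise-convolved gates, so an order-n block composes n element-wise products. Channel widths, kernel size, and layer names are simplifying assumptions; this is not the HorNet reference implementation of $g^n$Conv.

```python
import torch
import torch.nn as nn

class RecursiveGate(nn.Module):
    """Order-n gated interactions via repeated element-wise products (illustrative)."""
    def __init__(self, dim: int, order: int = 3, kernel_size: int = 7):
        super().__init__()
        self.dim, self.order = dim, order
        self.proj_in = nn.Conv2d(dim, dim * (order + 1), kernel_size=1)
        self.dwconv = nn.Conv2d(dim * order, dim * order, kernel_size,
                                padding=kernel_size // 2, groups=dim * order)
        self.mix = nn.ModuleList(nn.Conv2d(dim, dim, kernel_size=1)
                                 for _ in range(order - 1))
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        p, rest = torch.split(self.proj_in(x), [self.dim, self.dim * self.order], dim=1)
        gates = torch.chunk(self.dwconv(rest), self.order, dim=1)  # spatial gates
        p = p * gates[0]                      # 1st-order interaction
        for i, mix in enumerate(self.mix):
            p = mix(p) * gates[i + 1]         # each step adds one interaction order
        return self.proj_out(p)


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    print(RecursiveGate(64, order=3)(x).shape)  # torch.Size([1, 64, 32, 32])
```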
arXiv Detail & Related papers (2022-07-28T17:59:02Z)
- A ConvNet for the 2020s [94.89735578018099]
Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It is the hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
arXiv Detail & Related papers (2022-01-10T18:59:10Z)
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves the Vision Transformer (ViT) in both performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences of its use.