Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs
- URL: http://arxiv.org/abs/2210.04020v2
- Date: Thu, 30 Nov 2023 13:05:34 GMT
- Title: Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs
- Authors: Tao Yang, Haokui Zhang, Wenze Hu, Changwen Chen, Xiaoyu Wang
- Abstract summary: We propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC.
Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform.
Experiment results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets.
- Score: 35.39701561076837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models have made tremendous progress in various fields in recent
years. In the field of computer vision, vision transformers (ViTs) have also become
strong alternatives to convolutional neural networks (ConvNets), yet they have
not been able to replace ConvNets since both have their own merits. For
instance, ViTs are good at extracting global features with attention mechanisms
while ConvNets are more efficient in modeling local relationships due to their
strong inductive bias. A natural idea that arises is to combine the strengths
of both ConvNets and ViTs to design new structures. In this paper, we propose a
new basic neural network operator named position-aware circular convolution
(ParC) and its accelerated version Fast-ParC. The ParC operator can capture
global features by using a global kernel and circular convolution while keeping
location sensitivity by employing position embeddings. Our Fast-ParC further
reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier
Transform. This acceleration makes it possible to use global convolution in the
early stages of models with large feature maps, while keeping the overall
computational cost comparable to that of 3x3 or 7x7 kernels. The proposed
operation can be used in a plug-and-play manner to 1) convert ViTs into a
pure-ConvNet architecture that enjoys wider hardware support and achieves higher
inference speed, or 2) replace traditional convolutions in the deep stages of
ConvNets to improve accuracy by enlarging the effective receptive field.
Experiment results show that our ParC op can effectively enlarge the receptive
field of traditional ConvNets, and that adopting the proposed op benefits both
ViTs and ConvNet models on all three popular vision tasks: image
classification, object detection, and semantic segmentation.
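To make the relationship between circular convolution and the FFT concrete, here is a minimal 1D NumPy sketch of the idea (not the authors' implementation; the function names, the additive use of the position embedding, and the 1D setting are illustrative assumptions). The naive loop realizes the O(n^2) circular convolution, while the FFT route computes the same result in O(n log n) via the convolution theorem.

```python
# Minimal 1D sketch: position-aware circular convolution and its FFT form.
# Illustrative only; names and the additive position embedding are assumptions.
import numpy as np

def parc_1d(x, kernel, pos_embed):
    """Naive circular convolution along one axis: O(n^2)."""
    n = x.shape[0]
    x = x + pos_embed                              # inject absolute position information
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            out[i] += kernel[j] * x[(i - j) % n]   # indices wrap around (circular)
    return out

def fast_parc_1d(x, kernel, pos_embed):
    """Same operator via the convolution theorem: O(n log n)."""
    x = x + pos_embed
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel)))

rng = np.random.default_rng(0)
n = 56                                             # e.g. one spatial extent of an early-stage feature map
x, k, pe = rng.normal(size=(3, n))
assert np.allclose(parc_1d(x, k, pe), fast_parc_1d(x, k, pe))
```

Because the kernel spans the full axis, every output position aggregates information from the whole input, which is what enlarges the effective receptive field.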
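For the plug-and-play use described above, a drop-in 2D block could look like the following PyTorch-style sketch, which applies depthwise circular convolutions with full-height and full-width kernels on top of an additive position embedding. The class name ParC2d, the initialization, and the padding scheme are assumptions made for illustration rather than the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParC2d(nn.Module):
    """Depthwise circular convolution with global kernels along H and W,
    plus an additive position embedding (shapes and names are assumptions)."""

    def __init__(self, channels, h, w):
        super().__init__()
        self.kernel_h = nn.Parameter(torch.randn(channels, 1, h, 1) * 0.02)  # full-height kernel
        self.kernel_w = nn.Parameter(torch.randn(channels, 1, 1, w) * 0.02)  # full-width kernel
        self.pos = nn.Parameter(torch.zeros(1, channels, h, w))              # position embedding

    def forward(self, x):                        # x: (B, C, H, W), H/W fixed at init
        x = x + self.pos                         # keep location sensitivity
        c, h, w = x.size(1), x.size(2), x.size(3)
        # circular padding along H, then a depthwise conv whose kernel spans the full height
        xh = F.pad(x, (0, 0, 0, h - 1), mode="circular")
        xh = F.conv2d(xh, self.kernel_h, groups=c)
        # circular padding along W, then a depthwise conv whose kernel spans the full width
        xw = F.pad(x, (0, w - 1, 0, 0), mode="circular")
        xw = F.conv2d(xw, self.kernel_w, groups=c)
        return xh + xw                           # each output position sees the whole feature map
```

A module like this could, in principle, stand in for a 3x3 or 7x7 depthwise convolution in a deep stage where the spatial resolution is fixed, e.g. ParC2d(channels=256, h=14, w=14) applied to a (B, 256, 14, 14) feature map.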
Related papers
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z) - Are Large Kernels Better Teachers than Transformers for ConvNets? [82.4742785108714]
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): serving as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets.
arXiv Detail & Related papers (2023-05-30T21:05:23Z) - Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of the recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify the self-attention by leveraging a convolutional modulation operation.
arXiv Detail & Related papers (2022-11-22T01:39:45Z) - MogaNet: Multi-order Gated Aggregation Network [64.16774341908365]
We propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning.
MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module.
MogaNet exhibits great scalability, impressive parameter efficiency, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet.
arXiv Detail & Related papers (2022-11-07T04:31:17Z) - VidConv: A modernized 2D ConvNet for Efficient Video Recognition [0.8070014188337304]
Vision Transformers (ViT) have been steadily breaking the record for many vision tasks.
ViTs are generally computationally expensive, memory-consuming, and unfriendly to embedded devices.
In this paper, we adopt the modernized structure of ConvNet to design a new backbone for action recognition.
arXiv Detail & Related papers (2022-07-08T09:33:46Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers [29.09883780571206]
We propose EdgeFormer, a pure ConvNet based backbone model.
We combine global circular convolution (GCC) with position embeddings, forming a light-weight convolution op.
Experiment results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer based models.
arXiv Detail & Related papers (2022-03-08T09:25:17Z) - LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [25.63398340113755]
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime.
We introduce the attention bias, a new way to integrate positional information in vision transformers.
Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff.
arXiv Detail & Related papers (2021-04-02T16:29:57Z)