Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs
- URL: http://arxiv.org/abs/2210.04020v2
- Date: Thu, 30 Nov 2023 13:05:34 GMT
- Title: Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs
- Authors: Tao Yang, Haokui Zhang, Wenze Hu, Changwen Chen, Xiaoyu Wang
- Abstract summary: We propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC.
Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform.
Experiment results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets.
- Score: 35.39701561076837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models have made tremendous progress in various fields in recent
years. In the field of computer vision, vision transformers (ViTs) have also become
strong alternatives to convolutional neural networks (ConvNets), yet they have
not been able to replace ConvNets since both have their own merits. For
instance, ViTs are good at extracting global features with attention mechanisms
while ConvNets are more efficient in modeling local relationships due to their
strong inductive bias. A natural idea that arises is to combine the strengths
of both ConvNets and ViTs to design new structures. In this paper, we propose a
new basic neural network operator named position-aware circular convolution
(ParC) and its accelerated version Fast-ParC. The ParC operator can capture
global features by using a global kernel and circular convolution while keeping
location sensitivity by employing position embeddings. Our Fast-ParC further
reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier
Transform. This acceleration makes it possible to use global convolution in the
early stages of models with large feature maps, while keeping the overall
computational cost comparable to that of 3x3 or 7x7 kernels. The proposed
operation can be used in a plug-and-play manner to 1) convert ViTs into a
pure-ConvNet architecture that enjoys wider hardware support and achieves higher
inference speed, or 2) replace traditional convolutions in the deep stages of
ConvNets to improve accuracy by enlarging the effective receptive field.
Experiment results show that our ParC op can effectively enlarge the receptive
field of traditional ConvNets, and that adopting the proposed op benefits both
ViTs and ConvNet models on all three popular vision tasks: image
classification, object detection, and semantic segmentation.
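To make the relationship between circular convolution and the FFT concrete, here is a minimal 1D NumPy sketch of the idea (not the authors' implementation; the function names, the additive use of the position embedding, and the 1D setting are illustrative assumptions). The naive loop realizes the O(n^2) circular convolution, while the FFT route computes the same result in O(n log n) via the convolution theorem.

```python
# Minimal 1D sketch: position-aware circular convolution and its FFT form.
# Illustrative only; names and the additive position embedding are assumptions.
import numpy as np

def parc_1d(x, kernel, pos_embed):
    """Naive circular convolution along one axis: O(n^2)."""
    n = x.shape[0]
    x = x + pos_embed                              # inject absolute position information
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            out[i] += kernel[j] * x[(i - j) % n]   # indices wrap around (circular)
    return out

def fast_parc_1d(x, kernel, pos_embed):
    """Same operator via the convolution theorem: O(n log n)."""
    x = x + pos_embed
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel)))

rng = np.random.default_rng(0)
n = 56                                             # e.g. one spatial extent of an early-stage feature map
x, k, pe = rng.normal(size=(3, n))
assert np.allclose(parc_1d(x, k, pe), fast_parc_1d(x, k, pe))
```

Because the kernel spans the full axis, every output position aggregates information from the whole input, which is what enlarges the effective receptive field.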
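For the plug-and-play use described above, a drop-in 2D block could look like the following PyTorch-style sketch, which applies depthwise circular convolutions with full-height and full-width kernels on top of an additive position embedding. The class name ParC2d, the initialization, and the padding scheme are assumptions made for illustration rather than the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParC2d(nn.Module):
    """Depthwise circular convolution with global kernels along H and W,
    plus an additive position embedding (shapes and names are assumptions)."""

    def __init__(self, channels, h, w):
        super().__init__()
        self.kernel_h = nn.Parameter(torch.randn(channels, 1, h, 1) * 0.02)  # full-height kernel
        self.kernel_w = nn.Parameter(torch.randn(channels, 1, 1, w) * 0.02)  # full-width kernel
        self.pos = nn.Parameter(torch.zeros(1, channels, h, w))              # position embedding

    def forward(self, x):                        # x: (B, C, H, W), H/W fixed at init
        x = x + self.pos                         # keep location sensitivity
        c, h, w = x.size(1), x.size(2), x.size(3)
        # circular padding along H, then a depthwise conv whose kernel spans the full height
        xh = F.pad(x, (0, 0, 0, h - 1), mode="circular")
        xh = F.conv2d(xh, self.kernel_h, groups=c)
        # circular padding along W, then a depthwise conv whose kernel spans the full width
        xw = F.pad(x, (0, w - 1, 0, 0), mode="circular")
        xw = F.conv2d(xw, self.kernel_w, groups=c)
        return xh + xw                           # each output position sees the whole feature map
```

A module like this could, in principle, stand in for a 3x3 or 7x7 depthwise convolution in a deep stage where the spatial resolution is fixed, e.g. ParC2d(channels=256, h=14, w=14) applied to a (B, 256, 14, 14) feature map.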
Related papers
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z) - Are Large Kernels Better Teachers than Transformers for ConvNets? [82.4742785108714]
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): serving as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets.
arXiv Detail & Related papers (2023-05-30T21:05:23Z) - Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of the recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify the self-attention by leveraging a convolutional modulation operation.
arXiv Detail & Related papers (2022-11-22T01:39:45Z) - MogaNet: Multi-order Gated Aggregation Network [64.16774341908365]
We propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning.
MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module.
MogaNet exhibits great scalability, impressive parameter efficiency, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet.
arXiv Detail & Related papers (2022-11-07T04:31:17Z) - VidConv: A modernized 2D ConvNet for Efficient Video Recognition [0.8070014188337304]
Vision Transformers (ViT) have been steadily breaking the record for many vision tasks.
ViTs are generally computationally expensive, memory-consuming, and unfriendly to embedded devices.
In this paper, we adopt the modernized structure of ConvNet to design a new backbone for action recognition.
arXiv Detail & Related papers (2022-07-08T09:33:46Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers [29.09883780571206]
We propose EdgeFormer, a pure ConvNet based backbone model.
We combine global circular convolution (GCC) with position embeddings, forming a light-weight convolution op.
Experiment results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer based models.
arXiv Detail & Related papers (2022-03-08T09:25:17Z) - LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [25.63398340113755]
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime.
We introduce the attention bias, a new way to integrate positional information in vision transformers.
Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff.
arXiv Detail & Related papers (2021-04-02T16:29:57Z)