ParCNetV2: Oversized Kernel with Enhanced Attention
- URL: http://arxiv.org/abs/2211.07157v1
- Date: Mon, 14 Nov 2022 07:22:55 GMT
- Title: ParCNetV2: Oversized Kernel with Enhanced Attention
- Authors: Ruihan Xu, Haokui Zhang, Wenze Hu, Shiliang Zhang, Xiaoyu Wang
- Abstract summary: We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
- Score: 60.141606180434195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have achieved tremendous success in various computer vision
tasks. By borrowing design concepts from transformers, many studies
revolutionized CNNs and showed remarkable results. This paper falls in this
line of studies. More specifically, we introduce a convolutional neural network
architecture named ParCNetV2, which extends position-aware circular convolution
(ParCNet) with oversized convolutions and strengthens attention through
bifurcate gate units. The oversized convolution utilizes a kernel with
$2\times$ the input size to model long-range dependencies through a global
receptive field. Simultaneously, it achieves implicit positional encoding by
removing the shift-invariant property from convolutional kernels, i.e., the
effective kernels at different spatial locations are different when the kernel
size is twice as large as the input size. The bifurcate gate unit implements an
attention mechanism similar to self-attention in transformers. It splits the
input into two branches: one serves as the feature transformation, while the
other provides the attention weights. The attention is applied through
element-wise multiplication of the two branches. In addition, we introduce a
unified
local-global convolution block to unify the design of the early and late stage
convolutional blocks. Extensive experiments demonstrate that our method
outperforms other pure convolutional neural networks as well as neural networks
hybridizing CNNs and transformers.
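Below is a minimal PyTorch sketch of the two ideas the abstract describes. It assumes a fixed 56 x 56 input resolution, and a single 2D depthwise convolution stands in for the paper's one-dimensional position-aware circular convolutions; the module names, layer choices, and the sigmoid gate are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class OversizedConv2d(nn.Module):
    """Depthwise convolution with a kernel roughly twice the input size.

    With kernel 2*S - 1 and padding S - 1 on an S x S input, every output
    position sees the entire input (a global receptive field), and different
    positions use different parts of the kernel, which acts as an implicit
    positional encoding.
    """
    def __init__(self, channels: int, input_size: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels,
                            kernel_size=2 * input_size - 1,
                            padding=input_size - 1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dw(x)

class BifurcateGateUnit(nn.Module):
    """Split channels into two branches: one is the feature transform,
    the other produces attention weights; fuse by elementwise product."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.transform = nn.Conv2d(half, half, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(half, half, kernel_size=1),
                                  nn.Sigmoid())  # attention weights in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)                  # bifurcate the input
        return self.transform(a) * self.gate(b)  # attention by multiplication

x = torch.randn(1, 64, 56, 56)                   # fixed 56 x 56 resolution
y = BifurcateGateUnit(64)(OversizedConv2d(64, 56)(x))
print(y.shape)                                   # torch.Size([1, 32, 56, 56])
```

The sigmoid on the gating branch is one plausible choice for producing attention weights; the essential point is the elementwise product of a transform branch and an attention branch.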
Related papers
- Shift-ConvNets: Small Convolutional Kernel with Large Kernel Effects [8.933264104073832]
Small convolutional kernels and convolution operations can achieve effects close to those of large kernels.
We propose a shift-wise operator that enables CNNs to capture long-range dependencies with the help of a sparse mechanism.
On ImageNet-1k, our shift-wise enhanced CNN model outperforms state-of-the-art models.
arXiv Detail & Related papers (2024-01-23T13:13:45Z) - ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical
Image Segmentation [10.727162449071155]
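As a loose illustration of the shift idea in the Shift-ConvNets entry above, the sketch below rolls channel groups by different horizontal offsets before a small convolution, so a 3x3 kernel effectively reaches distant pixels. The offsets, grouping, and use of circular torch.roll are assumptions for exposition; the paper's sparse shift-wise operator differs in detail.

```python
import torch
import torch.nn as nn

class ShiftedSmallConv(nn.Module):
    def __init__(self, channels: int, max_shift: int = 3):
        super().__init__()
        # Each channel group is rolled by a different horizontal offset.
        self.shifts = [s - max_shift for s in range(2 * max_shift + 1)]
        assert channels % len(self.shifts) == 0
        self.mix = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = x.chunk(len(self.shifts), dim=1)
        shifted = [torch.roll(g, shifts=s, dims=3)   # shift along width
                   for g, s in zip(groups, self.shifts)]
        # After shifting, the 3x3 conv mixes pixels up to max_shift + 1 away.
        return self.mix(torch.cat(shifted, dim=1))

out = ShiftedSmallConv(28)(torch.randn(1, 28, 32, 32))
print(out.shape)  # torch.Size([1, 28, 32, 32])
```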
- ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation [10.727162449071155]
We build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance.
In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction.
arXiv Detail & Related papers (2023-09-09T02:18:17Z) - Omni-Dimensional Dynamic Convolution [25.78940854339179]
Learning a single static convolutional kernel in each convolutional layer is the common training paradigm of modern Convolutional Neural Networks (CNNs).
Recent research in dynamic convolution shows that learning a linear combination of $n$ convolutional kernels weighted with their input-dependent attentions can significantly improve the accuracy of light-weight CNNs.
We present Omni-dimensional Dynamic Convolution (ODConv), a more generalized yet elegant dynamic convolution design.
arXiv Detail & Related papers (2022-09-16T14:05:38Z) - Adaptive Split-Fusion Transformer [90.04885335911729]
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN, transformer counterparts, and hybrid pilots in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z) - Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z) - X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operation.
The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z) - Hyper-Convolution Networks for Biomedical Image Segmentation [22.902923145462008]
The size of the convolution kernels determines both the expressiveness of convolutional neural networks (CNN) and the number of learnable parameters.
We propose a powerful novel building block, the hyper-convolution, which implicitly represents the convolution kernel as a function of kernel coordinates.
We demonstrate that replacing regular convolutions with hyper-convolutions leads to more efficient architectures that achieve improved accuracy.
arXiv Detail & Related papers (2021-05-21T20:31:08Z) - LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
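The LocalViT entry above injects locality by placing a depthwise convolution inside the transformer feed-forward network; a minimal sketch of such a convolutional FFN follows, with the expansion ratio and activation as illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, 1)           # pointwise expand
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc2 = nn.Conv2d(hidden, dim, 1)           # pointwise project
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, C, H, W)
        # Tokens stay on the 2D grid so the depthwise conv can mix
        # neighboring positions, mirroring an inverted residual block.
        return self.fc2(self.act(self.dw(self.act(self.fc1(x)))))

y = ConvFFN(64)(torch.randn(1, 64, 14, 14))
print(y.shape)  # torch.Size([1, 64, 14, 14])
```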
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.