ParCNetV2: Oversized Kernel with Enhanced Attention
- URL: http://arxiv.org/abs/2211.07157v1
- Date: Mon, 14 Nov 2022 07:22:55 GMT
- Title: ParCNetV2: Oversized Kernel with Enhanced Attention
- Authors: Ruihan Xu, Haokui Zhang, Wenze Hu, Shiliang Zhang, Xiaoyu Wang
- Abstract summary: We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
- Score: 60.141606180434195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have achieved tremendous success in various computer vision
tasks. By borrowing design concepts from transformers, many studies
revolutionized CNNs and showed remarkable results. This paper falls in this
line of studies. More specifically, we introduce a convolutional neural network
architecture named ParCNetV2, which extends position-aware circular convolution
(ParCNet) with oversized convolutions and strengthens attention through
bifurcate gate units. The oversized convolution utilizes a kernel with
$2\times$ the input size to model long-range dependencies through a global
receptive field. Simultaneously, it achieves implicit positional encoding by
removing the shift-invariant property from convolutional kernels, i.e., the
effective kernels at different spatial locations are different when the kernel
size is twice as large as the input size. The bifurcate gate unit implements an
attention mechanism similar to self-attention in transformers. It splits the
input into two branches: one serves as the feature transformation, while the
other provides the attention weights. The attention is applied through
element-wise multiplication of the two branches. In addition, we introduce a
unified
local-global convolution block to unify the design of the early and late stage
convolutional blocks. Extensive experiments demonstrate that our method
outperforms other pure convolutional neural networks as well as neural networks
hybridizing CNNs and transformers.
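Below is a minimal PyTorch sketch of the two ideas the abstract describes. It assumes a fixed 56 x 56 input resolution, and a single 2D depthwise convolution stands in for the paper's one-dimensional position-aware circular convolutions; the module names, layer choices, and the sigmoid gate are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class OversizedConv2d(nn.Module):
    """Depthwise convolution with a kernel roughly twice the input size.

    With kernel 2*S - 1 and padding S - 1 on an S x S input, every output
    position sees the entire input (a global receptive field), and different
    positions use different parts of the kernel, which acts as an implicit
    positional encoding.
    """
    def __init__(self, channels: int, input_size: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels,
                            kernel_size=2 * input_size - 1,
                            padding=input_size - 1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dw(x)

class BifurcateGateUnit(nn.Module):
    """Split channels into two branches: one is the feature transform,
    the other produces attention weights; fuse by elementwise product."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.transform = nn.Conv2d(half, half, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(half, half, kernel_size=1),
                                  nn.Sigmoid())  # attention weights in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)                  # bifurcate the input
        return self.transform(a) * self.gate(b)  # attention by multiplication

x = torch.randn(1, 64, 56, 56)                   # fixed 56 x 56 resolution
y = BifurcateGateUnit(64)(OversizedConv2d(64, 56)(x))
print(y.shape)                                   # torch.Size([1, 32, 56, 56])
```

The sigmoid on the gating branch is one plausible choice for producing attention weights; the essential point is the elementwise product of a transform branch and an attention branch.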
Related papers
- Shift-ConvNets: Small Convolutional Kernel with Large Kernel Effects [8.933264104073832]
Small convolutional kernels and convolution operations can achieve effects close to those of large kernels.
We propose a shift-wise operator that enables CNNs to capture long-range dependencies with the help of a sparse mechanism.
On ImageNet-1k, our shift-wise enhanced CNN model outperforms state-of-the-art models.
arXiv Detail & Related papers (2024-01-23T13:13:45Z) - ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical
Image Segmentation [10.727162449071155]
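As a loose illustration of the shift idea in the Shift-ConvNets entry above, the sketch below rolls channel groups by different horizontal offsets before a small convolution, so a 3x3 kernel effectively reaches distant pixels. The offsets, grouping, and use of circular torch.roll are assumptions for exposition; the paper's sparse shift-wise operator differs in detail.

```python
import torch
import torch.nn as nn

class ShiftedSmallConv(nn.Module):
    def __init__(self, channels: int, max_shift: int = 3):
        super().__init__()
        # Each channel group is rolled by a different horizontal offset.
        self.shifts = [s - max_shift for s in range(2 * max_shift + 1)]
        assert channels % len(self.shifts) == 0
        self.mix = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = x.chunk(len(self.shifts), dim=1)
        shifted = [torch.roll(g, shifts=s, dims=3)   # shift along width
                   for g, s in zip(groups, self.shifts)]
        # After shifting, the 3x3 conv mixes pixels up to max_shift + 1 away.
        return self.mix(torch.cat(shifted, dim=1))

out = ShiftedSmallConv(28)(torch.randn(1, 28, 32, 32))
print(out.shape)  # torch.Size([1, 28, 32, 32])
```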
- ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation [10.727162449071155]
We build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance.
In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction.
arXiv Detail & Related papers (2023-09-09T02:18:17Z) - Omni-Dimensional Dynamic Convolution [25.78940854339179]
Learning a single static convolutional kernel in each convolutional layer is the common training paradigm of modern Convolutional Neural Networks (CNNs).
Recent research in dynamic convolution shows that learning a linear combination of $n$ convolutional kernels weighted with their input-dependent attentions can significantly improve the accuracy of light-weight CNNs.
We present Omni-dimensional Dynamic Convolution (ODConv), a more generalized yet elegant dynamic convolution design.
arXiv Detail & Related papers (2022-09-16T14:05:38Z) - Adaptive Split-Fusion Transformer [90.04885335911729]
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN, transformer counterparts, and hybrid pilots in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z) - Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z) - X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operation.
The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z) - Hyper-Convolution Networks for Biomedical Image Segmentation [22.902923145462008]
The size of the convolution kernels determines both the expressiveness of convolutional neural networks (CNN) and the number of learnable parameters.
We propose a powerful novel building block, the hyper-convolution, which implicitly represents the convolution kernel as a function of kernel coordinates.
We demonstrate that replacing regular convolutions with hyper-convolutions leads to more efficient architectures that achieve improved accuracy.
arXiv Detail & Related papers (2021-05-21T20:31:08Z) - LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
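The LocalViT entry above injects locality by placing a depthwise convolution inside the transformer feed-forward network; a minimal sketch of such a convolutional FFN follows, with the expansion ratio and activation as illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, 1)           # pointwise expand
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc2 = nn.Conv2d(hidden, dim, 1)           # pointwise project
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, C, H, W)
        # Tokens stay on the 2D grid so the depthwise conv can mix
        # neighboring positions, mirroring an inverted residual block.
        return self.fc2(self.act(self.dw(self.act(self.fc1(x)))))

y = ConvFFN(64)(torch.randn(1, 64, 14, 14))
print(y.shape)  # torch.Size([1, 64, 14, 14])
```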
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.