Visual Attention Network
- URL: http://arxiv.org/abs/2202.09741v1
- Date: Sun, 20 Feb 2022 06:35:18 GMT
- Title: Visual Attention Network
- Authors: Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng and Shi-Min Hu
- Abstract summary: We propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention.
We also introduce Visual Attention Network (VAN), a novel neural network built on LKA.
VAN outperforms state-of-the-art vision transformers and convolutional neural networks by a large margin in extensive experiments.
- Score: 90.0753726786985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While originally designed for natural language processing (NLP) tasks, the
self-attention mechanism has recently taken various computer vision areas by
storm. However, the 2D nature of images brings three challenges for applying
self-attention in computer vision. (1) Treating images as 1D sequences neglects
their 2D structures. (2) The quadratic complexity is too expensive for
high-resolution images. (3) It only captures spatial adaptability but ignores
channel adaptability. In this paper, we propose a novel large kernel attention
(LKA) module to enable self-adaptive and long-range correlations in
self-attention while avoiding the above issues. We further introduce a novel
neural network based on LKA, namely Visual Attention Network (VAN). While
extremely simple and efficient, VAN outperforms state-of-the-art vision
transformers and convolutional neural networks by a large margin in extensive
experiments, including image classification, object detection, semantic
segmentation, instance segmentation, etc. Code is available at
https://github.com/Visual-Attention-Network.
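The abstract leaves the module's structure implicit; the following is a minimal PyTorch sketch of the LKA idea, with kernel sizes (a 5x5 depth-wise conv, a 7x7 depth-wise conv with dilation 3, and a 1x1 point-wise conv, approximating a 21x21 receptive field) taken from the authors' reference implementation. Treat it as an illustrative sketch rather than the released code.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Sketch of Large Kernel Attention: a large-kernel convolution is
    decomposed into depth-wise, depth-wise dilated, and point-wise parts,
    and the result gates the input feature map element-wise."""
    def __init__(self, dim):
        super().__init__()
        # 5x5 depth-wise conv: local spatial context on the 2D grid.
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # 7x7 depth-wise conv with dilation 3: extends the receptive field
        # to roughly 21x21 while keeping the cost linear in image size.
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        # 1x1 point-wise conv: mixes channels.
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        # The element-wise product adapts per location *and* per channel,
        # covering both spatial and channel adaptability.
        return x * attn

x = torch.randn(1, 64, 56, 56)
assert LKA(64)(x).shape == x.shape
```

Note how the sketch maps onto the three challenges: the convolutions operate on the 2D grid directly (1), their cost is linear in resolution rather than quadratic (2), and the multiplicative gate varies across both space and channels (3).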
Related papers
- Learning 1D Causal Visual Representation with De-focus Attention Networks [108.72931590504406]
This paper explores the feasibility of representing images using 1D causal modeling.
We propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns.
arXiv Detail & Related papers (2024-06-06T17:59:56Z)
- ELA: Efficient Local Attention for Deep Convolutional Neural Networks [15.976475674061287]
This paper introduces an Efficient Local Attention (ELA) method that achieves substantial performance improvements with a simple structure.
To achieve this, the method incorporates 1D convolution and Group Normalization as feature-enhancement techniques.
ELA can be seamlessly integrated into deep CNN networks such as ResNet, MobileNet, and DeepLab.
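The summary names only the ingredients, so the sketch below is a hypothetical composition in the spirit of coordinate-attention-style designs: strip-pool along each spatial axis, refine each strip with a depth-wise 1D convolution and Group Normalization, and gate the input with the resulting per-axis weights. The kernel size, group count, and the sharing of one conv across both axes are assumptions, not necessarily ELA's exact design.

```python
import torch
import torch.nn as nn

class EfficientLocalAttentionSketch(nn.Module):
    # Hypothetical sketch of an ELA-style block (names and hyper-parameters
    # are assumptions): per-axis strip pooling + 1D conv + GroupNorm gating.
    def __init__(self, channels, kernel_size=7, groups=16):
        super().__init__()
        pad = kernel_size // 2
        # Depth-wise 1D conv refines each channel's strip profile.
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels)
        # GroupNorm assumes `channels` is divisible by `groups`.
        self.norm = nn.GroupNorm(groups, channels)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        xh = x.mean(dim=3)  # (B, C, H): pool over width
        xw = x.mean(dim=2)  # (B, C, W): pool over height
        ah = self.gate(self.norm(self.conv(xh))).view(b, c, h, 1)
        aw = self.gate(self.norm(self.conv(xw))).view(b, c, 1, w)
        return x * ah * aw  # per-axis local attention weights

x = torch.randn(2, 64, 32, 48)
assert EfficientLocalAttentionSketch(64)(x).shape == x.shape
```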
arXiv Detail & Related papers (2024-03-02T08:06:18Z)
- TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for Medical Image Segmentation [20.976167468217387]
We propose TEC-Net, a vision Transformer that embraces convolutional neural networks for medical image segmentation.
Our network has two advantages. First, a dynamic deformable convolution (DDConv) is designed in the CNN branch; it not only overcomes the difficulty of adaptive feature extraction with fixed-size convolution kernels, but also removes the limitation that different inputs share the same convolution kernel parameters.
Experimental results show that the proposed TEC-Net yields better medical image segmentation results than state-of-the-art CNN and Transformer networks.
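The summary does not specify DDConv's formulation. As a hedged illustration of the "different inputs share the same kernel" limitation it targets, here is a minimal input-conditioned convolution in the style of dynamic convolution/CondConv: a lightweight router mixes K kernel experts per input, so each sample effectively convolves with its own kernel. The deformable half would additionally predict per-location sampling offsets (e.g., with torchvision.ops.deform_conv2d); everything named here is hypothetical rather than TEC-Net's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Hypothetical sketch of the 'dynamic' half of a DDConv-like layer:
    each input selects its own kernel as a softmax-weighted mixture of
    `num_experts` learned kernels."""
    def __init__(self, in_ch, out_ch, k=3, num_experts=4):
        super().__init__()
        self.k = k
        self.experts = nn.Parameter(torch.randn(num_experts, out_ch, in_ch, k, k) * 0.02)
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, num_experts)
        )

    def forward(self, x):
        b, c, h, w = x.shape
        coef = F.softmax(self.router(x), dim=1)                        # (B, K) mixing weights
        weight = torch.einsum('bk,koiuv->boiuv', coef, self.experts)   # per-sample kernels
        # Grouped-conv trick: fold the batch into channels so each sample
        # is convolved with its own kernel in a single conv2d call.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       weight.reshape(-1, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, -1, h, w)

x = torch.randn(2, 16, 24, 24)
assert DynamicConv2d(16, 32)(x).shape == (2, 32, 24, 24)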
arXiv Detail & Related papers (2023-06-07T01:14:16Z)
- Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations [92.88108411154255]
We present a method that improves dense 2D image feature extractors when they are applied to multiple images reconstructible as a 3D scene.
We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines.
arXiv Detail & Related papers (2022-09-07T23:24:09Z)
- Two-Stream Graph Convolutional Network for Intra-oral Scanner Image Segmentation [133.02190910009384]
We propose a two-stream graph convolutional network (i.e., TSGCN) to handle inter-view confusion between different raw attributes.
Our TSGCN significantly outperforms state-of-the-art methods in 3D tooth (surface) segmentation.
arXiv Detail & Related papers (2022-04-19T10:41:09Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
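The near-equivalence claim is easy to verify in its simplest instance, which is a standard fact rather than anything LIT-specific: a fully-connected layer applied independently to each patch token is exactly a 1x1 convolution over the feature map. A self-contained check (PyTorch, toy shapes chosen for illustration):

```python
import torch
import torch.nn as nn

b, c, h, w = 2, 64, 14, 14
x = torch.randn(b, c, h, w)

fc = nn.Linear(c, c)
conv = nn.Conv2d(c, c, kernel_size=1)
# Copy the FC weights into the 1x1 conv: same linear map, different layout.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(c, c, 1, 1))
    conv.bias.copy_(fc.bias)

tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C) patch sequence
out_fc = fc(tokens)                             # FC applied per token
out_conv = conv(x).flatten(2).transpose(1, 2)   # 1x1 conv, reshaped to tokens

assert torch.allclose(out_fc, out_conv, atol=1e-5)
```

This is the kind of interchangeability the paper leverages when deciding where attention can be replaced by cheaper layers.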
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- PC-RGNN: Point Cloud Completion and Graph Neural Network for 3D Object Detection [57.49788100647103]
LiDAR-based 3D object detection is an important task for autonomous driving.
Current approaches suffer from sparse and partial point clouds of distant and occluded objects.
In this paper, we propose PC-RGNN, a novel two-stage approach that addresses these challenges with two dedicated solutions.
arXiv Detail & Related papers (2020-12-18T18:06:43Z)
- Deep Learning of Unified Region, Edge, and Contour Models for Automated Image Segmentation [2.0305676256390934]
Convolutional neural networks (CNNs) have gained traction in the design of automated segmentation pipelines.
CNN-based models are adept at learning abstract features from raw image data, but their performance is dependent on the availability and size of suitable training datasets.
In this thesis, we devise novel methodologies that address these issues and establish robust representation learning frameworks for fully-automatic semantic segmentation.
arXiv Detail & Related papers (2020-06-23T02:54:55Z)
- Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences [32.01548991331616]
This paper presents a novel self-supervised learning approach to learn both 2D image features and 3D point cloud features.
It exploits cross-modality and cross-view correspondences without using any annotated human labels.
The effectiveness of the learned 2D and 3D features is evaluated by transferring them to five different tasks.
arXiv Detail & Related papers (2020-04-13T02:57:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.