Conformer: Local Features Coupling Global Representations for Visual
Recognition
- URL: http://arxiv.org/abs/2105.03889v1
- Date: Sun, 9 May 2021 10:00:03 GMT
- Title: Conformer: Local Features Coupling Global Representations for Visual
Recognition
- Authors: Zhiliang Peng, Wei Huang, Shanzhi Gu, Lingxi Xie, Yaowei Wang, Jianbin
Jiao, Qixiang Ye
- Abstract summary: We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning.
Experiments show that Conformer, with comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet.
- Score: 72.9550481476101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Within a Convolutional Neural Network (CNN), the convolution
operations are good at extracting local features but have difficulty capturing
global representations. Within a visual transformer, the cascaded
self-attention modules can capture long-distance feature dependencies but
unfortunately deteriorate local feature details. In this paper, we propose a
hybrid network structure, termed Conformer, to take advantage of convolutional
operations and self-attention mechanisms for enhanced representation learning.
Conformer is rooted in the Feature Coupling Unit (FCU), which fuses local
features and global representations under different resolutions in an
interactive fashion. Conformer adopts a concurrent structure so that local
features and global representations are retained to the maximum extent.
Experiments show that Conformer, with comparable parameter complexity,
outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it
outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance
segmentation, respectively, demonstrating its great potential as a general
backbone network. Code is available at
https://github.com/pengzhiliang/Conformer.
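The abstract describes the FCU only at a high level. The following PyTorch sketch shows one plausible reading of such a coupling unit; the 1x1-convolution channel projections, average-pool down-sampling, bilinear up-sampling, and class-token handling are all assumptions for illustration, not the authors' exact implementation (see the linked repository for the official code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureCouplingUnit(nn.Module):
    """Illustrative FCU-style block: fuses a CNN feature map (B, C, H, W)
    with transformer tokens (B, 1 + (H/s)*(W/s), D) in both directions.
    Every design choice here is an assumption, not the Conformer code."""

    def __init__(self, cnn_channels: int, embed_dim: int, stride: int = 4):
        super().__init__()
        self.stride = stride  # assumes H and W are divisible by stride
        self.cnn_to_tok = nn.Conv2d(cnn_channels, embed_dim, kernel_size=1)
        self.tok_to_cnn = nn.Conv2d(embed_dim, cnn_channels, kernel_size=1)
        self.tok_norm = nn.LayerNorm(embed_dim)
        self.cnn_norm = nn.BatchNorm2d(cnn_channels)

    def couple_to_tokens(self, x: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # Project the feature map, down-sample it to the token grid, flatten.
        p = F.avg_pool2d(self.cnn_to_tok(x), self.stride)  # (B, D, H/s, W/s)
        p = p.flatten(2).transpose(1, 2)                   # (B, N-1, D)
        fused = tokens.clone()
        fused[:, 1:] = tokens[:, 1:] + self.tok_norm(p)    # skip class token
        return fused

    def couple_to_cnn(self, tokens: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        B, N, D = tokens.shape
        h = w = int((N - 1) ** 0.5)                        # assumes square grid
        g = tokens[:, 1:].transpose(1, 2).reshape(B, D, h, w)
        # Up-sample the token grid back to feature-map resolution and add.
        g = F.interpolate(self.tok_to_cnn(g), size=x.shape[2:],
                          mode="bilinear", align_corners=False)
        return x + self.cnn_norm(g)                        # residual fusion
```

In the concurrent design the abstract describes, a unit like this would sit between the CNN block and the transformer block at each stage, so the two branches exchange local and global information continuously rather than fusing only once at the end.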
Related papers
- Double-Shot 3D Shape Measurement with a Dual-Branch Network [14.749887303860717]
We propose a dual-branch Convolutional Neural Network (CNN)-Transformer network (PDCNet) to process different structured light (SL) modalities.
Within PDCNet, a Transformer branch is used to capture global perception in the fringe images, while a CNN branch is designed to collect local details in the speckle images.
We show that our method can reduce fringe order ambiguity while producing high-accuracy results on a self-made dataset.
arXiv Detail & Related papers (2024-07-19T10:49:26Z)
- CMTNet: Convolutional Meets Transformer Network for Hyperspectral Images Classification [3.821081081400729]
Current convolutional neural networks (CNNs) focus on local features in hyperspectral data.
The Transformer framework excels at extracting global features from hyperspectral imagery.
This research introduces the Convolutional Meets Transformer Network (CMTNet).
arXiv Detail & Related papers (2024-06-20T07:56:51Z)
- ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
- APPT: Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding [20.87092793669536]
Transformer-based networks have achieved impressive performance in 3D point cloud understanding.
To tackle the limitations of such networks, we propose the Asymmetric Parallel Point Transformer (APPT).
APPT is able to capture features globally throughout the entire network while focusing on local detail features.
arXiv Detail & Related papers (2023-03-31T06:11:02Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on three widely-used fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into the transformer (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network (a minimal sketch follows this entry).
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
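The depth-wise-convolution idea in the LocalViT summary above is concrete enough to sketch. This minimal PyTorch module only illustrates that idea; the 3x3 kernel, GELU activations, and layer names are assumptions rather than the LocalViT implementation.

```python
import torch
import torch.nn as nn


class LocalityFeedForward(nn.Module):
    """Sketch of a transformer feed-forward network with a depth-wise
    convolution between its two linear layers, so each token also mixes
    with its spatial neighbours. Details here are illustrative."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # Depth-wise 3x3 conv: groups == channels, so it acts per channel,
        # adding locality at negligible parameter cost.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, C) patch tokens with N == h * w (the class token
        # is assumed to be handled separately by the caller).
        x = self.act(self.fc1(tokens))             # (B, N, hidden)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, h, w)  # tokens -> 2-D grid
        x = self.act(self.dwconv(x))               # local spatial mixing
        x = x.reshape(B, C, N).transpose(1, 2)     # grid -> tokens
        return self.fc2(x)
```

Because the convolution is depth-wise, the added locality costs almost nothing in parameters relative to the two linear layers.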
- Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition [29.282413482297255]
This paper introduces Patch-NetVLAD, which provides a novel formulation for combining the advantages of both local and global descriptor methods.
We show that Patch-NetVLAD outperforms both global and local feature descriptor-based methods with comparable compute.
It is also adaptable to user requirements, with a speed-optimised version operating over an order of magnitude faster than the state-of-the-art.
arXiv Detail & Related papers (2021-03-02T05:53:32Z)
- Visual Concept Reasoning Networks [93.99840807973546]
A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks (a generic sketch of this pattern follows this entry).
We propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts.
Our proposed model, VCRNet, consistently improves performance while increasing the number of parameters by less than 1%.
arXiv Detail & Related papers (2020-08-26T20:02:40Z)
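For readers unfamiliar with the split-transform-merge strategy mentioned in the VCRNet summary, here is a generic ResNeXt-style sketch of the pattern; it illustrates the strategy only and is not the VCRNet architecture.

```python
import torch
import torch.nn as nn


class SplitTransformMerge(nn.Module):
    """Generic split-transform-merge block: split the channels into
    parallel branches, transform each branch independently, then merge
    by concatenation plus a residual connection."""

    def __init__(self, channels: int, branches: int = 4):
        super().__init__()
        assert channels % branches == 0
        width = channels // branches
        # Split: each branch sees its own slice of the channels.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(width, width, kernel_size=3, padding=1),
                nn.BatchNorm2d(width),
                nn.ReLU(inplace=True),
            )
            for _ in range(branches)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = torch.chunk(x, len(self.branches), dim=1)    # split
        outs = [b(s) for b, s in zip(self.branches, splits)]  # transform
        return x + torch.cat(outs, dim=1)                     # merge
```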