Transformer-Guided Convolutional Neural Network for Cross-View
Geolocalization
- URL: http://arxiv.org/abs/2204.09967v1
- Date: Thu, 21 Apr 2022 08:46:41 GMT
- Title: Transformer-Guided Convolutional Neural Network for Cross-View
Geolocalization
- Authors: Teng Wang and Shujuan Fan and Daikun Liu and Changyin Sun
- Abstract summary: We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone extracting a feature map from an input image and a Transformer head modeling global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
- Score: 20.435023745201878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ground-to-aerial geolocalization refers to localizing a ground-level query
image by matching it to a reference database of geo-tagged aerial imagery. This
is highly challenging due to the drastic differences in visual appearance and
geometric configuration between the two views. In this work,
we propose a novel Transformer-guided convolutional neural network (TransGCNN)
architecture, which couples CNN-based local features with Transformer-based
global representations for enhanced representation learning. Specifically, our
TransGCNN consists of a CNN backbone extracting a feature map from the input image
and a Transformer head modeling global context from the CNN map. In particular,
our Transformer head acts as a spatial-aware importance generator to select
salient CNN features as the final feature representation. Such a coupling
procedure allows us to leverage a lightweight Transformer network to greatly
enhance the discriminative capability of the embedded features. Furthermore, we
design a dual-branch Transformer head network to combine image features from
multi-scale windows in order to improve details of the global feature
representation. Extensive experiments on popular benchmark datasets demonstrate
that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and
CVACT_val, respectively, outperforming the second-best baseline with fewer than
50% of its parameters and an almost 2x higher frame rate, thereby achieving a
preferable accuracy-efficiency tradeoff.
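As a reading aid, the following is a minimal PyTorch-style sketch of the coupling described above: a CNN backbone extracts a feature map, two lightweight Transformer branches tokenize it at different window sizes and score every spatial location, and the averaged importance map re-weights the CNN features into a single global descriptor. All names (SpatialImportanceHead, TransGCNNSketch), layer counts, window sizes, and the average-pooling tokenization are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn


class SpatialImportanceHead(nn.Module):
    """Transformer branch scoring each spatial location of a CNN feature map."""

    def __init__(self, dim: int, window: int, num_layers: int = 2):
        super().__init__()
        # Hypothetical tokenization: average-pool the map into window-sized tokens.
        self.pool = nn.AvgPool2d(window)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        tokens = self.pool(feat)                          # (b, c, h', w')
        hp, wp = tokens.shape[-2:]
        tokens = tokens.flatten(2).transpose(1, 2)        # (b, h'*w', c)
        tokens = self.encoder(tokens)                     # model global context
        scores = self.score(tokens).reshape(b, 1, hp, wp)
        # Upsample the coarse importance map back to the CNN resolution.
        scores = nn.functional.interpolate(scores, size=(h, w), mode="bilinear",
                                           align_corners=False)
        # Normalize over spatial positions so the weights act as a selection mask.
        return torch.softmax(scores.flatten(2), dim=-1).reshape(b, 1, h, w)


class TransGCNNSketch(nn.Module):
    """CNN features re-weighted by a dual-branch, multi-scale Transformer head."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(                    # stand-in for a real CNN
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head_fine = SpatialImportanceHead(dim, window=2)
        self.head_coarse = SpatialImportanceHead(dim, window=4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)                           # (b, c, h, w)
        # Dual-branch head: average the importance maps from two window scales.
        weights = 0.5 * (self.head_fine(feat) + self.head_coarse(feat))
        # Importance-weighted pooling keeps salient CNN features as the embedding.
        return (feat * weights).sum(dim=(2, 3))           # (b, c) descriptor

# Usage: TransGCNNSketch()(torch.randn(2, 3, 64, 64)) -> tensor of shape (2, 64).

In the actual retrieval task, one would presumably train two such networks (one per view) with a metric-learning loss so that descriptors of matching ground and aerial images end up close in the embedding space.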
Related papers
- Interaction-Guided Two-Branch Image Dehazing Network [1.26404863283601]
Image dehazing aims to restore clean images from hazy ones.
CNNs and Transformers have demonstrated exceptional performance in local and global feature extraction, respectively.
We propose a novel dual-branch image dehazing framework that guides CNN and Transformer components interactively.
arXiv Detail & Related papers (2024-10-14T03:21:56Z)
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- ConvFormer: Combining CNN and Transformer for Medical Image Segmentation [17.88894109620463]
We propose a hierarchical CNN and Transformer hybrid architecture, called ConvFormer, for medical image segmentation.
Our ConvFormer, trained from scratch, outperforms various CNN- or Transformer-based architectures, achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-11-15T23:11:22Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [19.21709807149165]
Existing salient object detection (SOD) methods mainly rely on U-shaped convolution neural networks (CNNs) with skip connections.
We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net) to learn both global and local representations for SOD.
ABiU-Net performs favorably against previous state-of-the-art SOD methods.
arXiv Detail & Related papers (2021-08-17T19:45:28Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID, achieving high performance.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Conformer: Local Features Coupling Global Representations for Visual Recognition [72.9550481476101]
We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning.
Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet.
arXiv Detail & Related papers (2021-05-09T10:00:03Z)
- Spherical Transformer: Adapting Spherical Signal to CNNs [53.18482213611481]
The Spherical Transformer transforms spherical signals into vectors that can be directly processed by standard CNNs.
We evaluate our approach on the tasks of spherical MNIST recognition, 3D object classification and omnidirectional image semantic segmentation.
arXiv Detail & Related papers (2021-01-11T12:33:16Z)