TransVPR: Transformer-based place recognition with multi-level attention aggregation
- URL: http://arxiv.org/abs/2201.02001v1
- Date: Thu, 6 Jan 2022 10:20:24 GMT
- Title: TransVPR: Transformer-based place recognition with multi-level attention aggregation
- Authors: Ruotong Wang, Yanqing Shen, Weiliang Zuo, Sanping Zhou, Nanning Zheng
- Abstract summary: We introduce a novel holistic place recognition model, TransVPR, based on vision Transformers.
TransVPR achieves state-of-the-art performance on several real-world benchmarks.
- Score: 9.087163485833058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual place recognition is a challenging task for applications such as
autonomous driving navigation and mobile robot localization. Distracting
elements present in complex scenes often lead to deviations in the
perception of visual place. To address this problem, it is crucial to integrate
information from only task-relevant regions into image representations. In this
paper, we introduce a novel holistic place recognition model, TransVPR, based
on vision Transformers. It benefits from the desirable property of the
self-attention operation in Transformers which can naturally aggregate
task-relevant features. Attentions from multiple levels of the Transformer,
which focus on different regions of interest, are further combined to generate
a global image representation. In addition, the output tokens from Transformer
layers, filtered by the fused attention mask, serve as key-patch
descriptors, which are used to perform spatial matching to re-rank the
candidates retrieved by the global image features. The whole model allows
end-to-end training with a single objective and image-level supervision.
TransVPR achieves state-of-the-art performance on several real-world benchmarks
while maintaining low computational time and storage requirements.
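Below is a minimal sketch of the two-stage pipeline the abstract describes, not the authors' implementation: attention maps from several Transformer levels are fused into one mask, the mask pools patch tokens into a global descriptor, and tokens passing an attention threshold become key-patch descriptors used to re-rank retrieved candidates. The per-level sigmoid scorers, all names, and the mutual nearest-neighbour matcher (a simplified stand-in for the paper's spatial matching) are illustrative assumptions.

```python
# Minimal sketch, not TransVPR's code: the per-level scorers, names, and the
# mutual nearest-neighbour matcher (standing in for the paper's spatial
# matching) are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAttentionAggregator(nn.Module):
    def __init__(self, dim=256, num_levels=3):
        super().__init__()
        # one lightweight attention scorer per Transformer level (assumed form)
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_levels)])

    def forward(self, level_tokens):
        # level_tokens: list of (B, N, dim) patch-token outputs from several levels
        attns = [torch.sigmoid(s(t)) for s, t in zip(self.scorers, level_tokens)]
        fused = torch.stack(attns).prod(dim=0)          # (B, N, 1) fused attention mask
        tokens = level_tokens[-1]                       # last-level patch tokens
        global_desc = F.normalize((fused * tokens).sum(dim=1), dim=-1)  # (B, dim)
        return global_desc, tokens, fused

def rerank_score(q_tok, q_attn, c_tok, c_attn, thresh=0.5):
    """Score one retrieved candidate by mutual nearest-neighbour matching
    between key patches (tokens whose fused attention exceeds `thresh`)."""
    q = F.normalize(q_tok[q_attn.squeeze(-1) > thresh], dim=-1)   # (Nq, dim)
    c = F.normalize(c_tok[c_attn.squeeze(-1) > thresh], dim=-1)   # (Nc, dim)
    if q.numel() == 0 or c.numel() == 0:
        return 0.0
    sim = q @ c.T                                   # patch-to-patch similarities
    best_c = sim.argmax(dim=1)                      # best candidate patch per query patch
    best_q = sim.argmax(dim=0)                      # best query patch per candidate patch
    mutual = best_q[best_c] == torch.arange(q.size(0))
    return mutual.float().mean().item()             # match ratio in [0, 1]
```

At query time the global descriptor retrieves a candidate shortlist and `rerank_score` reorders it; per the abstract, the actual model trains end-to-end with a single objective and image-level supervision.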
Related papers
- Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of the global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID (a toy sketch of fusing the two feature types follows this entry).
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z)
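As a loose illustration of combining ViT's global and local features (not GLTrans itself; the head below and its names are assumptions), the class token can be fused with pooled patch tokens into one re-ID embedding:

```python
# Toy sketch, assumed rather than taken from GLTrans: fuse the ViT class token
# (global view) with pooled patch tokens (local view) into one re-ID embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalHead(nn.Module):
    def __init__(self, dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, out_dim)     # fuses the two feature types

    def forward(self, vit_tokens):                  # (B, 1 + N, dim), class token first
        global_feat = vit_tokens[:, 0]              # class token
        local_feat = vit_tokens[:, 1:].mean(dim=1)  # average of patch tokens
        fused = torch.cat([global_feat, local_feat], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)
```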
- PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion [2.3020018305241337]
PlaceFormer is a transformer-based approach for visual place recognition.
PlaceFormer employs patch tokens from the transformer to create global image descriptors.
It selects patches that correspond to task-relevant areas of an image (a generic multi-scale selection sketch follows this entry).
arXiv Detail & Related papers (2024-01-23T20:28:06Z)
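A minimal sketch of multi-scale patch selection, assuming a square token grid and a per-patch relevance score; this is a generic illustration, not PlaceFormer's actual selection-and-fusion procedure:

```python
# Generic sketch (not PlaceFormer's code): pool the patch-token grid at several
# scales and keep only the highest-scoring patches at each scale for matching.
import torch
import torch.nn.functional as F

def multiscale_key_patches(tokens, scores, grid=14, scales=(1, 2, 4), top_k=32):
    # tokens: (N, D) patch tokens; scores: (N,) per-patch relevance, N = grid**2
    N, D = tokens.shape
    tok = tokens.T.reshape(1, D, grid, grid)                  # token grid
    sco = scores.reshape(1, 1, grid, grid)                    # score grid
    selected = []
    for s in scales:
        t = F.avg_pool2d(tok, s).flatten(2).squeeze(0).T      # (N_s, D) pooled patches
        c = F.avg_pool2d(sco, s).flatten()                    # (N_s,) pooled scores
        keep = c.topk(min(top_k, t.size(0))).indices
        selected.append(t[keep])
    return torch.cat(selected)                                # key patches across scales
```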
- TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD.
It improves feature extraction from a global view and combines multi-level visual features in a pyramid manner (a generic fusion sketch follows this entry).
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
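The pyramid-style combination of multi-level features can be sketched as follows; this is a generic FPN-like fusion under assumed names and channel sizes, not TransY-Net's actual decoder:

```python
# Generic pyramid fusion sketch (assumed, not TransY-Net's code): deeper,
# coarser maps are upsampled and added to shallower ones before a task head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    def __init__(self, channels=(64, 128, 256), out_ch=64):
        super().__init__()
        # 1x1 convolutions project every level to a common width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in channels])

    def forward(self, feats):                       # feature maps, shallow -> deep
        feats = [l(f) for l, f in zip(self.lateral, feats)]
        fused = feats[-1]
        for f in reversed(feats[:-1]):              # upsample deep, add shallow
            fused = f + F.interpolate(fused, size=f.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return fused                                # highest-resolution fused map
```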
- FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer [29.95553680263075]
We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that adaptively reconciles features with multiple receptive fields.
FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
arXiv Detail & Related papers (2023-10-20T15:54:18Z)
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers (a generic masked-autoencoding sketch follows this entry).
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
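A minimal masked-autoencoding sketch in the spirit of MAE (this paper's exact setup may differ; every name below is an assumption): most patch tokens are hidden, the encoder sees only the visible ones, and a light decoder reconstructs the rest.

```python
# Generic masked-autoencoding sketch, not this paper's model: encode visible
# patch tokens only, then reconstruct the full sequence from a mask token.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, 2)
        self.decoder = nn.Linear(dim, dim)          # stand-in for a real decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens):                      # (B, N, dim) patch embeddings
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N).argsort(dim=1)       # random patch order per image
        vis = torch.gather(tokens, 1, idx[:, :keep, None].expand(-1, -1, D))
        encoded = self.encoder(vis)                 # encoder sees visible patches only
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, idx[:, :keep, None].expand(-1, -1, D), encoded)
        return self.decoder(full)                   # reconstruction of all patches
```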
- Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation [11.237929167356725]
We propose an efficient hybrid Transformer (EHT) for semantic segmentation of urban scene images.
EHT takes advantage of CNNs and Transformer, learning global-local context to strengthen the feature representation.
The proposed EHT achieves a 67.0% mIoU on the UAVid test set and outperforms other lightweight models significantly.
arXiv Detail & Related papers (2021-09-18T13:55:38Z)
- RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor that has been explored extensively by approaches based on convolutional neural networks (CNNs).
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences (a small check of one such equivalence follows this entry).
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
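One piece of that near-equivalence is easy to verify (a self-contained check, not code from the paper): a 1x1 convolution over a patch grid computes exactly the same function as a fully-connected layer applied to each patch token.

```python
# Self-contained check: a 1x1 convolution on a feature grid equals a linear
# layer applied token-wise, once their weights are tied.
import torch
import torch.nn as nn

B, C_in, C_out, H, W = 2, 64, 128, 14, 14
x_grid = torch.randn(B, C_in, H, W)               # patch features as a 2D grid
x_tokens = x_grid.flatten(2).transpose(1, 2)      # (B, H*W, C_in) token sequence

conv = nn.Conv2d(C_in, C_out, kernel_size=1)
fc = nn.Linear(C_in, C_out)
fc.weight.data.copy_(conv.weight.data.view(C_out, C_in))  # tie parameters
fc.bias.data.copy_(conv.bias.data)

y_conv = conv(x_grid).flatten(2).transpose(1, 2)  # conv output as tokens
y_fc = fc(x_tokens)                               # linear output per token
print(torch.allclose(y_conv, y_fc, atol=1e-6))    # True
```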
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence to facilitate information exchange with the tokens corresponding to image patches (a generic sketch follows this entry).
arXiv Detail & Related papers (2021-05-23T12:44:27Z)
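A minimal sketch of the context-token idea, assuming a ViT-style encoder (the class and dimensions below are illustrative, not the paper's architecture): a learnable token is prepended so self-attention lets it exchange information with every patch token.

```python
# Generic sketch (not the paper's model): prepend a learnable context token,
# analogous to the classification token, to the patch-token sequence.
import torch
import torch.nn as nn

class ContextTokenEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.context_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens):                # (B, N, dim)
        b = patch_tokens.size(0)
        ctx = self.context_token.expand(b, -1, -1)  # shared learnable token
        out = self.encoder(torch.cat([ctx, patch_tokens], dim=1))
        return out[:, 0], out[:, 1:]                # context summary, patch features
```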
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.