PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion
- URL: http://arxiv.org/abs/2401.13082v2
- Date: Mon, 27 May 2024 22:18:45 GMT
- Title: PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion
- Authors: Shyam Sundar Kannan, Byung-Cheol Min
- Abstract summary: PlaceFormer is a transformer-based approach for visual place recognition.
PlaceFormer employs patch tokens from the transformer to create global image descriptors.
It selects patches that correspond to task-relevant areas in an image.
- Score: 2.3020018305241337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual place recognition, which aims to identify a location or place from visual inputs, is a challenging task in computer vision and in autonomous robotics and vehicles. Contemporary methods in visual place recognition employ convolutional neural networks and utilize every region within the image for the place recognition task. However, the presence of dynamic and distracting elements in the image may impact the effectiveness of the place recognition process. Therefore, it is meaningful to focus on task-relevant regions of the image for improved recognition. In this paper, we present PlaceFormer, a novel transformer-based approach for visual place recognition. PlaceFormer employs patch tokens from the transformer to create global image descriptors, which are then used for image retrieval. To re-rank the retrieved images, PlaceFormer merges the patch tokens from the transformer to form multi-scale patches. Utilizing the transformer's self-attention mechanism, it selects patches that correspond to task-relevant areas in an image. These selected patches undergo geometric verification, generating similarity scores across different patch sizes. Subsequently, spatial scores from each patch size are fused to produce a final similarity score. This score is then used to re-rank the images initially retrieved using global image descriptors. Extensive experiments on benchmark datasets demonstrate that PlaceFormer outperforms several state-of-the-art methods in terms of accuracy and computational efficiency, requiring less time and memory.
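The selection and fusion steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`select_patches`, `fuse_scores`, `rerank`), the `keep_ratio` parameter, the equal-weight average used for fusion, and the candidate names and scores are all assumptions made for illustration. Geometric verification is not shown; its per-scale scores are taken as given inputs.

```python
def select_patches(attention, keep_ratio=0.5):
    """Return indices of the highest-attention patches, in original order.

    `attention` holds one score per patch. Keeping a fixed ratio is an
    assumption; the abstract only says task-relevant patches are selected
    using self-attention.
    """
    k = max(1, int(len(attention) * keep_ratio))
    top = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)[:k]
    return sorted(top)


def fuse_scores(scale_scores, weights=None):
    """Fuse per-scale spatial scores into one similarity score.

    Defaults to an equal-weight average (an assumption, not the paper's rule).
    """
    if weights is None:
        weights = [1.0 / len(scale_scores)] * len(scale_scores)
    return sum(w * s for w, s in zip(weights, scale_scores))


def rerank(candidates, scores_per_candidate, weights=None):
    """Re-order candidates from global retrieval by their fused score."""
    fused = {c: fuse_scores(scores_per_candidate[c], weights) for c in candidates}
    return sorted(candidates, key=lambda c: fused[c], reverse=True)


# Hypothetical data: four patch attention scores and three retrieved images,
# each with spatial scores at three patch sizes.
attention = [0.05, 0.40, 0.10, 0.45]
kept = select_patches(attention, keep_ratio=0.5)

candidates = ["db_07", "db_12", "db_03"]  # order from global descriptors
scores = {
    "db_07": [0.2, 0.3, 0.1],
    "db_12": [0.6, 0.5, 0.7],
    "db_03": [0.4, 0.4, 0.4],
}
reranked = rerank(candidates, scores)
print(kept)
print(reranked)
```

In this toy run the two highest-attention patches are kept, and the candidate with the largest average spatial score moves to the top of the list regardless of its initial retrieval rank.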
Related papers
- Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers [5.825612611197359]
Fine-grained recognition involves the classification of images from subordinate macro-categories.
We propose a novel and computationally inexpensive metric to identify discriminative regions in an image.
Our method achieves these results at a much lower computational cost compared to the alternatives.
arXiv Detail & Related papers (2024-07-17T10:04:54Z)
- TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.
arXiv Detail & Related papers (2024-07-16T02:26:18Z)
- Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP processes co-visible image sections by obtaining patch-level embeddings using a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
arXiv Detail & Related papers (2024-06-23T20:00:20Z)
- Register assisted aggregation for Visual Place Recognition [4.5476780843439535]
Visual Place Recognition (VPR) refers to the process of using computer vision to recognize the position of the current query image.
Previous methods often discarded useless features but, in doing so, also indiscriminately discarded features that help improve recognition accuracy.
We propose a new feature aggregation method to address this issue. Specifically, in order to obtain global and local features that contain discriminative place information, we added some registers on top of the original image tokens.
arXiv Detail & Related papers (2024-05-19T11:36:52Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach [4.9204263448542465]
This study introduces an innovative, fine-grained dimension by integrating patch-level discrimination into self-supervised visual representation learning.
We employ a distinctive photometric patch-level augmentation, where each patch is individually augmented, independent from other patches within the same view.
We present a simple yet effective patch-matching algorithm to find the corresponding patches across the augmented views.
arXiv Detail & Related papers (2023-10-28T09:35:30Z)
- ObjectFormer for Image Manipulation Detection and Localization [118.89882740099137]
We propose ObjectFormer to detect and localize image manipulations.
We extract high-frequency features of the images and combine them with RGB features as multimodal patch embeddings.
We conduct extensive experiments on various datasets and the results verify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-03-28T12:27:34Z)
- TransVPR: Transformer-based place recognition with multi-level attention aggregation [9.087163485833058]
We introduce a novel holistic place recognition model, TransVPR, based on vision Transformers.
TransVPR achieves state-of-the-art performance on several real-world benchmarks.
arXiv Detail & Related papers (2022-01-06T10:20:24Z)
- Sparse Spatial Transformers for Few-Shot Learning [6.271261279657655]
Learning from limited data is challenging because data scarcity leads to a poor generalization of the trained model.
We propose a novel transformer-based neural network architecture called sparse spatial transformers.
Our method finds task-relevant features and suppresses task-irrelevant features.
arXiv Detail & Related papers (2021-09-27T10:36:32Z)
- A Hierarchical Transformation-Discriminating Generative Model for Few Shot Anomaly Detection [93.38607559281601]
We devise a hierarchical generative model that captures the multi-scale patch distribution of each training image.
The anomaly score is obtained by aggregating the patch-based votes of the correct transformation across scales and image regions.
arXiv Detail & Related papers (2021-04-29T17:49:48Z)
- Geometrically Mappable Image Features [85.81073893916414]
Vision-based localization of an agent in a map is an important problem in robotics and computer vision.
We propose a method that learns image features targeted for image-retrieval-based localization.
arXiv Detail & Related papers (2020-03-21T15:36:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.