EDTformer: An Efficient Decoder Transformer for Visual Place Recognition
- URL: http://arxiv.org/abs/2412.00784v1
- Date: Sun, 01 Dec 2024 12:14:36 GMT
- Title: EDTformer: An Efficient Decoder Transformer for Visual Place Recognition
- Authors: Tong Jin, Feng Lu, Shuyu Hu, Chun Yuan, Yunpeng Liu
- Abstract summary: Visual place recognition (VPR) aims to determine the general geographical location of a query image.
We propose an Efficient Decoder Transformer (EDTformer) for feature aggregation.
EDTformer generates robust and discriminative global representations for VPR.
- Score: 34.875097011568336
- Abstract: Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches aggregate the deep features extracted from a backbone using currently prominent architectures (e.g., CNNs, MLPs, pooling layers, and transformer encoders), giving little attention to the transformer decoder. However, we argue that the decoder's strong capability in capturing contextual dependencies and generating accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly generate robust and discriminative global representations for VPR. Specifically, we formulate the deep features as the keys and values, and a set of independent learnable parameters as the queries. EDTformer can fully utilize the contextual information within the deep features, then gradually decode and aggregate the effective features into the learnable queries to form the final global representations. Moreover, to provide powerful deep features for EDTformer and further improve robustness, we use the foundation model DINOv2 as the backbone and propose a Low-Rank Parallel Adaptation (LoPA) method to enhance it, which refines the intermediate features of the backbone progressively in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods, which add a re-ranking step at considerable cost. Code will be available at https://github.com/Tong-Jin01/EDTformer.
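The aggregation scheme described in the abstract (backbone features as keys and values, learnable parameters as queries, stacked decoder blocks followed by two linear layers) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the block layout (single-head cross-attention with a residual connection), the way the two linear layers are applied, all dimensions, and the names `decoder_block`, `aggregate`, and `init_params` are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_params(dim, n_queries, n_blocks, d_small, d_out):
    """Random stand-ins for trained weights (hypothetical layout)."""
    s = dim ** -0.5
    blocks = [{k: rng.normal(0, s, (dim, dim)) for k in ("wq", "wk", "wv", "wo")}
              for _ in range(n_blocks)]
    return {
        "queries": rng.normal(size=(n_queries, dim)),   # learnable queries
        "blocks": blocks,
        "w1": rng.normal(0, s, (dim, d_small)),          # first linear layer
        "w2": rng.normal(0, 0.1, (n_queries * d_small, d_out)),  # second linear layer
    }

def decoder_block(queries, feats, p):
    """Assumed simplified block: cross-attention plus a residual connection.
    queries: (Q, D) act as attention queries; feats: (N, D) give keys and values."""
    q = layer_norm(queries) @ p["wq"]
    k = feats @ p["wk"]
    v = feats @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (Q, N) weights over patches
    return queries + (attn @ v) @ p["wo"]

def aggregate(feats, params):
    """Decode patch features into the queries, then project to one descriptor."""
    q = params["queries"]
    for blk in params["blocks"]:
        q = decoder_block(q, feats, blk)
    h = q @ params["w1"]               # (Q, D) -> (Q, d_small)
    g = h.reshape(-1) @ params["w2"]   # flatten, then (Q * d_small,) -> (d_out,)
    return g / np.linalg.norm(g)       # L2-normalise for cosine retrieval

# Toy retrieval: 5 database "images", each with 16 patch features of width 32.
params = init_params(dim=32, n_queries=4, n_blocks=2, d_small=8, d_out=64)
database = [rng.normal(size=(16, 32)) for _ in range(5)]
db_desc = np.stack([aggregate(f, params) for f in database])

query_feats = database[2]                  # query identical to database item 2
scores = db_desc @ aggregate(query_feats, params)  # cosine similarities
best = int(np.argmax(scores))              # index of the retrieved place
```

In this sketch the number of output tokens is fixed by the number of queries rather than by the number of patches, so the descriptor size does not depend on the input resolution. In a real system the random patch features would come from the backbone (DINOv2 in the paper) and every weight would be learned.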
Related papers
- Kolmogorov-Arnold Network for Remote Sensing Image Semantic Segmentation [8.891804836416275]
We propose a novel semantic segmentation network, namely DeepKANSeg.
First, we introduce a KAN-based deep feature refinement module, namely DeepKAN.
Second, we replace the traditional multi-layer perceptron (MLP) layers in the global-local combined decoder with KAN-based linear layers, namely GLKAN.
arXiv Detail & Related papers (2025-01-13T15:06:51Z)
- A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
A popular similarity-based feature upsampling pipeline has been proposed, which utilizes a high-resolution feature as guidance.
We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives.
We develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
arXiv Detail & Related papers (2024-07-02T14:12:21Z)
- FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer [29.95553680263075]
We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that reconciles different features with multiple receptive fields adaptively.
FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
arXiv Detail & Related papers (2023-10-20T15:54:18Z)
- ClusVPR: Efficient Visual Place Recognition with Clustering-based Weighted Transformer [13.0858576267115]
We present ClusVPR, a novel approach that tackles the specific issues of redundant information in duplicate regions and representations of small objects.
ClusVPR introduces a unique paradigm called Clustering-based Weighted Transformer Network (CWTNet).
We also introduce the optimized-VLAD layer that significantly reduces the number of parameters and enhances model efficiency.
arXiv Detail & Related papers (2023-10-06T09:01:15Z)
- Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
- Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information [15.32353270625554]
Cross-modal remote sensing text-image retrieval (RSCTIR) has recently become an active research topic because it enables fast and flexible information extraction from remote sensing (RS) images.
We first propose a novel RSCTIR framework based on global and local information (GaLR), and design a multi-level information dynamic fusion (MIDF) module to efficaciously integrate features of different levels.
Experiments on public datasets strongly demonstrate the state-of-the-art performance of GaLR methods on the RSCTIR task.
arXiv Detail & Related papers (2022-04-21T03:18:09Z)
- Reuse your features: unifying retrieval and feature-metric alignment [3.845387441054033]
DRAN is the first network able to produce the features for the three steps of visual localization.
It achieves competitive performance in terms of robustness and accuracy under challenging conditions in public benchmarks.
arXiv Detail & Related papers (2022-04-13T10:42:00Z)
- Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers to achieve high performance in image-based person Re-ID.
This work is the first to combine CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose a novel holistically-guided decoder to obtain high-resolution, semantically rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)