StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition
- URL: http://arxiv.org/abs/2503.06601v2
- Date: Sun, 30 Mar 2025 07:53:14 GMT
- Title: StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition
- Authors: Yanqing Shen, Sanping Zhou, Jingwen Fu, Ruotong Wang, Shitao Chen, Nanning Zheng
- Abstract summary: StructVPR++ is a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, enabling explicit semantic alignment between image pairs. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5-23% in Recall@1.
- Score: 42.06059249327199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual place recognition is a challenging task for autonomous driving and robotics, and it is usually formulated as an image retrieval problem. A commonly used two-stage strategy involves global retrieval followed by re-ranking with patch-level descriptors. Most end-to-end deep learning methods cannot extract global features with sufficient semantic information from RGB images. In contrast, re-ranking can exploit more explicit structural and semantic information in the one-to-one matching process, but it is time-consuming. To bridge the gap between global retrieval and re-ranking and achieve a good trade-off between accuracy and efficiency, we propose StructVPR++, a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, enabling explicit semantic alignment between image pairs without requiring segmentation during deployment. Furthermore, we introduce a sample-wise weighted distillation strategy that prioritizes reliable training pairs while suppressing noisy ones. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5-23% in Recall@1 and even outperforms many two-stage approaches, achieving real-time efficiency with a single RGB input.
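To make the distillation idea concrete, here is a minimal PyTorch sketch of a sample-wise weighted distillation loss in the spirit of the abstract. The function name, the cosine-based mismatch term, and the way weights are supplied are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of sample-wise weighted feature distillation: a
# segmentation-aware teacher provides target descriptors, and per-pair
# weights suppress noisy training pairs. Names and the weighting rule
# are illustrative assumptions only.
import torch
import torch.nn.functional as F

def weighted_distillation_loss(student_desc, teacher_desc, pair_weights):
    """student_desc, teacher_desc: (B, D) L2-normalized global descriptors.
    pair_weights: (B,) per-sample reliability weights in [0, 1]."""
    # Cosine-style distillation: penalize descriptor mismatch per sample.
    per_sample = 1.0 - F.cosine_similarity(student_desc, teacher_desc, dim=1)
    # Reliable pairs contribute more; noisy pairs are suppressed.
    return (pair_weights * per_sample).sum() / pair_weights.sum().clamp(min=1e-6)

# Example usage with random tensors standing in for real descriptors.
B, D = 8, 256
student = F.normalize(torch.randn(B, D), dim=1)
teacher = F.normalize(torch.randn(B, D), dim=1)
weights = torch.rand(B)  # in practice derived from pair reliability
loss = weighted_distillation_loss(student, teacher, weights)
print(loss.item())
```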
Related papers
- EDTformer: An Efficient Decoder Transformer for Visual Place Recognition [34.875097011568336]
Visual place recognition (VPR) aims to determine the general geographical location of a query image. We propose an Efficient Decoder Transformer (EDTformer) for feature aggregation. EDTformer generates robust and discriminative global representations for VPR.
arXiv Detail & Related papers (2024-12-01T12:14:36Z)
- Semantic Segmentation and Scene Reconstruction of RGB-D Image Frames: An End-to-End Modular Pipeline for Robotic Applications [0.7951977175758216]
Traditional RGB-D processing pipelines focus primarily on geometric reconstruction.
We introduce a novel end-to-end modular pipeline that integrates semantic segmentation, human tracking, point-cloud fusion, and scene reconstruction.
We validate our approach on benchmark datasets and real-world Kinect RGB-D data, demonstrating improved efficiency, accuracy, and usability.
arXiv Detail & Related papers (2024-10-23T16:01:31Z)
- Sharing Key Semantics in Transformer Makes Efficient Image Restoration [148.22790334216117]
The self-attention mechanism, a cornerstone of Vision Transformers (ViTs), tends to encompass all global cues. However, small segments of a degraded image, particularly those closely aligned semantically, provide especially relevant information to aid the restoration process. We propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper.
arXiv Detail & Related papers (2024-05-30T12:45:34Z)
- Deep Homography Estimation for Visual Place Recognition [49.235432979736395]
We propose a transformer-based deep homography estimation (DHE) network.
It takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification.
Experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. A sketch of the classical geometric-verification baseline that such learnable networks accelerate appears after this list.
arXiv Detail & Related papers (2024-02-25T13:22:17Z)
- StructVPR: Distill Structural Knowledge with Weighting Samples for Visual Place Recognition [49.58170209388029]
Visual place recognition (VPR) is usually considered a specific image retrieval problem.
We propose StructVPR, a novel training architecture for VPR, to enhance structural knowledge in RGB global features.
Our method achieves state-of-the-art performance while maintaining a low computational cost.
arXiv Detail & Related papers (2022-12-02T02:52:01Z)
- Deepfake Detection via Joint Unsupervised Reconstruction and Supervised Classification [25.84902508816679]
We introduce a novel approach for deepfake detection, which considers the reconstruction and classification tasks simultaneously.
This method shares the information learned by one task with the other, addressing an aspect that existing works rarely consider.
Our method achieves state-of-the-art performance on three commonly used datasets. A minimal sketch of this joint reconstruction-and-classification setup appears after this list.
arXiv Detail & Related papers (2022-11-24T05:44:26Z)
- Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images, providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately. A simplified gated-fusion sketch in this spirit appears after this list.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
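As referenced in the Deep Homography Estimation entry above, here is a minimal Python sketch of the classical RANSAC-based geometric verification that learnable homography networks like DHE aim to speed up. The ORB features, matcher settings, and inlier threshold are illustrative assumptions; this is a baseline for intuition, not the DHE method itself.

```python
# Classical geometric verification for re-ranking: match local features
# between query and candidate, fit a homography with RANSAC, and score
# the candidate by its inlier count. Illustrative baseline only.
import cv2
import numpy as np

def inlier_count(img_query, img_candidate):
    """Both inputs are grayscale uint8 images."""
    orb = cv2.ORB_create(nfeatures=1000)
    kq, dq = orb.detectAndCompute(img_query, None)
    kc, dc = orb.detectAndCompute(img_candidate, None)
    if dq is None or dc is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(dq, dc)
    if len(matches) < 4:  # a homography needs at least 4 correspondences
        return 0
    src = np.float32([kq[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kc[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(mask.sum()) if mask is not None else 0

# Re-rank the top-k retrieval candidates by geometric consistency:
# candidates.sort(key=lambda c: inlier_count(query, c), reverse=True)
```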
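As referenced in the Deepfake Detection entry above, here is a minimal PyTorch sketch of joint unsupervised reconstruction and supervised classification with a shared encoder, so the two tasks share learned representations. The layer sizes, two-class head, and loss weight `alpha` are assumptions for illustration, not the paper's architecture.

```python
# Illustrative joint model: one encoder feeds both a decoder (unsupervised
# reconstruction) and a classifier head (supervised real/fake prediction).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2),
        )

    def forward(self, x):
        z = self.encoder(x)           # shared representation
        return self.decoder(z), self.classifier(z)

model = JointModel()
x = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 2, (4,))
recon, logits = model(x)
alpha = 0.5  # assumed balance between the two objectives
loss = F.mse_loss(recon, x) + alpha * F.cross_entropy(logits, labels)
```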
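As referenced in the Bi-directional Cross-Modality Feature Propagation entry above, here is a simplified gated cross-modality fusion sketch in the spirit of recalibrating each stream with cues from both modalities before aggregation. The channel-gate design is an assumed simplification, not the paper's exact Separation-and-Aggregation Gate.

```python
# Simplified gated RGB-D fusion: channel gates predicted from the
# concatenated features recalibrate each modality, and the recalibrated
# streams are aggregated. Assumed simplification for illustration.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One gate per modality, conditioned on both modalities.
        self.gate_rgb = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_depth = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb, depth):
        joint = torch.cat([rgb, depth], dim=1)
        rgb_rec = rgb * self.gate_rgb(joint)      # recalibrated RGB stream
        depth_rec = depth * self.gate_depth(joint)  # recalibrated depth stream
        return rgb_rec + depth_rec                # aggregated representation

fusion = GatedFusion(64)
rgb = torch.randn(2, 64, 32, 32)
depth = torch.randn(2, 64, 32, 32)
out = fusion(rgb, depth)  # (2, 64, 32, 32)
```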