DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition
- URL: http://arxiv.org/abs/2507.18444v1
- Date: Thu, 24 Jul 2025 14:29:30 GMT
- Title: DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition
- Authors: Haiyang Jiang, Songhao Piao, Chao Gao, Lei Yu, Liguo Chen
- Abstract summary: We propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. Our approach achieves state-of-the-art performance across most benchmark datasets.
- Score: 16.386674597850778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Place Recognition (VPR) is crucial for robust mobile robot localization, yet it faces significant challenges in maintaining reliable performance under varying environmental conditions and viewpoints. To address this, we propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. DSFormer enhances feature representation by enabling bidirectional information transfer between dual-scale features extracted from the final two CNN layers, capturing both semantic richness and spatial details through self-attention for long-range dependencies within each scale and shared cross-attention for cross-scale learning. Complementing this, our block clustering strategy repartitions the widely used San Francisco eXtra Large (SF-XL) training dataset from multiple distinct perspectives, optimizing data organization to further bolster robustness against viewpoint variations. Together, these innovations not only yield a robust global embedding adaptable to environmental changes but also reduce the required training data volume by approximately 30% compared to previous partitioning methods. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance across most benchmark datasets, surpassing advanced reranking methods like DELG, Patch-NetVLAD, TransVPR, and R2Former as a global retrieval solution using 512-dim global descriptors, while significantly improving computational efficiency.
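The abstract's core mechanism, self-attention within each scale plus a shared cross-attention that passes information in both directions between two CNN feature scales, can be illustrated with a short PyTorch sketch. This is a minimal reading of the abstract, not the authors' code: the module name `DualScaleCrossBlock`, the token shapes, and all hyperparameters are assumptions introduced here for illustration.

```python
# Minimal, assumption-laden sketch of the dual-scale cross-learning block
# described in the DSFormer abstract; not the authors' released implementation.
import torch
import torch.nn as nn


class DualScaleCrossBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Self-attention per scale: long-range dependencies within each scale.
        self.self_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A single cross-attention used in both directions, matching the
        # abstract's "shared cross-attention for cross-scale learning".
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_b = nn.LayerNorm(d_model)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, tokens, d_model), e.g. flattened feature maps
        # from the final two CNN layers projected to a common channel width.
        a = feat_a + self.self_attn_a(feat_a, feat_a, feat_a, need_weights=False)[0]
        b = feat_b + self.self_attn_b(feat_b, feat_b, feat_b, need_weights=False)[0]
        # Bidirectional transfer: each scale queries the other with shared weights.
        a = self.norm_a(a + self.cross_attn(a, b, b, need_weights=False)[0])
        b = self.norm_b(b + self.cross_attn(b, a, a, need_weights=False)[0])
        return a, b


block = DualScaleCrossBlock()
coarse = torch.randn(2, 49, 256)   # e.g. tokens from the last CNN layer (7x7)
fine = torch.randn(2, 196, 256)    # e.g. tokens from the penultimate layer (14x14)
out_coarse, out_fine = block(coarse, fine)
print(out_coarse.shape, out_fine.shape)  # (2, 49, 256) (2, 196, 256)
```

In a full system the two refined token sets would still need to be aggregated into the 512-dim global descriptor the abstract uses for retrieval; that aggregation step is not specified in the abstract and is omitted here.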
Related papers
- Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID [82.12123628480371]
Unsupervised person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. We propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up objectives for the specific fine-grained patterns emphasized by each modality.
arXiv Detail & Related papers (2025-04-27T13:58:12Z) - Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently via just one model has become increasingly significant. Our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations. To fuse the degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z) - Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data [1.0901840476380924]
This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets. Our method performs targeted data transformations by applying random noise perturbations to foreground objects. By augmenting training data through structured transformations, our method enables model generalization across domains.
arXiv Detail & Related papers (2025-04-17T16:42:33Z) - C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales [6.700548615812325]
We propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures.
arXiv Detail & Related papers (2025-03-17T21:52:18Z) - BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs). We propose BHViT, a binarization-friendly hybrid ViT architecture, and its fully binarized model, guided by three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z) - One for all: A novel Dual-space Co-training baseline for Large-scale Multi-View Clustering [42.92751228313385]
We propose a novel multi-view clustering model, named Dual-space Co-training Large-scale Multi-view Clustering (DSCMC).
The main objective of our approach is to enhance the clustering performance by leveraging co-training in two distinct spaces.
Our algorithm has an approximate linear computational complexity, which guarantees its successful application on large-scale datasets.
arXiv Detail & Related papers (2024-01-28T16:30:13Z) - Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation [49.57907601086494]
Medical image segmentation plays a crucial role in computer-aided diagnosis.
We propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image segmentation (DEC-Seg).
arXiv Detail & Related papers (2023-12-26T12:56:31Z) - GARNet: Global-Aware Multi-View 3D Reconstruction Network and the Cost-Performance Tradeoff [10.8606881536924]
We propose a global-aware attention-based fusion approach that builds the correlation between each branch and the global representation to provide a comprehensive foundation for weight inference. To enhance the network's capability, we introduce a novel loss function that supervises the overall shape.
Experiments on ShapeNet verify that our method outperforms existing SOTA methods.
arXiv Detail & Related papers (2022-11-04T07:45:19Z) - Multi-scale and Cross-scale Contrastive Learning for Semantic Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS). Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion decoding stage. We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)