SupScene: Learning Overlap-Aware Global Descriptor for Unconstrained SfM
- URL: http://arxiv.org/abs/2601.11930v1
- Date: Sat, 17 Jan 2026 06:28:47 GMT
- Title: SupScene: Learning Overlap-Aware Global Descriptor for Unconstrained SfM
- Authors: Xulei Shi, Maoyu Wang, Yuning Peng, Guanbo Wang, Xin Wang, Qi Chen, Pengjie Tao
- Abstract summary: SupScene is a novel solution that learns global descriptors tailored for finding overlapping image pairs of similar geometric nature for Structure-from-Motion (SfM). Our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters.
- Score: 10.006619357851843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image retrieval is a critical step for alleviating the quadratic complexity of image matching in unconstrained Structure-from-Motion (SfM). In this context, however, retrieval should prioritize geometrically matchable image pairs over merely semantically similar ones, a nuance that most existing deep learning-based methods, supervised with batched binary labels (overlapping vs. non-overlapping pairs), fail to capture. In this paper, we introduce SupScene, a novel solution that learns global descriptors tailored to finding overlapping image pairs of similar geometric nature for SfM. First, to better emphasize co-visible regions, we employ a subgraph-based training strategy that moves beyond equally weighted isolated pairs, leveraging ground-truth geometric overlap relationships with varying weights to provide fine-grained supervision via a soft supervised contrastive loss. Second, we introduce DiVLAD, a DINO-inspired VLAD aggregator that exploits the inherent multi-head attention maps from the last block of a ViT. A learnable gating mechanism then adaptively combines these semantically salient cues with visual features, yielding a more discriminative global descriptor. Extensive experiments on the GL3D dataset demonstrate that our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters. Furthermore, we show that the proposed training strategy brings consistent gains across different aggregation techniques. Code and models are available at https://anonymous.4open.science/r/SupScene-5B73.
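The soft supervised contrastive loss described in the abstract, where each pair contributes in proportion to its ground-truth overlap rather than a hard positive/negative label, could look roughly like the following sketch. This is a minimal illustration under assumed conventions (function name, weighting scheme, and normalization are not from the paper):

```python
import numpy as np

def soft_supcon_loss(desc, overlap, temperature=0.07):
    """Soft supervised contrastive loss (sketch, not the authors' code).

    desc:    (N, D) array of global descriptors.
    overlap: (N, N) ground-truth overlap weights in [0, 1];
             overlap[i, j] > 0 means images i and j co-observe scene content.
    Each candidate pair is weighted by its overlap instead of being treated
    as an equally important positive or negative.
    """
    desc = desc / np.linalg.norm(desc, axis=1, keepdims=True)  # L2-normalize
    sim = desc @ desc.T / temperature                          # cosine similarities
    np.fill_diagonal(sim, -np.inf)                             # exclude self-pairs
    # Row-wise log-softmax over all candidates for each anchor.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    w = overlap.astype(float).copy()
    np.fill_diagonal(w, 0.0)
    row_sum = w.sum(axis=1, keepdims=True)
    w = np.divide(w, row_sum, out=np.zeros_like(w), where=row_sum > 0)
    # Anchors with no overlapping partner contribute zero loss.
    per_anchor = -(w * np.where(np.isfinite(log_prob), log_prob, 0.0)).sum(axis=1)
    return per_anchor.mean()
```

With binary 0/1 overlap weights this reduces to an ordinary supervised contrastive loss; fractional weights give the fine-grained supervision the abstract describes.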
Related papers
- Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation [22.080075025365208]
Missing views remain a significant challenge in graph-based multi-view semi-supervised learning. We propose a novel incomplete multi-view semi-supervised learning method, termed AGF-TI.
arXiv Detail & Related papers (2025-09-19T13:12:41Z)
- Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval [11.20814404187967]
We propose a framework that enables bi-directional learning between dense and sparse representations through Self-Knowledge Distillation. This bi-directional learning is achieved using an integrated similarity score, a weighted sum of dense and sparse similarities, which serves as a shared teacher signal for both representations. Experiments on MSCOCO and Flickr30k demonstrate that our sparse retriever not only outperforms existing sparse baselines, but also achieves performance comparable to, or even surpassing, its dense counterparts.
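The integrated similarity score mentioned above, a weighted sum of dense and sparse similarities used as a shared teacher signal, can be sketched as follows (a hedged illustration; the function name, shapes, and weighting parameter are assumptions, not that paper's implementation):

```python
import numpy as np

def integrated_similarity(dense_q, dense_d, sparse_q, sparse_d, alpha=0.5):
    """Weighted sum of dense and sparse query-document similarities (sketch).

    dense_*:  (D,) L2-normalized dense embeddings (dot product = cosine sim).
    sparse_*: (V,) non-negative sparse term-weight vectors (dot product).
    alpha balances the two; the combined score can serve as a shared
    teacher signal when distilling both representations jointly.
    """
    dense_sim = float(dense_q @ dense_d)
    sparse_sim = float(sparse_q @ sparse_d)
    return alpha * dense_sim + (1.0 - alpha) * sparse_sim
```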
arXiv Detail & Related papers (2025-08-22T13:25:58Z)
- Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention [59.19580789952102]
This paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. It also utilizes a Cross-Teacher-Student attention mechanism to guide the student network toward more discriminative feature representations.
arXiv Detail & Related papers (2025-01-18T11:57:20Z)
- SMLNet: A SPD Manifold Learning Network for Infrared and Visible Image Fusion [60.18614468818683]
We propose a novel SPD (symmetric positive definite) manifold learning framework for multi-modal image fusion. Our framework exhibits superior performance compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-11-16T03:09:49Z)
- Hyperbolic Image-and-Pointcloud Contrastive Learning for 3D Classification [14.439996427728483]
We propose a hyperbolic image-and-pointcloud contrastive learning method (HyperIPC).
For the intra-modal branch, we rely on the intrinsic geometric structure to explore the hyperbolic embedding representation of point clouds.
For the cross-modal branch, we leverage images to guide the point cloud in establishing strong semantic hierarchical correlations.
arXiv Detail & Related papers (2024-09-24T07:13:37Z)
- Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models [55.99654128127689]
Visual Foundation Models (VFMs) are used to generate semantic labels for weakly-supervised pixel-to-point contrastive distillation. We adapt the sampling probabilities of points to address imbalances in spatial distribution and category frequency. Our approach consistently surpasses existing image-to-LiDAR contrastive distillation methods in downstream tasks.
arXiv Detail & Related papers (2024-05-23T07:48:19Z)
- Superpixel Semantics Representation and Pre-training for Vision-Language Task [11.029236633301222]
Coarse-grained semantic interactions in image space should not be ignored.
This paper proposes superpixels as comprehensive and robust visual primitives.
It allows parsing the entire image as a fine-to-coarse visual hierarchy.
arXiv Detail & Related papers (2023-10-20T12:26:04Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- A Dual-branch Self-supervised Representation Learning Framework for Tumour Segmentation in Whole Slide Images [12.961686610789416]
Self-supervised learning (SSL) has emerged as an alternative solution to reduce the annotation overheads in whole slide images.
These SSL approaches are not designed for handling multi-resolution WSIs, which limits their performance in learning discriminative image features.
We propose a Dual-branch SSL Framework for WSI tumour segmentation (DSF-WSI) that can effectively learn image features from multi-resolution WSIs.
arXiv Detail & Related papers (2023-03-20T10:57:28Z)
- Digging Into Self-Supervised Learning of Feature Descriptors [14.47046413243358]
We propose a set of improvements that, combined, lead to powerful feature descriptors.
We show that increasing the search space from in-pair to in-batch for hard negative mining brings consistent improvement.
We demonstrate that a combination of synthetic homography transformation, color augmentation, and photorealistic image stylization produces useful representations.
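The in-batch hard negative mining mentioned above, where negatives are searched across the whole batch rather than within a single pre-assigned pair, can be sketched as follows (a minimal illustration, not that paper's implementation; names and shapes are placeholders):

```python
import numpy as np

def hardest_in_batch_negatives(anchors, positives):
    """For each anchor, select the hardest negative from the whole batch.

    anchors, positives: (N, D) L2-normalized descriptor arrays, where
    positives[i] matches anchors[i]; every other positive in the batch is a
    candidate negative. Searching the full batch instead of one pre-assigned
    pair yields harder negatives for the contrastive loss.
    """
    sim = anchors @ positives.T        # (N, N) similarity matrix
    np.fill_diagonal(sim, -np.inf)     # mask out the true matches
    return sim.argmax(axis=1)          # index of hardest negative per anchor
```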
arXiv Detail & Related papers (2021-10-10T12:22:44Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.