Learning Cross-view Visual Geo-localization without Ground Truth
- URL: http://arxiv.org/abs/2403.12702v1
- Date: Tue, 19 Mar 2024 13:01:57 GMT
- Title: Learning Cross-view Visual Geo-localization without Ground Truth
- Authors: Haoyuan Li, Chang Xu, Wen Yang, Huai Yu, Gui-Song Xia,
- Abstract summary: Cross-View Geo-Localization (CVGL) involves determining the geographical location of a query image by matching it with a corresponding GPS-tagged reference image.
Current state-of-the-art methods rely on training models with labeled paired images, incurring substantial annotation costs and training burdens.
We investigate the adaptation of frozen models for CVGL without requiring ground truth pair labels.
- Score: 48.51859322439286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-View Geo-Localization (CVGL) involves determining the geographical location of a query image by matching it with a corresponding GPS-tagged reference image. Current state-of-the-art methods predominantly rely on training models with labeled paired images, incurring substantial annotation costs and training burdens. In this study, we investigate the adaptation of frozen models for CVGL without requiring ground truth pair labels. We observe that training on unlabeled cross-view images presents significant challenges, including the need to establish relationships within unlabeled data and reconcile view discrepancies between uncertain queries and references. To address these challenges, we propose a self-supervised learning framework to train a learnable adapter for a frozen Foundation Model (FM). This adapter is designed to map feature distributions from diverse views into a uniform space using unlabeled data exclusively. To establish relationships within unlabeled data, we introduce an Expectation-Maximization-based Pseudo-labeling module, which iteratively estimates associations between cross-view features and optimizes the adapter. To maintain the robustness of the FM's representation, we incorporate an information consistency module with a reconstruction loss, ensuring that adapted features retain strong discriminative ability across views. Experimental results demonstrate that our proposed method achieves significant improvements over vanilla FMs and competitive accuracy compared to supervised methods, while necessitating fewer training parameters and relying solely on unlabeled data. Evaluation of our adaptation for task-specific models further highlights its broad applicability.
Related papers
- ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision.
This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline.
Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z) - Robust Pseudo-label Learning with Neighbor Relation for Unsupervised Visible-Infrared Person Re-Identification [33.50249784731248]
Unsupervised Visible-Infrared Person Re-identification (USVI-ReID) aims to match pedestrian images across visible and infrared modalities without any annotations.
Recently, clustered pseudo-label methods have become predominant in USVI-ReID, although the inherent noise in pseudo-labels presents a significant obstacle.
We design a Robust Pseudo-label Learning with Neighbor Relation (RPNR) framework to correct noisy pseudo-labels.
Comprehensive experiments conducted on two widely recognized benchmarks, SYSU-MM01 and RegDB, demonstrate that RPNR outperforms the current state-of-the-art GUR with an average
arXiv Detail & Related papers (2024-05-09T08:17:06Z) - Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization [28.941724648519102]
This paper investigates the effective utilization of unlabeled data for large-area cross-view geo-localization (CVGL)
Common approaches to CVGL rely on ground-satellite image pairs and employ label-driven supervised training.
We propose an unsupervised framework including a cross-view projection to guide the model for retrieving initial pseudo-labels.
arXiv Detail & Related papers (2024-03-21T07:48:35Z) - Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a textbfForgery-aware textbfAdaptive textbfVision textbfTransformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z) - Unsupervised Continual Semantic Adaptation through Neural Rendering [32.099350613956716]
We study continual multi-scene adaptation for the task of semantic segmentation.
We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model.
We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
arXiv Detail & Related papers (2022-11-25T09:31:41Z) - Adaptive Graph-Based Feature Normalization for Facial Expression
Recognition [1.2246649738388389]
We propose an Adaptive Graph-based Feature Normalization (AGFN) method to protect Facial Expression Recognition models from data uncertainties.
Our method outperforms state-of-the-art works with accuracies of 91.84% and 91.11% on benchmark datasets.
arXiv Detail & Related papers (2022-07-22T14:57:56Z) - Imposing Consistency for Optical Flow Estimation [73.53204596544472]
Imposing consistency through proxy tasks has been shown to enhance data-driven learning.
This paper introduces novel and effective consistency strategies for optical flow estimation.
arXiv Detail & Related papers (2022-04-14T22:58:30Z) - Unpaired Referring Expression Grounding via Bidirectional Cross-Modal
Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z) - Information Symmetry Matters: A Modal-Alternating Propagation Network
for Few-Shot Learning [118.45388912229494]
We propose a Modal-Alternating Propagation Network (MAP-Net) to supplement the absent semantic information of unlabeled samples.
We design a Relation Guidance (RG) strategy to guide the visual relation vectors via semantics so that the propagated information is more beneficial.
Our proposed method achieves promising performance and outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2021-09-03T03:43:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.