Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization
- URL: http://arxiv.org/abs/2502.11381v2
- Date: Tue, 01 Apr 2025 03:44:00 GMT
- Title: Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization
- Authors: Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong
- Abstract summary: UAV-View Geo-Localization (UVGL) aims to achieve accurate localization of unmanned aerial vehicles (UAVs) by retrieving the most relevant GPS-tagged satellite images. Existing methods heavily rely on pre-paired UAV-satellite images for supervised learning. We propose an end-to-end self-supervised UVGL method to overcome these limitations.
- Score: 2.733505168507872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: UAV-View Geo-Localization (UVGL) aims to achieve accurate localization of unmanned aerial vehicles (UAVs) by retrieving the most relevant GPS-tagged satellite images. However, existing methods heavily rely on pre-paired UAV-satellite images for supervised learning. Such dependency not only incurs high annotation costs but also severely limits scalability and practical deployment in open-world UVGL scenarios. To address these limitations, we propose an end-to-end self-supervised UVGL method. Our method leverages a shallow backbone network to extract initial features, employs clustering to generate pseudo labels, and adopts a dual-path contrastive learning architecture to learn discriminative intra-view representations. Furthermore, our method incorporates two core modules, the dynamic hierarchical memory learning module and the information consistency evolution learning module. The dynamic hierarchical memory learning module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the information consistency evolution learning module leverages a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, thereby improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced, which refines the quality of pseudo supervision. Our method ultimately constructs a unified cross-view feature representation space under self-supervised settings. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.
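The intra-view part of the pipeline described in the abstract (cluster features into pseudo labels, then learn against a memory of cluster prototypes) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; their code is at the repository linked above. Everything here is an assumption made for illustration: plain k-means stands in for the unspecified clustering step, a single momentum-updated prototype memory stands in for the dynamic hierarchical memory, the loss is a generic InfoNCE-style objective, and `make_view` produces synthetic features in place of backbone outputs. All function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def kmeans_pseudo_labels(features, k, iters=20, seed=0):
    """Plain k-means (an assumed stand-in for the paper's clustering step)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    # Relabel to contiguous ids in case some cluster ended up empty.
    return np.unique(labels, return_inverse=True)[1]


def build_memory(features, labels):
    """One unit-norm prototype per pseudo class: the mean of its members."""
    k = labels.max() + 1
    protos = np.stack([features[labels == c].mean(axis=0) for c in range(k)])
    return l2_normalize(protos)


def update_memory(memory, feats, labels, momentum=0.9):
    """Momentum update of the prototypes, performed in place."""
    for f, y in zip(feats, labels):
        memory[y] = l2_normalize(momentum * memory[y] + (1.0 - momentum) * f)
    return memory


def memory_contrastive_loss(feats, labels, memory, temperature=0.05):
    """InfoNCE-style loss pulling each feature toward its own prototype."""
    logits = l2_normalize(feats) @ memory.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())


def make_view(seed, n_per_class=20, noise=0.05):
    """Synthetic stand-in for one view's backbone features (two scene classes)."""
    rng = np.random.default_rng(seed)
    centers = np.array([[1.0, 0.0], [0.0, 1.0]])
    feats = np.concatenate(
        [c + noise * rng.normal(size=(n_per_class, 2)) for c in centers]
    )
    return l2_normalize(feats)


if __name__ == "__main__":
    # "Dual path" here simply means each view gets its own labels and memory.
    for name, seed in [("uav", 0), ("satellite", 1)]:
        feats = make_view(seed)
        labels = kmeans_pseudo_labels(feats, k=2)
        memory = build_memory(feats, labels)
        loss = memory_contrastive_loss(feats, labels, memory)
        update_memory(memory, feats, labels)
        print(name, "intra-view loss:", loss)
```

In this sketch the two views never interact. The cross-view alignment that the abstract attributes to the information consistency evolution learning module, and the pseudo-label enhancement strategy, are deliberately left out because the abstract does not specify them in enough detail to sketch.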
Related papers
- Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID [82.12123628480371]
Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning.
Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning.
We propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up an objective for the specific fine-grained patterns emphasized by each modality.
arXiv Detail & Related papers (2025-04-27T13:58:12Z) - Precise GPS-Denied UAV Self-Positioning via Context-Enhanced Cross-View Geo-Localization [10.429391988135345]
We propose the Context-Enhanced method for precise UAV Self-Positioning (CEUSP), specifically designed for UAV self-positioning tasks. CEUSP integrates a Dynamic Sampling Strategy (DSS) to efficiently select optimal negative samples, while the Rubik's Cube Attention (RCA) module, combined with the Context-Aware Channel Integration (CACI) module, enhances feature representation and discrimination. Our approach achieves state-of-the-art performance on the DenseUAV dataset, which is specifically designed for dense urban contexts.
arXiv Detail & Related papers (2025-02-17T03:49:18Z) - Extended Cross-Modality United Learning for Unsupervised Visible-Infrared Person Re-identification [34.93081601924748]
Unsupervised learning aims to learn modality-invariant features from unlabeled cross-modality datasets.
Existing methods lack cross-modality clustering or excessively pursue cluster-level association.
We propose the Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules.
arXiv Detail & Related papers (2024-12-26T09:30:26Z) - SaliencyI2PLoc: saliency-guided image-point cloud localization using contrastive learning [17.29563451509921]
SaliencyI2PLoc is a contrastive learning architecture that fuses the saliency map into feature aggregation. Our method achieves a Recall@1 of 78.92% and a Recall@20 of 97.59% on the urban scenario evaluation dataset.
arXiv Detail & Related papers (2024-12-20T05:20:10Z) - MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning [8.61492882526007]
In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges for sample efficiency.
We introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking.
Our evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency.
arXiv Detail & Related papers (2024-09-02T18:57:53Z) - Into the Unknown: Generating Geospatial Descriptions for New Environments [18.736071151303726]
The Rendezvous task requires reasoning over allocentric spatial relationships.
Using open-source descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text.
We propose a large-scale augmentation method for generating high-quality synthetic data for new environments.
arXiv Detail & Related papers (2024-06-28T14:56:21Z) - Federated Multi-Agent Mapping for Planetary Exploration [0.4143603294943439]
We propose an approach to jointly train a centralized map model across agents without the need to share raw data.
Our approach leverages implicit neural mapping to generate parsimonious and adaptable representations.
We demonstrate the efficacy of our proposed federated mapping approach using Martian terrains and glacier datasets.
arXiv Detail & Related papers (2024-04-02T20:32:32Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new spatial-aware, map-based pre-training paradigm for use in vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z) - Co-visual pattern augmented generative transformer learning for automobile geo-localization [12.449657263683337]
Cross-view geo-localization (CVGL) aims to estimate the geographical location of the ground-level camera by matching against enormous geo-tagged aerial images.
We present a novel approach using cross-view knowledge generative techniques in combination with transformers, namely mutual generative transformer learning (MGTL) for CVGL.
arXiv Detail & Related papers (2022-03-17T07:29:02Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - Deep Attention-guided Graph Clustering with Dual Self-supervision [49.040136530379094]
We propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC).
We develop a dual self-supervision solution consisting of a soft self-supervision strategy with a triplet Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss.
Our method consistently outperforms state-of-the-art methods on six benchmark datasets.
arXiv Detail & Related papers (2021-11-10T06:53:03Z) - Clustering augmented Self-Supervised Learning: An application to Land Cover Mapping [10.720852987343896]
We introduce a new method for land cover mapping that uses a clustering-based pretext task for self-supervised learning.
We demonstrate the effectiveness of the method on two societally relevant applications.
arXiv Detail & Related papers (2021-08-16T19:35:43Z) - Trajectory Design for UAV-Based Internet-of-Things Data Collection: A Deep Reinforcement Learning Approach [93.67588414950656]
In this paper, we investigate an unmanned aerial vehicle (UAV)-assisted Internet-of-Things (IoT) system in a 3D environment.
We present a TD3-based trajectory design for completion time minimization (TD3-TDCTM) algorithm.
Our simulation results show the superiority of the proposed TD3-TDCTM algorithm over three conventional non-learning based baseline methods.
arXiv Detail & Related papers (2021-07-23T03:33:29Z) - Edge-assisted Democratized Learning Towards Federated Analytics [67.44078999945722]
We show the hierarchical learning structure of the proposed edge-assisted democratized learning mechanism, namely Edge-DemLearn.
We also validate Edge-DemLearn as a flexible model training mechanism to build a distributed control and aggregation methodology in regions.
arXiv Detail & Related papers (2020-12-01T11:46:03Z) - Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization [53.99850033746663]
We study the problem of learning a localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of a pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.