Recurrent Cross-View Object Geo-Localization
- URL: http://arxiv.org/abs/2509.12757v1
- Date: Tue, 16 Sep 2025 07:18:23 GMT
- Title: Recurrent Cross-View Object Geo-Localization
- Authors: Xiaohan Zhang, Si-Yuan Cao, Xiaokai Bai, Yiming Li, Zhangkai Shen, Zhe Wu, Xiaoxi Hu, Hui-liang Shen,
- Abstract summary: Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt.<n>We propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task.<n>ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location.
- Score: 23.685973292321574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from cross-view information aggregation, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a SAM-based knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that introduces a hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.
Related papers
- Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features [0.0]
Cross-view object geo-localization has recently gained attention due to potential applications.<n>We introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views.<n>We also create a new dataset called G2D for the "Ground-to-Drone" localization task.
arXiv Detail & Related papers (2025-10-31T03:28:59Z) - Anchor-free Cross-view Object Geo-localization with Gaussian Position Encoding and Cross-view Association [3.5982006325887554]
We propose an anchor-free formulation for cross-view object geo-localization, termed AFGeo.<n> AFGeo directly predicts the four directional offsets to the ground-truth box for each pixel localizing the object without any predefined anchors.<n>Our model is both lightweight and efficient, achieving state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2025-09-30T00:30:45Z) - Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing [55.291219073365546]
Open-Vocabulary Remote Sensing Image (OVRSIS) is an emerging task that adapts Open-Vocabulary (OVS) to the remote sensing (RS) domain.<n>textbfRSKT-Seg is a novel open-vocabulary segmentation framework tailored for remote sensing.<n> RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation.
arXiv Detail & Related papers (2025-09-15T15:24:49Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.<n>To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM)<n>To further forster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - Decoupled DETR: Spatially Disentangling Localization and Classification
for Improved End-to-End Object Detection [48.429555904690595]
We introduce spatially decoupled DETR, which includes a task-aware query generation module and a disentangled feature learning process.
We demonstrate that our approach achieves a significant improvement in MSCOCO datasets compared to previous work.
arXiv Detail & Related papers (2023-10-24T15:54:11Z) - Background Activation Suppression for Weakly Supervised Object
Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
New paradigm has emerged by generating a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Retrieval and Localization with Observation Constraints [12.010135672015704]
We propose an integrated visual re-localization method called RLOCS.
It combines image retrieval, semantic consistency and geometry verification to achieve accurate estimations.
Our method achieves many performance improvements on the challenging localization benchmarks.
arXiv Detail & Related papers (2021-08-19T06:14:33Z) - Unveiling the Potential of Structure-Preserving for Weakly Supervised
Object Localization [71.79436685992128]
We propose a two-stage approach, termed structure-preserving activation (SPA), towards fully leveraging the structure information incorporated in convolutional features for WSOL.
In the first stage, a restricted activation module (RAM) is designed to alleviate the structure-missing issue caused by the classification network.
In the second stage, we propose a post-process approach, termed self-correlation map generating (SCG) module to obtain structure-preserving localization maps.
arXiv Detail & Related papers (2021-03-08T03:04:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.