Cross-view Localization and Synthesis -- Datasets, Challenges and Opportunities
- URL: http://arxiv.org/abs/2510.22736v2
- Date: Thu, 30 Oct 2025 18:07:25 GMT
- Title: Cross-view Localization and Synthesis -- Datasets, Challenges and Opportunities
- Authors: Ningli Xu, Rongjun Qin
- Abstract summary: Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches.
- Score: 12.433321159554525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deal with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images from information provided by overhead imagery, while cross-view synthesis seeks to generate ground-level images from information in the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are pervasive in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem in which ground-level features are matched against features of tiled overhead images, with convolutional neural networks (CNNs) or vision transformers (ViTs) used for cross-view feature embedding. Cross-view synthesis, on the other hand, seeks to generate ground-level views from overhead imagery, generally using generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. The project page is available at https://github.com/GDAOSU/Awesome-Cross-View-Methods.
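The retrieval formulation described in the abstract can be sketched in a few lines: match a ground-level embedding against embeddings of tiled overhead images by cosine similarity. This is a minimal illustration assuming precomputed embeddings (a real system would extract them with a CNN or ViT); the function and variable names are illustrative, not from the survey.

```python
import numpy as np

def retrieve_tile(ground_emb: np.ndarray, tile_embs: np.ndarray) -> int:
    """Return the index of the overhead tile whose embedding is most
    similar (by cosine similarity) to the ground-level embedding."""
    ground = ground_emb / np.linalg.norm(ground_emb)
    tiles = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    scores = tiles @ ground  # cosine similarities, shape (num_tiles,)
    return int(np.argmax(scores))

# Toy usage: three 4-D tile embeddings; the query is closest to tile 2.
tiles = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
query = np.array([0.1, 0.2, 0.9, 0.0])
print(retrieve_tile(query, tiles))  # -> 2
```

In practice the tile database is large, so the dot products are computed in batch (as above) or served by an approximate nearest-neighbor index rather than a linear scan.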
Related papers
- Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis [48.945931374180795]
This paper presents a novel approach for cross-view synthesis aimed at generating plausible ground-level images from corresponding satellite imagery or vice versa. We refer to these tasks as satellite-to-ground (Sat2Grd) and ground-to-satellite (Grd2Sat) synthesis, respectively.
arXiv Detail & Related papers (2024-12-04T13:47:51Z)
- Retrieval-guided Cross-view Image Synthesis [3.7477511412024573]
Cross-view image synthesis presents significant challenges in establishing reliable correspondences. We propose a retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis. Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross-domain synthesis challenges.
arXiv Detail & Related papers (2024-11-29T07:04:44Z)
- CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis [54.852701978617056]
CrossViewDiff is a cross-view diffusion model for satellite-to-street view synthesis.
To address the challenges posed by the large discrepancy across views, we design the satellite scene structure estimation and cross-view texture mapping modules.
To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method.
arXiv Detail & Related papers (2024-08-27T03:41:44Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged, targeting high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Inspired by this, we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- Cross-view Self-localization from Synthesized Scene-graphs [1.9580473532948401]
Cross-view self-localization is a challenging scenario of visual place recognition in which database images are provided from sparse viewpoints.
We propose a new hybrid scene model that combines the advantages of view-invariant appearance features computed from raw images and view-dependent spatial-semantic features computed from synthesized images.
arXiv Detail & Related papers (2023-10-24T04:16:27Z)
- Multi-Spectral Image Stitching via Spatial Graph Reasoning [52.27796682972484]
We propose a spatial graph reasoning based multi-spectral image stitching method.
We embed multi-scale complementary features from the same view position into a set of nodes.
By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features.
arXiv Detail & Related papers (2023-07-31T15:04:52Z)
- CVLNet: Cross-View Semantic Correspondence Learning for Video-based Camera Localization [89.69214577915959]
This paper tackles the problem of Cross-view Video-based camera localization.
We propose estimating the query camera's relative displacement to a satellite image before similarity matching.
Experiments have demonstrated the effectiveness of video-based localization over single image-based localization.
arXiv Detail & Related papers (2022-08-07T07:35:17Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- City-Scale Visual Place Recognition with Deep Local Features Based on Multi-Scale Ordered VLAD Pooling [5.274399407597545]
We present a fully-automated system for place recognition at a city-scale based on content-based image retrieval.
Firstly, we take a comprehensive analysis of visual place recognition and sketch out the unique challenges of the task.
Next, we propose a simple pooling approach on top of convolutional neural network activations to embed spatial information into the image representation vector.
arXiv Detail & Related papers (2020-09-19T15:21:59Z)
- Cross-View Image Retrieval -- Ground to Aerial Image Retrieval through Deep Learning [3.326320568999945]
We present a novel cross-modal retrieval method specifically for multi-view images, called Cross-view Image Retrieval (CVIR).
Our approach aims to find a feature space as well as an embedding space in which samples from street-view images are compared directly to satellite-view images.
For this comparison, a novel deep-metric-learning-based solution, "DeepCVIR", is proposed.
arXiv Detail & Related papers (2020-05-02T06:52:16Z)
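Deep metric learning methods such as the DeepCVIR solution mentioned above typically train the embedding space with a margin-based objective. The sketch below shows a standard triplet margin loss in NumPy; this is the generic technique, not the exact loss used in that paper, and all names here are illustrative.

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 0.5) -> float:
    """Pull the matching (anchor, positive) pair together and push the
    non-matching negative at least `margin` further away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to matching view
    d_neg = np.linalg.norm(anchor - negative)  # distance to non-matching view
    return max(0.0, float(d_pos - d_neg + margin))

# Toy usage: a well-separated negative incurs zero loss.
a = np.array([0.0, 0.0])   # e.g. street-view embedding
p = np.array([0.1, 0.0])   # matching satellite-view embedding
n = np.array([2.0, 0.0])   # non-matching satellite-view embedding
print(triplet_loss(a, p, n))  # -> 0.0
```

Minimizing this loss over many triplets shapes a feature space in which street-view and satellite-view embeddings of the same location can be compared directly by distance, which is the comparison the CVIR setup requires.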
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences arising from its use.