MapGlue: Multimodal Remote Sensing Image Matching
- URL: http://arxiv.org/abs/2503.16185v1
- Date: Thu, 20 Mar 2025 14:36:16 GMT
- Title: MapGlue: Multimodal Remote Sensing Image Matching
- Authors: Peihao Wu, Yongxiang Yao, Wenfei Zhang, Dong Wei, Yi Wan, Yansheng Li, Yongjun Zhang
- Abstract summary: Multimodal remote sensing image (MRSI) matching is pivotal for cross-modal fusion, localization, and object detection. Existing unimodal datasets lack scale and diversity, limiting deep learning solutions. This paper proposes MapGlue, a universal MRSI matching framework, and MapData, a large-scale multimodal dataset addressing these gaps.
- Score: 12.376931699274062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal remote sensing image (MRSI) matching is pivotal for cross-modal fusion, localization, and object detection, but it faces severe challenges due to geometric, radiometric, and viewpoint discrepancies across imaging modalities. Existing unimodal datasets lack scale and diversity, limiting deep learning solutions. This paper proposes MapGlue, a universal MRSI matching framework, and MapData, a large-scale multimodal dataset addressing these gaps. Our contributions are twofold. First, MapData, a globally diverse dataset spanning 233 sampling points, offers original images (7,000x5,000 to 20,000x15,000 pixels). After rigorous cleaning, it provides 121,781 aligned electronic map-visible image pairs (512x512 pixels) with hybrid manual-automated ground truth, addressing the scarcity of scalable multimodal benchmarks. Second, MapGlue integrates semantic context with a dual graph-guided mechanism to extract cross-modal invariant features. This structure enables global-to-local interaction, enhancing descriptor robustness against modality-specific distortions. Extensive evaluations on MapData and five public datasets demonstrate MapGlue's superiority in matching accuracy under complex conditions, outperforming state-of-the-art methods. Notably, MapGlue generalizes effectively to unseen modalities without retraining, highlighting its adaptability. This work addresses longstanding challenges in MRSI matching by combining scalable dataset construction with a robust, semantics-driven framework. The dataset and code are available at https://github.com/PeihaoWu/MapGlue.
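At inference time, descriptor-based MRSI matching pipelines of this kind typically reduce to comparing the two images' descriptor sets and keeping mutual nearest neighbors. A minimal sketch of that final step (NumPy; the function and the assumption of L2-normalized descriptors are illustrative, not MapGlue's actual API):

```python
import numpy as np

def mutual_nearest_neighbors(desc_a: np.ndarray, desc_b: np.ndarray) -> np.ndarray:
    """Match two sets of L2-normalized descriptors (N_a x D and N_b x D)
    by mutual nearest neighbor under cosine similarity.
    Returns an (n_matches, 2) array of (index_a, index_b) pairs."""
    sim = desc_a @ desc_b.T          # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)       # best match in b for each a
    nn_ba = sim.argmax(axis=0)       # best match in a for each b
    idx_a = np.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a   # keep only mutually-best pairs
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)

# Toy usage: descriptors for a 512x512 map tile vs. a visible-image tile
rng = np.random.default_rng(0)
da = rng.normal(size=(100, 256)); da /= np.linalg.norm(da, axis=1, keepdims=True)
db = rng.normal(size=(120, 256)); db /= np.linalg.norm(db, axis=1, keepdims=True)
print(mutual_nearest_neighbors(da, db).shape)  # (n_matches, 2)
```

MapGlue's dual graph-guided, global-to-local interaction happens upstream of this step, when the descriptors themselves are produced.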
Related papers
- MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases. We scale up the modalities from cheap but rich RGB-only matching data by means of generative models. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability.
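The "randomly selected modality pairs" idea reduces to a simple sampler over co-registered renderings of the same scene; a hedged sketch (the modality list and file layout are hypothetical, not MD-syn's actual structure):

```python
import random

# Hypothetical modality set for a synthetic multimodal dataset
MODALITIES = ["rgb", "infrared", "sar", "depth", "map", "sketch"]

def sample_training_pair(scene_id: int) -> tuple[str, str]:
    """Pick two distinct modalities of the same scene for one training step."""
    mod_a, mod_b = random.sample(MODALITIES, 2)
    return f"scenes/{scene_id}/{mod_a}.png", f"scenes/{scene_id}/{mod_b}.png"

for step in range(3):
    print(sample_training_pair(scene_id=step))
```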
arXiv Detail & Related papers (2024-12-27T02:39:50Z)
- ASANet: Asymmetric Semantic Aligning Network for RGB and SAR image land cover classification [5.863175733097434]
We propose a novel architecture, named the Asymmetric Semantic Aligning Network (ASANet), to address the issue of asymmetry at the feature level.
The proposed ASANet effectively learns feature correlations between the two modalities and eliminates noise caused by feature differences.
We have established a new RGB-SAR multimodal dataset, on which our ASANet outperforms other mainstream methods with improvements ranging from 1.21% to 17.69%.
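Feature-level alignment between two modalities is often implemented with cross-attention, where one branch queries the other; a minimal sketch in that spirit (PyTorch; illustrative only, not ASANet's actual architecture):

```python
import torch
import torch.nn as nn

class CrossModalAlign(nn.Module):
    """Toy feature-level alignment: RGB tokens query SAR tokens via
    multi-head cross-attention, with a residual connection."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feat: torch.Tensor, sar_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, sar_feat: (batch, tokens, dim) flattened feature maps
        aligned, _ = self.attn(query=rgb_feat, key=sar_feat, value=sar_feat)
        return self.norm(rgb_feat + aligned)

x_rgb, x_sar = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
print(CrossModalAlign()(x_rgb, x_sar).shape)  # torch.Size([2, 64, 256])
```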
arXiv Detail & Related papers (2024-12-03T00:03:33Z)
- Maps from Motion (MfM): Generating 2D Semantic Maps from Sparse Multi-view Images [17.992488467380923]
OpenStreetMap is the result of 11 million registered users manually annotating the GPS location of over 1.75 billion entries.
At the same time, manual annotations can include errors and are slow to update, limiting the map's accuracy.
Maps from Motion (MfM) is a step toward automating this time-consuming map-making procedure by computing 2D maps of semantic objects directly from a collection of uncalibrated multi-view images.
arXiv Detail & Related papers (2024-11-19T16:27:31Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
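Constructing instruction-tuning data from existing annotations can be as simple as templating over ground-truth boxes; a hedged sketch (the template and field names are hypothetical, not the paper's pipeline):

```python
def bbox_to_instruction(ann: dict, image_w: int, image_h: int) -> dict:
    """Turn one object annotation into a referential instruction example.
    Box coordinates are normalized to [0, 1]."""
    x, y, w, h = ann["bbox"]
    box = [round(x / image_w, 3), round(y / image_h, 3),
           round((x + w) / image_w, 3), round((y + h) / image_h, 3)]
    return {
        "instruction": f"What object is located at {box}?",
        "response": ann["category_name"],
    }

ann = {"bbox": [48, 32, 120, 80], "category_name": "dog"}
print(bbox_to_instruction(ann, image_w=640, image_h=480))
```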
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
- Unpaired Image Super-Resolution with Optimal Transport Maps [128.1189695209663]
Real-world image super-resolution (SR) tasks often lack paired datasets, limiting the application of supervised techniques.
We propose an algorithm for unpaired SR which learns an unbiased optimal transport (OT) map for the perceptual transport cost.
Our algorithm provides nearly state-of-the-art performance on the large-scale unpaired AIM-19 dataset.
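For reference, the underlying object is a Monge-style optimal transport map; in textbook notation (the paper replaces the cost c with a perceptual one, and this is not necessarily its exact objective):

```latex
% Find a map T pushing the source distribution P (low-resolution images)
% onto the target distribution Q (high-resolution images) at minimal cost
T^{\ast} \in \arg\min_{T:\, T_{\#}P = Q} \; \mathbb{E}_{x \sim P}\!\left[ c\big(x, T(x)\big) \right]
```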
arXiv Detail & Related papers (2022-02-02T16:21:20Z)
- Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification [114.56752624945142]
We argue that the most popular random sampling method, the well-known PK sampler, is neither informative nor efficient for deep metric learning.
We propose an efficient mini-batch sampling method called Graph Sampling (GS) for large-scale metric learning.
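The core of graph-based class sampling is to batch together classes that are close in embedding space rather than uniformly random ones; a minimal sketch (the nearest-neighbor details are assumptions, not the paper's exact procedure):

```python
import numpy as np

def graph_sampling_batch(class_centers: np.ndarray, classes_per_batch: int,
                         rng: np.random.Generator) -> np.ndarray:
    """Pick a random anchor class, then its nearest classes in embedding
    space, to form one informative mini-batch of classes."""
    anchor = rng.integers(class_centers.shape[0])
    dist = np.linalg.norm(class_centers - class_centers[anchor], axis=1)
    return np.argsort(dist)[:classes_per_batch]  # anchor itself is at distance 0

rng = np.random.default_rng(0)
centers = rng.normal(size=(1000, 128))  # e.g. mean embedding per identity
print(graph_sampling_batch(centers, classes_per_batch=16, rng=rng))
```

In practice the class centers (and hence the graph) would be recomputed periodically as the embedding network trains.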
arXiv Detail & Related papers (2021-04-04T06:44:15Z)
- Generating Multi-scale Maps from Remote Sensing Images via Series Generative Adversarial Networks [12.34648824166359]
We develop a series strategy of generators for multi-scale rs2map translation.
High-resolution RSIs are fed into an rs2map model to produce large-scale maps.
Experiments show that the series strategy yields higher-quality multi-scale map generation.
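The series strategy amounts to chaining generators so each map scale is translated from the previous one rather than from the RSI directly; a hedged sketch (zoom levels and the placeholder generator are assumptions):

```python
import torch
import torch.nn as nn

def make_generator() -> nn.Module:
    # Placeholder: any image-to-image translation network would do here.
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, padding=1))

g_rs2map, g_18_to_17, g_17_to_16 = (make_generator() for _ in range(3))

rsi = torch.randn(1, 3, 256, 256)   # high-resolution remote sensing image
map_z18 = g_rs2map(rsi)             # largest-scale map, from the RSI
map_z17 = g_18_to_17(map_z18)       # each coarser scale from the previous map
map_z16 = g_17_to_16(map_z17)
print(map_z16.shape)                # torch.Size([1, 3, 256, 256])
```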
arXiv Detail & Related papers (2021-03-31T08:58:37Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
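A symmetric gated fusion of two feature branches can be sketched as each branch contributing through a learned gate computed from both; illustrative only, not ACMNet's exact module:

```python
import torch
import torch.nn as nn

class SymmetricGatedFusion(nn.Module):
    """Toy symmetric gated fusion: gates for both branches are computed
    from the concatenated features, then applied elementwise."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([feat_a, feat_b], dim=1)
        return self.gate_a(cat) * feat_a + self.gate_b(cat) * feat_b

f_img = torch.randn(1, 64, 32, 32)  # image-branch features
f_dep = torch.randn(1, 64, 32, 32)  # sparse-depth-branch features
print(SymmetricGatedFusion(64)(f_img, f_dep).shape)  # torch.Size([1, 64, 32, 32])
```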
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- X-ModalNet: A Semi-Supervised Deep Cross-Modal Network for Classification of Remote Sensing Data [69.37597254841052]
We propose a novel cross-modal deep-learning framework called X-ModalNet.
X-ModalNet generalizes well, owing to label propagation on an updatable graph constructed from high-level features at the top of the network.
We evaluate X-ModalNet on two multi-modal remote sensing datasets (HSI-MSI and HSI-SAR) and achieve a significant improvement in comparison with several state-of-the-art methods.
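Label propagation on a graph, the mechanism credited for X-ModalNet's generalization, can be sketched as iterated diffusion with labeled nodes clamped (a generic formulation, not the paper's updatable-graph variant):

```python
import numpy as np

def propagate_labels(W: np.ndarray, labels: np.ndarray, mask: np.ndarray,
                     iters: int = 20) -> np.ndarray:
    """W: (N, N) affinity matrix; labels: (N, C) one-hot rows for labeled
    nodes, zeros elsewhere; mask: (N,) True where the label is known."""
    P = W / W.sum(axis=1, keepdims=True)  # row-normalize to a transition matrix
    F = labels.copy()
    for _ in range(iters):
        F = P @ F
        F[mask] = labels[mask]            # clamp the known labels each step
    return F.argmax(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))              # stand-in for high-level features
W = np.exp(-np.square(np.linalg.norm(X[:, None] - X[None], axis=-1)))
labels = np.zeros((50, 3)); labels[0, 0] = labels[1, 1] = labels[2, 2] = 1.0
mask = np.zeros(50, dtype=bool); mask[:3] = True
print(propagate_labels(W, labels, mask)[:10])
```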
arXiv Detail & Related papers (2020-06-24T15:29:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.