MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars
- URL: http://arxiv.org/abs/2602.13961v1
- Date: Sun, 15 Feb 2026 02:41:56 GMT
- Title: MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars
- Authors: Shuoyuan Wang, Yiran Wang, Hongxin Wei
- Abstract summary: We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures. Our evaluation shows that MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions.
- Score: 21.01507072531742
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at https://github.com/ml-stat-Sustech/MarsRetrieval
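As a rough illustration of the retrieval-centric protocol described above, the sketch below scores text-to-image retrieval with an off-the-shelf contrastive dual-tower encoder and reports Recall@K. The checkpoint, file names, and captions are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a retrieval-centric evaluation with a contrastive
# dual-tower (CLIP-style) encoder. Checkpoint, gallery, and captions are
# hypothetical placeholders, not the benchmark's actual data.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical gallery of Mars orbital image crops and their paired descriptions.
image_paths = ["crater_field.png", "dune_field.png", "outflow_channel.png"]
captions = [
    "a cluster of impact craters with degraded rims",
    "dark barchan dunes on a crater floor",
    "an outflow channel with streamlined islands",
]

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)

# Text-to-image retrieval: rank gallery images for each caption by cosine similarity.
sim = txt_emb @ img_emb.T                      # (num_texts, num_images)
ranks = sim.argsort(dim=-1, descending=True)   # best match first

def recall_at_k(ranks: torch.Tensor, k: int = 1) -> float:
    """Fraction of queries whose paired image appears in the top-k results."""
    targets = torch.arange(ranks.size(0)).unsqueeze(1)
    return (ranks[:, :k] == targets).any(dim=1).float().mean().item()

print("Recall@1:", recall_at_k(ranks, k=1))
```

In the same protocol, a generative vision-language model could replace only the embedding step; the similarity ranking and Recall@K computation stay unchanged.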
Related papers
- Natural Language-Driven Global Mapping of Martian Landforms [25.54158424879149]
MarScope is a vision-language framework enabling natural language-driven, label-free mapping of Martian landforms. It aligns planetary images and text in a shared semantic space, trained on over 200,000 curated image-text pairs. This framework transforms global geomorphic mapping on Mars by replacing pre-defined classifications with flexible semantic retrieval.
arXiv Detail & Related papers (2026-01-22T13:38:13Z)
- Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks [7.399515278460871]
A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. We introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models.
arXiv Detail & Related papers (2025-10-28T02:34:08Z)
- Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality [0.0]
Training was conducted on an augmented dataset of 12,000 Martian heightmaps derived from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. Results show that the approach consistently outperforms baseline methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity to the original data (29-81% on LPIPS); a minimal sketch of these two metrics appears after this list.
arXiv Detail & Related papers (2025-10-16T15:02:05Z)
- Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z)
- Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions [116.56517155163716]
We propose a data curation pipeline that reconstructs 3D Martian environments from real stereo navigation images. A Martian terrain video generator, MarsGen, synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
arXiv Detail & Related papers (2025-07-10T17:54:27Z)
- TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation [65.74990259650984]
We introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
arXiv Detail & Related papers (2025-06-06T17:59:50Z)
- EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation [50.433911327489554]
We introduce EarthMapper, a novel framework for controllable satellite-map translation. We also contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities. Experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance.
arXiv Detail & Related papers (2025-04-28T02:41:12Z)
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric. We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z)
- TorchSpatial: A Location Encoding Framework and Benchmark for Spatial Representation Learning [36.725822223732635]
We propose TorchSpatial, a learning framework and benchmark for location (point) encoding. TorchSpatial contains three key components: 1) a unified location encoding framework that consolidates 15 commonly recognized location encoders; 2) the LocBench benchmark, encompassing 7 geo-aware image classification and 10 geo-aware image regression datasets; and 3) a comprehensive suite of evaluation metrics that quantify a geo-aware model's overall performance as well as its geographic bias, including a novel Geo-Bias Score metric. A minimal location-encoder sketch appears after this list.
arXiv Detail & Related papers (2024-06-21T21:33:16Z)
- ConeQuest: A Benchmark for Cone Segmentation on Mars [9.036303895516745]
ConeQuest is the first expert-annotated public dataset to identify cones on Mars.
We propose two benchmark tasks using ConeQuest: (i) Spatial Generalization and (ii) Cone-size Generalization.
arXiv Detail & Related papers (2023-11-15T02:33:08Z)
- Towards Robust Monocular Visual Odometry for Flying Robots on Planetary Missions [49.79068659889639]
Ingenuity, which recently landed on Mars, will mark the beginning of a new era of exploration unhindered by traversability constraints.
We present an advanced robust monocular odometry algorithm that uses efficient optical flow tracking.
We also present a novel approach to estimate the current risk of scale drift, based on a principal component analysis of the relative translation information matrix; a sketch of this idea appears after this list.
arXiv Detail & Related papers (2021-09-12T12:52:20Z)
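For the heightmap inpainting entry above, the following sketch shows how the two reported reconstruction metrics, RMSE and LPIPS, could be computed for a single tile. The file names, the normalization to [-1, 1], and the use of the `lpips` package with an AlexNet backbone are assumptions made for illustration, not details from that paper.

```python
# Minimal sketch of RMSE and LPIPS between an original and an inpainted
# Martian heightmap tile. File names and preprocessing are hypothetical.
import numpy as np
import torch
import lpips

def to_lpips_tensor(hmap: np.ndarray) -> torch.Tensor:
    """Normalize a single-channel heightmap to [-1, 1] and repeat to 3 channels."""
    lo, hi = float(hmap.min()), float(hmap.max())
    norm = 2.0 * (hmap - lo) / max(hi - lo, 1e-8) - 1.0
    t = torch.from_numpy(norm).float()[None, None]   # (1, 1, H, W)
    return t.repeat(1, 3, 1, 1)                      # (1, 3, H, W)

original = np.load("heightmap_original.npy")         # hypothetical 128x128 tile
reconstructed = np.load("heightmap_inpainted.npy")   # hypothetical model output

rmse = float(np.sqrt(np.mean((original - reconstructed) ** 2)))

loss_fn = lpips.LPIPS(net="alex")
with torch.no_grad():
    perceptual = loss_fn(to_lpips_tensor(original),
                         to_lpips_tensor(reconstructed)).item()

print(f"RMSE: {rmse:.4f}  LPIPS: {perceptual:.4f}")
```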
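For the TorchSpatial entry above, the sketch below implements one simple kind of location (point) encoder: multi-scale sinusoidal features of longitude and latitude. This particular encoder and its frequency schedule are generic illustrations, not TorchSpatial's API.

```python
# Minimal sketch of a multi-scale sinusoidal location encoder for (lon, lat).
import numpy as np

def sinusoidal_location_encoding(lon_deg: float, lat_deg: float,
                                 num_scales: int = 4,
                                 min_wavelength: float = 1.0,
                                 max_wavelength: float = 360.0) -> np.ndarray:
    """Encode a point as sin/cos features of lon/lat at geometrically spaced scales."""
    coords = np.array([lon_deg, lat_deg], dtype=np.float64)
    # Geometric progression of wavelengths between the min and max scales.
    scales = min_wavelength * (max_wavelength / min_wavelength) ** (
        np.arange(num_scales) / max(num_scales - 1, 1)
    )
    feats = []
    for wl in scales:
        angle = 2.0 * np.pi * coords / wl
        feats.extend([np.sin(angle), np.cos(angle)])
    return np.concatenate(feats)   # shape: (2 coords * 2 trig * num_scales,)

# Example: encode the approximate center of Jezero crater on Mars.
vec = sinusoidal_location_encoding(lon_deg=77.58, lat_deg=18.38)
print(vec.shape)   # (16,)
```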
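For the monocular visual odometry entry above, the following sketch illustrates the scale-drift-risk idea: inspect the eigen-spectrum (a principal component analysis) of the 3x3 information matrix of the estimated relative translation and flag poorly constrained directions. The concrete risk score below is an illustrative assumption, not that paper's exact formulation.

```python
# Minimal sketch: eigen-analysis of a relative-translation information matrix
# as a proxy for scale-drift risk. The scoring rule is a hypothetical example.
import numpy as np

def scale_drift_risk(info_matrix: np.ndarray) -> float:
    """Return a value in (0, 1]; higher means the translation (and hence scale)
    is poorly constrained along its weakest direction."""
    # Eigenvalues of the information matrix measure how well each direction
    # of the relative translation is constrained by the feature tracks.
    eigvals = np.linalg.eigvalsh(info_matrix)          # ascending order
    weakest, strongest = eigvals[0], eigvals[-1]
    # Anisotropy of the constraint: near 1 when the weakest direction is
    # barely observed relative to the strongest one.
    return float(1.0 - max(weakest, 0.0) / max(strongest, 1e-12))

# Hypothetical information matrix accumulated from feature-track residuals.
Lambda = np.array([[420.0,  15.0,   5.0],
                   [ 15.0, 380.0,  12.0],
                   [  5.0,  12.0,   2.5]])   # weak constraint along one axis
print(f"scale drift risk: {scale_drift_risk(Lambda):.3f}")
```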
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.