Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery
- URL: http://arxiv.org/abs/2512.07276v1
- Date: Mon, 08 Dec 2025 08:16:14 GMT
- Title: Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery
- Authors: Mai Tsujimoto, Junjue Wang, Weihao Xuan, Naoto Yokoya
- Abstract summary: Geo3DVQA is a benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Three-dimensional geospatial analysis is critical to applications in urban planning, climate adaptation, and environmental assessment. Current methodologies depend on costly, specialized sensors (e.g., LiDAR and multispectral), which restrict global accessibility. Existing sensor-based and rule-driven methods further struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We present Geo3DVQA, a comprehensive benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning using RGB-only remote sensing imagery. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns. The benchmark encompasses 110k curated question-answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis. The evaluation of ten state-of-the-art VLMs highlights the difficulty of RGB-to-3D reasoning. GPT-4o and Gemini-2.5-Flash achieved only 28.6% and 33.0% accuracy, respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% (+24.8 points). These results reveal both the limitations of current VLMs and the effectiveness of domain adaptation. Geo3DVQA introduces new challenge frontiers for scalable, accessible, and holistic 3D geospatial analysis. The dataset and code will be released upon publication at https://github.com/mm1129/Geo3DVQA.
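The headline numbers above (28.6%, 33.0%, 49.6%) are accuracies over curated question-answer pairs, reported across task categories. As a minimal illustration of how such benchmark scoring works, the sketch below computes exact-match accuracy overall and per category; the function name and record fields are assumptions for illustration, not the released Geo3DVQA schema or API.

```python
# Hypothetical sketch of exact-match accuracy scoring for a VQA benchmark.
# The record fields ("question", "answer", "category") are illustrative
# assumptions, not the actual Geo3DVQA data format.

def exact_match_accuracy(examples, predict):
    """Score a predictor by exact-match accuracy, overall and per category."""
    per_category = {}
    correct = 0
    for ex in examples:
        pred = predict(ex["question"]).strip().lower()
        gold = ex["answer"].strip().lower()
        hit = pred == gold
        correct += hit
        cat = ex.get("category", "all")
        n_ok, n_tot = per_category.get(cat, (0, 0))
        per_category[cat] = (n_ok + hit, n_tot + 1)
    overall = correct / len(examples)
    return overall, {c: ok / tot for c, (ok, tot) in per_category.items()}

# Toy usage with a dummy predictor that always answers "yes".
examples = [
    {"question": "Is building A taller than building B?",
     "answer": "yes", "category": "single-feature"},
    {"question": "Which zone has the higher sky view factor?",
     "answer": "zone 2", "category": "multi-feature"},
]
overall, by_cat = exact_match_accuracy(examples, lambda q: "yes")
print(overall)  # 0.5
```

Per-category breakdowns like `by_cat` are what allow a benchmark to separate single-feature inference from multi-feature reasoning, as the three complexity levels above do.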
Related papers
- GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving [55.14836667214487]
GeoFocus is a novel framework comprising two core modules.
GeoFocus achieves a 4.7% accuracy improvement over leading specialized models.
It demonstrates superior robustness in MATHVERSE under diverse visual conditions.
arXiv Detail & Related papers (2026-02-09T11:15:01Z)
- 3dSAGER: Geospatial Entity Resolution over 3D Objects (Technical Report) [7.378893412842889]
3dSAGER is an end-to-end pipeline for geospatial entity resolution over 3D objects.
We present a novel, spatial-reference-independent featurization mechanism that captures intricate geometric characteristics of matching pairs.
We also propose a new lightweight and interpretable blocking method, BKAFI, that leverages a trained model to efficiently generate high-recall candidate sets.
arXiv Detail & Related papers (2025-11-09T09:35:45Z)
- Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning [7.670666668651702]
We introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in Track 3 of the 2025 AI City Challenge.
Our approach enhances spatial comprehension by embedding mask dimensions in the form of bounding box coordinates directly into the input prompts.
Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard.
arXiv Detail & Related papers (2025-10-13T22:51:20Z)
- Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated.
We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z)
- GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields [25.969442927216893]
GeoProg3D is a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes.
Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF.
Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks.
arXiv Detail & Related papers (2025-06-29T18:03:03Z)
- EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation [50.433911327489554]
We introduce EarthMapper, a novel framework for controllable satellite-map translation.
We also contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities.
Experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance.
arXiv Detail & Related papers (2025-04-28T02:41:12Z)
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric.
We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z)
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks.
Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales.
We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z)
- Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
arXiv Detail & Related papers (2024-08-29T16:05:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.