Related papers: MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

URL: http://arxiv.org/abs/2602.10518v1
Date: Wed, 11 Feb 2026 04:36:14 GMT
Title: MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps
Authors: Sharat Bhat, Harshita Khandelwal, Tushar Kataria, Vivek Gupta
Abstract summary: MapVerse is a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps.
Score: 22.530685223300523
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.

Related papers

OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents [68.85365034738534]
We introduce a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split.
arXiv Detail & Related papers (2026-02-19T18:59:54Z)
Thinking on Maps: How Foundation Model Agents Explore, Remember, and Reason Map Environments [10.485672302572368]
Map environments provide a fundamental medium for representing spatial structure. Understanding how foundation model (FM) agents understand and act in such environments is critical for enabling reliable map-based reasoning and applications. We propose an interactive evaluation framework to analyze how FM agents explore, remember, and reason in symbolic map environments.
arXiv Detail & Related papers (2025-12-30T23:04:29Z)
FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models [38.67763789694245]
We introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, far below human performance of 84.87%.
arXiv Detail & Related papers (2025-12-08T20:18:15Z)
CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding [5.925837407110905]
We introduce CartoMapQA, a benchmark to evaluate Visual-Language Models' understanding of cartographic maps. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer.
arXiv Detail & Related papers (2025-12-03T08:25:22Z)
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z)
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [64.58959634712215]
Geologic maps, as fundamental diagrams in geology, provide critical insights into the structure and composition of Earth's subsurface and surface. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding.
arXiv Detail & Related papers (2025-01-10T18:59:42Z)
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models [7.422346909538787]
MapEval is a benchmark designed to assess foundation models across three distinct tasks. It covers spatial relationships, navigation, travel planning, and real-world map interactions. It requires models to handle long-context reasoning, API interactions, and visual map analysis.
arXiv Detail & Related papers (2024-12-31T07:20:32Z)
MapExplorer: New Content Generation from Low-Dimensional Visualizations [60.02149343347818]
Low-dimensional visualizations, or "projection maps," are widely used to interpret large-scale and complex datasets. These visualizations not only aid in understanding existing knowledge spaces but also implicitly guide exploration into unknown areas. We introduce MapExplorer, a novel knowledge discovery task that translates coordinates within any projection map into coherent, contextually aligned textual content.
arXiv Detail & Related papers (2024-12-24T20:16:13Z)
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z)
MAPWise: Evaluating Vision-Language Models for Advanced Map Queries [47.15503716894445]
This study investigates the efficacy of vision-language models (VLMs) in answering questions based on maps. We introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China). Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning.
arXiv Detail & Related papers (2024-08-30T20:57:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.