Related papers: FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

URL: http://arxiv.org/abs/2512.08016v1
Date: Mon, 08 Dec 2025 20:18:15 GMT
Title: FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models
Authors: Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, Hadi Askari, Nan Xu, Muhao Chen, Yao-Yi Chiang,
Abstract summary: We introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs.<n> FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation)<n>Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, far below human performance of 84.87%.
Score: 38.67763789694245
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model studies on map visual question-answering often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.

Related papers

EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery [16.896854321525918]
Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations.<n>We propose textbfEarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery.<n>The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons
arXiv Detail & Related papers (2026-02-17T06:08:43Z)
MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps [22.530685223300523]
MapVerse is a large-scale benchmark built on real-world maps.<n>It comprises 11,837 human-authored question-answer pairs across 1,025 maps.<n>We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps.
arXiv Detail & Related papers (2026-02-11T04:36:14Z)
CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding [5.925837407110905]
We introduce CartoMapQA, a benchmark to evaluate Visual-Language Models' understanding of cartographic maps.<n>The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer.
arXiv Detail & Related papers (2025-12-03T08:25:22Z)
MapIQ: Evaluating Multimodal Large Language Models for Map Question Answering [20.408123315555834]
We introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types.<n>We evaluate multiple MLLMs using six visual analytical tasks, comparing their performance against one another and a human baseline.<n>An experiment examining the impact of map design changes provides insights into the robustness and sensitivity of MLLMs.
arXiv Detail & Related papers (2025-07-15T18:02:57Z)
EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation [50.433911327489554]
We introduce EarthMapper, a novel framework for controllable satellite-map translation.<n>We also contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities.<n> experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance.
arXiv Detail & Related papers (2025-04-28T02:41:12Z)
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [64.58959634712215]
Geologic map, as a fundamental diagram in geology science, provides critical insights into the structure and composition of Earth's subsurface and surface.<n>Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding.<n>To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding.
arXiv Detail & Related papers (2025-01-10T18:59:42Z)
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models [7.422346909538787]
MapEval is a benchmark designed to assess foundation models across three distinct tasks.<n>It covers spatial relationships, navigation, travel planning, and real-world map interactions.<n>It requires models to handle long-context reasoning, API interactions, and visual map analysis.
arXiv Detail & Related papers (2024-12-31T07:20:32Z)
MAPWise: Evaluating Vision-Language Models for Advanced Map Queries [47.15503716894445]
This study investigates the efficacy of vision-language models (VLMs) in answering questions based on maps. We introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China) Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning.
arXiv Detail & Related papers (2024-08-30T20:57:34Z)
GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE. We collect data from open-released geographic resources and introduce six natural language understanding tasks. We pro vide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z)
BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN) We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
Structured Landmark Detection via Topology-Adapting Deep Graph Learning [75.20602712947016]
We present a new topology-adapting deep graph learning approach for accurate anatomical facial and medical landmark detection. The proposed method constructs graph signals leveraging both local image features and global shape features. Experiments are conducted on three public facial image datasets (WFLW, 300W, and COFW-68) as well as three real-world X-ray medical datasets (Cephalometric (public), Hand and Pelvis)
arXiv Detail & Related papers (2020-04-17T11:55:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.