CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding
- URL: http://arxiv.org/abs/2512.03558v1
- Date: Wed, 03 Dec 2025 08:25:22 GMT
- Title: CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding
- Authors: Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi, Tomoki Oya, Ryoichi Kojima, Masato Taya, Chihiro Ono, Atsunori Minamikawa, Yan Liu
- Abstract summary: We introduce CartoMapQA, a benchmark that evaluates how well Large Vision-Language Models (LVLMs) understand cartographic maps. The dataset includes over 2,000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer.
- Score: 5.925837407110905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of Large Vision-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2,000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid-, and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git
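To make the task format concrete, below is a minimal, hypothetical sketch of how a sample of this kind (a map image, a question, optional answer choices, and a ground-truth answer) might be represented and how the multiple-choice subset could be scored. The field names, skill labels, and exact-match rule are illustrative assumptions, not the released CartoMapQA schema or evaluation code.

```python
# Illustrative sketch only: a hypothetical sample layout and multiple-choice
# scoring loop for a map-QA benchmark like CartoMapQA. Field names, skill
# labels, and the answer-matching rule are assumptions, not the released schema.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MapQASample:
    map_image_path: str            # path to the rendered cartographic map
    question: str                  # e.g., "What does the blue symbol near the river denote?"
    choices: Optional[list[str]]   # None for open-ended questions
    ground_truth: str              # expected answer text or choice label
    skill: str                     # e.g., "symbol_recognition", "scale_interpretation", "routing"

def multiple_choice_accuracy(
    samples: list[MapQASample],
    predict: Callable[[str, str, list[str]], str],
) -> float:
    """Score only the multiple-choice subset; `predict(image_path, question, choices)`
    is any LVLM wrapper that returns one of the given choices."""
    mc = [s for s in samples if s.choices is not None]
    if not mc:
        return 0.0
    correct = sum(
        predict(s.map_image_path, s.question, s.choices).strip().lower()
        == s.ground_truth.strip().lower()
        for s in mc
    )
    return correct / len(mc)
```

Open-ended answers would need looser matching (normalized strings, or a numeric tolerance for scale- and distance-style questions), which is presumably why the benchmark separates the two answer formats.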
Related papers
- MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps [22.530685223300523]
MapVerse is a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps.
arXiv Detail & Related papers (2026-02-11T04:36:14Z) - FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models [38.67763789694245]
We introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, far below human performance of 84.87%.
arXiv Detail & Related papers (2025-12-08T20:18:15Z) - MapIQ: Evaluating Multimodal Large Language Models for Map Question Answering [20.408123315555834]
We introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types. We evaluate multiple MLLMs using six visual analytical tasks, comparing their performance against one another and a human baseline. An experiment examining the impact of map design changes provides insights into the robustness and sensitivity of MLLMs.
arXiv Detail & Related papers (2025-07-15T18:02:57Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - MapQaTor: An Extensible Framework for Efficient Annotation of Map-Based QA Datasets [3.3856216159724983]
We introduce MapQaTor, an open-source framework that streamlines the creation of traceable map-based QA datasets. MapQaTor enables seamless integration with any maps API, allowing users to gather and visualize data from diverse sources.
arXiv Detail & Related papers (2024-12-30T15:33:19Z) - MapExplorer: New Content Generation from Low-Dimensional Visualizations [60.02149343347818]
Low-dimensional visualizations, or "projection maps," are widely used to interpret large-scale and complex datasets. These visualizations not only aid in understanding existing knowledge spaces but also implicitly guide exploration into unknown areas. We introduce MapExplorer, a novel knowledge discovery task that translates coordinates within any projection map into coherent, contextually aligned textual content.
arXiv Detail & Related papers (2024-12-24T20:16:13Z) - Distill Visual Chart Reasoning Ability from LLMs to MLLMs [64.32993770646165]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. ReachQA is a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z) - MAPWise: Evaluating Vision-Language Models for Advanced Map Queries [47.15503716894445]
This study investigates the efficacy of vision-language models (VLMs) in answering questions based on maps.
We introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China).
Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning.
arXiv Detail & Related papers (2024-08-30T20:57:34Z) - CartoMark: a benchmark dataset for map pattern recognition and map content retrieval with machine intelligence [9.652629004863364]
We develop a large-scale benchmark dataset for map text annotation recognition, map scene classification, map super-resolution reconstruction, and map style transferring.
These well-labelled datasets would facilitate the state-of-the-art machine intelligence technologies to conduct map feature detection, map pattern recognition and map content retrieval.
arXiv Detail & Related papers (2023-12-14T01:54:38Z) - Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address the practical yet challenging problem of training robot agents to navigate an environment by following a path described in natural language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z) - Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.