MAPWise: Evaluating Vision-Language Models for Advanced Map Queries
- URL: http://arxiv.org/abs/2409.00255v1
- Date: Fri, 30 Aug 2024 20:57:34 GMT
- Title: MAPWise: Evaluating Vision-Language Models for Advanced Map Queries
- Authors: Srija Mukhopadhyay, Abhishek Rajgaria, Prerana Khatiwada, Vivek Gupta, Dan Roth
- Abstract summary: This study investigates the efficacy of vision-language models (VLMs) in answering questions based on maps.
We introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing 1000 questions.
Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning.
- Score: 47.15503716894445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and linguistic information. A particularly promising yet under-explored application for these models lies in answering questions based on various kinds of maps. This study investigates the efficacy of VLMs in answering questions based on choropleth maps, which are widely used for data analysis and representation. To facilitate and encourage research in this area, we introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing 1000 questions. Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning. It also includes maps with discrete and continuous values, encompassing variations in color-mapping, category ordering, and stylistic patterns, enabling comprehensive analysis. We evaluate the performance of multiple VLMs on this benchmark, highlighting gaps in their abilities and providing insights for improving such models.
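Since the abstract describes scoring multiple VLMs on (map image, question, answer) triples, the short Python sketch below illustrates one way such an evaluation loop could look. The record schema, field names, the normalize() rule, and the dummy_ask stub are assumptions made for illustration, not the authors' released code.

```python
# Minimal sketch of an exact-match evaluation loop for a choropleth-map QA
# benchmark. The record schema, field names, and scoring rule are assumptions
# for illustration; they are not taken from the MAPWise release.
from typing import Callable, Dict, List

def normalize(answer: str) -> str:
    """Lowercase and strip whitespace so 'Texas ' matches 'texas'."""
    return answer.strip().lower()

def evaluate(records: List[Dict[str, str]], ask: Callable[[str, str], str]) -> float:
    """Call ask(map_image, question) for every record and report exact-match accuracy."""
    correct = sum(
        normalize(ask(r["map_image"], r["question"])) == normalize(r["answer"])
        for r in records
    )
    return correct / len(records)

if __name__ == "__main__":
    # Toy in-memory record (made up for this sketch); a real run would load the
    # benchmark's question files for the US, India, and China map sets.
    sample = [{"map_image": "us_map_01.png",
               "question": "Which state is shaded with the darkest color?",
               "answer": "California"}]
    dummy_ask = lambda image, question: "california"  # stand-in for a real VLM call
    print(f"Exact-match accuracy: {evaluate(sample, dummy_ask):.2f}")
```

In practice one would replace dummy_ask with a wrapper around an actual VLM API and likely add a relaxed-match rule for numeric answers, since the benchmark mixes maps with discrete and continuous values.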
Related papers
- Targeted Visual Prompting for Medical Visual Question Answering [3.600327818936722]
Multimodal large language models (MLLMs) have emerged as an alternative to classical model architectures.
Simple visual errors cast doubt on the actual visual understanding abilities of these models.
This paper introduces targeted visual prompting to equip MLLMs with region-based questioning capabilities.
arXiv Detail & Related papers (2024-08-06T08:58:20Z)
- Unraveling the Truth: Do VLMs really Understand Charts? A Deep Dive into Consistency and Robustness [47.68358935792437]
Chart question answering (CQA) is a crucial area of Visual Language Understanding.
The consistency and robustness of current Visual Language Models (VLMs) in this field remain under-explored.
This paper evaluates state-of-the-art VLMs on comprehensive datasets.
arXiv Detail & Related papers (2024-07-15T20:29:24Z)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary and open-source models.
arXiv Detail & Related papers (2024-06-26T17:50:11Z)
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z)
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [40.972648044298374]
Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks.
They often lack interpretability and struggle with complex visual inputs.
We introduce the large-scale Visual CoT dataset comprising 438k question-answer pairs.
We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts.
arXiv Detail & Related papers (2024-03-25T17:59:23Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- MapQA: A Dataset for Question Answering on Choropleth Maps [12.877773112674506]
We present MapQA, a large-scale dataset of 800K question-answer pairs over 60K map images.
Our task tests various levels of map understanding, from surface questions about map styles to complex questions that require reasoning on the underlying data.
We also present a novel algorithm, Visual Multi-Output Data Extraction based QA (V-MODEQA) for MapQA.
arXiv Detail & Related papers (2022-11-15T22:31:38Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account information from multiple modalities, including text, layout, and visual images, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.