TopViewRS: Vision-Language Models as Top-View Spatial Reasoners
- URL: http://arxiv.org/abs/2406.02537v1
- Date: Tue, 4 Jun 2024 17:55:43 GMT
- Title: TopViewRS: Vision-Language Models as Top-View Spatial Reasoners
- Authors: Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, Ivan Vulić
- Abstract summary: Top-view perspective denotes a typical way in which humans read and reason over different types of maps.
We introduce the TopViewRS dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input.
We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity.
- Score: 38.406430696146714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of 'non-human' agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of modern VLMs remain unattested and underexplored. In this work, we thus study their capability to understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different granularity of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input. We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity. Evaluation of 10 representative open- and closed-source VLMs reveals a gap of more than 50% compared to average human performance, and model performance is even lower than the random baseline in some cases. Although additional experiments show that Chain-of-Thought reasoning can boost model capabilities by 5.82% on average, the overall performance of VLMs remains limited. Our findings underscore the critical need for enhanced model capability in top-view spatial reasoning and set a foundation for further research towards human-level proficiency of VLMs in real-world multimodal tasks.
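The abstract compares model accuracy on multiple-choice questions against a random baseline. A minimal sketch of that kind of comparison follows. It assumes four answer options per question (the abstract does not state the number) and uses made-up toy predictions, not TopViewRS data. The function names are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted option equals the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options


# Toy gold labels and model predictions (illustrative only, not from the dataset).
gold = ["A", "C", "B", "D", "A", "B", "C", "D"]
pred = ["A", "B", "B", "A", "C", "B", "D", "D"]

model_acc = accuracy(pred, gold)
baseline = random_baseline(4)  # assumption: four options per question

print(f"model accuracy:  {model_acc:.2f}")
print(f"random baseline: {baseline:.2f}")
print("below random baseline" if model_acc < baseline else "at or above random baseline")
```

Under this framing, a model "lower than the random baseline" is one whose accuracy falls below `1 / num_options` for the task in question.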
Related papers
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the spatial planning capability in these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models [26.839159541015597]
We develop novel benchmarks that cover diverse aspects of spatial reasoning.
Our findings reveal several counter-intuitive insights that have been overlooked in the literature.
We hope our study will inform the development of multimodal models to improve spatial intelligence.
arXiv Detail & Related papers (2024-06-21T03:53:37Z)
- GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences [122.87483437694706]
We launch WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate vision-language models (VLMs).
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo.
Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.
arXiv Detail & Related papers (2024-06-16T20:53:25Z)
- ReMI: A Dataset for Reasoning with Multiple Images [41.954830849939526]
We introduce ReMI, a dataset designed to assess large language models' ability to Reason with Multiple Images.
This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning.
We have benchmarked several cutting-edge LLMs and found a substantial gap between their performance and human-level proficiency.
arXiv Detail & Related papers (2024-06-13T14:37:04Z)
- MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning [22.440669015518015]
We evaluate whether multi-modal large language models (MLLMs) possess abstract visual reasoning abilities.
Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns.
We introduce MARVEL, a benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.
arXiv Detail & Related papers (2024-04-21T09:15:02Z)
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
arXiv Detail & Related papers (2024-01-22T18:01:01Z)
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [81.20804369985376]
We conduct a large-scale subjective experiment collecting a vast number of real human feedbacks on low-level vision.
The constructed Q-Pathway dataset includes 58K detailed human feedbacks on 18,973 images.
We design a GPT-participated conversion to process these feedbacks into diverse-format 200K instruction-response pairs.
arXiv Detail & Related papers (2023-11-12T09:10:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.