Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
- URL: http://arxiv.org/abs/2505.11907v1
- Date: Sat, 17 May 2025 08:48:40 GMT
- Title: Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
- Authors: Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu
- Abstract summary: We ask: Are multimodal large language models (MLLMs) ready for omnidirectional spatial reasoning? OSR-Bench is the first benchmark specifically designed for this setting. It includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings.
- Score: 66.88619941063048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: https://huggingface.co/datasets/UUUserna/OSR-Bench
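The two-stage evaluation first scores the cognitive map a model produces and only then its question answering, using rotation-invariant matching so that a map that is correct up to an arbitrary panorama rotation is not penalized. Below is a minimal sketch of that matching idea, assuming the cognitive map is encoded as a square grid of object IDs and agreement is counted over occupied ground-truth cells; the actual OSR-Bench map format and metric may differ.

```python
import numpy as np

def rotation_invariant_match(pred: np.ndarray, gt: np.ndarray) -> float:
    """Score a predicted top-down cognitive map against the ground truth.

    Both maps are square H x H grids of object IDs (0 = empty). Because a
    panorama has no canonical yaw origin, the prediction is scored under all
    four 90-degree rotations and the best agreement is kept.

    Hypothetical sketch, not the released OSR-Bench evaluation code: the grid
    encoding and the per-cell agreement metric are assumptions.
    """
    occupied = gt != 0
    if not occupied.any():
        return 0.0
    best = 0.0
    for k in range(4):  # 0, 90, 180, 270 degrees
        rotated = np.rot90(pred, k)
        agreement = float((rotated[occupied] == gt[occupied]).mean())
        best = max(best, agreement)
    return best

# Toy usage: the prediction matches the ground truth up to a 90-degree turn.
gt = np.array([[1, 0], [0, 2]])
pred = np.rot90(gt, -1)
print(rotation_invariant_match(pred, gt))  # -> 1.0
```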
Related papers
- OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding [16.1559012200066]
We introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes.
arXiv Detail & Related papers (2025-07-10T17:56:07Z) - PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations [75.04864582433879]
PlanQA is a diagnostic benchmark for evaluating geometric and spatial reasoning in large language models. The benchmark includes diverse question types that test not only metric and topological reasoning but also interior design constraints.
arXiv Detail & Related papers (2025-07-10T11:16:48Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method [8.039453341761538]
We introduce OmniVQA, the first dataset, and conduct the first benchmark for omnidirectional visual question answering. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering. We introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct.
arXiv Detail & Related papers (2025-05-20T10:55:26Z) - Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs [41.072699990427374]
Multi-view understanding is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. We propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 real-world scenes. Our experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space [41.18548960865975]
We propose a novel benchmark, Open3DVQA, to comprehensively evaluate the spatial reasoning capacities of state-of-the-art (SOTA) foundation models in open 3D space. Open3DVQA consists of 9k VQA samples, collected using an efficient semi-automated tool in a high-fidelity urban simulator.
arXiv Detail & Related papers (2025-03-14T05:35:38Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints (a minimal sketch of this marking step appears after this list).
We demonstrate that this simple training-free approach brings substantial gains to GPT-4V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning [22.440669015518015]
We evaluate whether multi-modal large language models (MLLMs) possess abstract visual reasoning abilities.
Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns.
We introduce MARVEL, a benchmark of 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.
arXiv Detail & Related papers (2024-04-21T09:15:02Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
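As referenced above, the Coarse Correspondences paper reports that simply overlaying shared object IDs on the input frames helps an MLLM link the same instance across views. The sketch below illustrates that marking step, assuming per-frame track IDs are already available from any off-the-shelf tracker and that only the top-k most frequently appearing tracks are kept; the data layout and selection rule are assumptions, not the authors' implementation.

```python
from collections import Counter
from PIL import Image, ImageDraw

def mark_coarse_correspondences(frames, tracks, top_k=3):
    """Overlay shared instance IDs on frames so an MLLM can link objects
    across viewpoints.

    `frames` is a list of PIL images; `tracks` is a parallel list of
    per-frame detections, each a list of (track_id, (x0, y0, x1, y1)) tuples
    from any off-the-shelf tracker. Hypothetical sketch, not the authors'
    released code.
    """
    # Keep only the objects that appear in the most frames.
    freq = Counter(tid for dets in tracks for tid, _ in dets)
    keep = {tid for tid, _ in freq.most_common(top_k)}

    marked = []
    for img, dets in zip(frames, tracks):
        img = img.copy()
        draw = ImageDraw.Draw(img)
        for tid, (x0, y0, x1, y1) in dets:
            if tid in keep:
                draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
                draw.text((x0 + 4, y0 + 4), str(tid), fill="red")
        marked.append(img)
    return marked

# Toy usage: two blank frames; track 1 appears in both, track 2 only once.
frames = [Image.new("RGB", (320, 240), "white") for _ in range(2)]
tracks = [
    [(1, (40, 40, 120, 120))],
    [(1, (160, 60, 240, 140)), (2, (10, 10, 50, 50))],
]
marked = mark_coarse_correspondences(frames, tracks, top_k=1)
```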