Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
- URL: http://arxiv.org/abs/2505.11907v1
- Date: Sat, 17 May 2025 08:48:40 GMT
- Title: Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
- Authors: Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu
- Abstract summary: We ask: Are multimodal large language models (MLLMs) ready for omnidirectional spatial reasoning? OSR-Bench is the first benchmark specifically designed for this setting. It includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings.
- Score: 66.88619941063048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: https://huggingface.co/datasets/UUUserna/OSR-Bench
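The two-stage evaluation first scores the cognitive map a model produces and only then its question answering, using rotation-invariant matching so that a map that is correct up to an arbitrary panorama rotation is not penalized. Below is a minimal sketch of that matching idea, assuming the cognitive map is encoded as a square grid of object IDs and agreement is counted over occupied ground-truth cells; the actual OSR-Bench map format and metric may differ.

```python
import numpy as np

def rotation_invariant_match(pred: np.ndarray, gt: np.ndarray) -> float:
    """Score a predicted top-down cognitive map against the ground truth.

    Both maps are square H x H grids of object IDs (0 = empty). Because a
    panorama has no canonical yaw origin, the prediction is scored under all
    four 90-degree rotations and the best agreement is kept.

    Hypothetical sketch, not the released OSR-Bench evaluation code: the grid
    encoding and the per-cell agreement metric are assumptions.
    """
    occupied = gt != 0
    if not occupied.any():
        return 0.0
    best = 0.0
    for k in range(4):  # 0, 90, 180, 270 degrees
        rotated = np.rot90(pred, k)
        agreement = float((rotated[occupied] == gt[occupied]).mean())
        best = max(best, agreement)
    return best

# Toy usage: the prediction matches the ground truth up to a 90-degree turn.
gt = np.array([[1, 0], [0, 2]])
pred = np.rot90(gt, -1)
print(rotation_invariant_match(pred, gt))  # -> 1.0
```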
Related papers
- OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding [16.1559012200066]
We introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes.
arXiv Detail & Related papers (2025-07-10T17:56:07Z) - PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations [75.04864582433879]
PlanQA is a diagnostic benchmark for evaluating geometric and spatial reasoning in large language models. The benchmark includes diverse question types that test not only metric and topological reasoning but also interior design constraints.
arXiv Detail & Related papers (2025-07-10T11:16:48Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method [8.039453341761538]
We introduce OmniVQA, the first dataset, and conduct the first benchmark for omnidirectional visual question answering. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering. We introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct.
arXiv Detail & Related papers (2025-05-20T10:55:26Z) - Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs [41.072699990427374]
Multi-view understanding is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. We propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 real-world scenes. Our experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space [41.18548960865975]
We propose a novel benchmark, Open3DVQA, to comprehensively evaluate the spatial reasoning capacities of state-of-the-art (SOTA) foundation models in open 3D space. Open3DVQA consists of 9k VQA samples, collected using an efficient semi-automated tool in a high-fidelity urban simulator.
arXiv Detail & Related papers (2025-03-14T05:35:38Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints (a minimal sketch of this marking step appears after this list).
We demonstrate that this simple training-free approach brings substantial gains to GPT-4V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning [22.440669015518015]
We evaluate whether multi-modal large language models (MLLMs) possess abstract visual reasoning abilities.
Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns.
We introduce MARVEL, a benchmark of 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.
arXiv Detail & Related papers (2024-04-21T09:15:02Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
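As referenced above, the Coarse Correspondences paper reports that simply overlaying shared object IDs on the input frames helps an MLLM link the same instance across views. The sketch below illustrates that marking step, assuming per-frame track IDs are already available from any off-the-shelf tracker and that only the top-k most frequently appearing tracks are kept; the data layout and selection rule are assumptions, not the authors' implementation.

```python
from collections import Counter
from PIL import Image, ImageDraw

def mark_coarse_correspondences(frames, tracks, top_k=3):
    """Overlay shared instance IDs on frames so an MLLM can link objects
    across viewpoints.

    `frames` is a list of PIL images; `tracks` is a parallel list of
    per-frame detections, each a list of (track_id, (x0, y0, x1, y1)) tuples
    from any off-the-shelf tracker. Hypothetical sketch, not the authors'
    released code.
    """
    # Keep only the objects that appear in the most frames.
    freq = Counter(tid for dets in tracks for tid, _ in dets)
    keep = {tid for tid, _ in freq.most_common(top_k)}

    marked = []
    for img, dets in zip(frames, tracks):
        img = img.copy()
        draw = ImageDraw.Draw(img)
        for tid, (x0, y0, x1, y1) in dets:
            if tid in keep:
                draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
                draw.text((x0 + 4, y0 + 4), str(tid), fill="red")
        marked.append(img)
    return marked

# Toy usage: two blank frames; track 1 appears in both, track 2 only once.
frames = [Image.new("RGB", (320, 240), "white") for _ in range(2)]
tracks = [
    [(1, (40, 40, 120, 120))],
    [(1, (160, 60, 240, 140)), (2, (10, 10, 50, 50))],
]
marked = mark_coarse_correspondences(frames, tracks, top_k=1)
```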