ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
- URL: http://arxiv.org/abs/2510.11549v1
- Date: Mon, 13 Oct 2025 15:51:47 GMT
- Title: ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
- Authors: Liu Yang, Huiyu Duan, Ran Tao, Juntao Cheng, Sijing Wu, Yunhao Li, Jing Liu, Xiongkuo Min, Guangtao Zhai
- Abstract summary: We first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models. We further introduce Omni-CoT, a training-free method which significantly enhances MLLMs' comprehension ability in the omnidirectional environment.
- Score: 86.42854691331713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Omnidirectional images (ODIs) provide a full 360°×180° field of view and are widely adopted in VR, AR, and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both close-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method which significantly enhances MLLMs' comprehension ability in omnidirectional environments through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released upon publication.
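The abstract describes Omni-CoT only at a high level: training-free chain-of-thought reasoning over both textual information and visual cues. The sketch below illustrates what such a two-stage prompting pipeline could look like; the naive viewport cropping, the prompt wording, and the `query_mllm(image, prompt)` helper are all illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of chain-of-thought prompting over an omnidirectional
# image (ODI): first caption a few viewports, then reason over the captions
# plus the full panorama to answer the question.
import numpy as np


def equirect_viewport(odi: np.ndarray, yaw_deg: float, fov_deg: float = 90.0) -> np.ndarray:
    """Crop a rough viewport around a yaw angle from an equirectangular ODI.

    This is a naive horizontal crop; a real pipeline would use a proper
    gnomonic (equirectangular-to-perspective) projection.
    """
    h, w, _ = odi.shape
    center = int((yaw_deg % 360.0) / 360.0 * w)
    half = max(1, int(w * fov_deg / 360.0 / 2))
    cols = [(center + dx) % w for dx in range(-half, half)]
    return odi[:, cols, :]


def omnidirectional_cot(odi: np.ndarray, question: str, query_mllm) -> str:
    """Two-stage CoT: describe viewports first, then reason to a final answer.

    `query_mllm` is a placeholder callable (image, prompt) -> str wrapping
    whatever MLLM API is available.
    """
    observations = []
    for yaw in (0, 90, 180, 270):  # front / right / back / left
        view = equirect_viewport(odi, yaw)
        caption = query_mllm(
            image=view,
            prompt=f"Describe the objects and layout visible toward yaw {yaw} degrees.",
        )
        observations.append(f"[yaw {yaw}] {caption}")

    reasoning_prompt = (
        "You are reasoning about a full 360x180 scene.\n"
        "Viewport observations:\n" + "\n".join(observations) +
        f"\n\nQuestion: {question}\n"
        "Think step by step about the spatial relations across viewports, "
        "then give the final answer."
    )
    return query_mllm(image=odi, prompt=reasoning_prompt)
```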
Related papers
- OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs [72.425061028374]
We introduce OmniVideoBench, a benchmark dedicated to assessing synergistic audio-visual understanding. OmniVideoBench comprises 1,000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.
arXiv Detail & Related papers (2025-10-12T16:34:00Z) - OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding [26.45873982159107]
We present OIG-Bench, a benchmark focused on One-Image Guide understanding across diverse domains. We have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. Results show that Qwen2.5-VL-72B performs the best among the evaluated models, with an overall accuracy of 77%.
arXiv Detail & Related papers (2025-09-29T15:44:08Z) - Dense360: Dense Understanding from Omnidirectional Panoramas [24.862817640267572]
We introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions.
arXiv Detail & Related papers (2025-06-17T12:35:23Z) - Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method [8.039453341761538]
We introduce OmniVQA, the first dataset for omnidirectional visual question answering, and conduct the first benchmark on it. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering. We introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct.
arXiv Detail & Related papers (2025-05-20T10:55:26Z) - Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning? [66.88619941063048]
We ask: Are multimodal large language models (MLLMs) ready for omnidirectional spatial reasoning? OSR-Bench is the first benchmark specifically designed for this setting. It includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models, under zero-shot settings.
arXiv Detail & Related papers (2025-05-17T08:48:40Z) - Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs [41.072699990427374]
Multi-view understanding is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) deployed as embodied agents. We propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 real-world scenes. Our experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.