Dense360: Dense Understanding from Omnidirectional Panoramas
- URL: http://arxiv.org/abs/2506.14471v1
- Date: Tue, 17 Jun 2025 12:35:23 GMT
- Title: Dense360: Dense Understanding from Omnidirectional Panoramas
- Authors: Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, Lu Qi,
- Abstract summary: We introduce an omnidirectional panoramas dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions.
- Score: 24.862817640267572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degree), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panoramas dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial continuity along the circle of latitude, and ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
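The abstract does not spell out how ERP-RoPE is constructed, but the continuity issue it targets can be illustrated with a minimal sketch: if the rotary angle for each token is taken from its azimuth on the sphere, using integer harmonic frequencies so the encoding is 2-pi-periodic in longitude, then the left and right borders of the ERP image receive matching positions instead of maximally distant ones. The function names and the periodic-frequency choice below are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def erp_token_angles(h_tokens: int, w_tokens: int):
    """Map each ERP token to (latitude, longitude) in radians.

    Longitude is periodic in [0, 2*pi): the left and right image borders
    are adjacent on the sphere, which is the spatial continuity an
    ERP-aware position encoding must respect. (Pixel area also shrinks
    roughly with cos(latitude), the information-density issue.)
    """
    lat = (np.arange(h_tokens) + 0.5) / h_tokens * np.pi - np.pi / 2  # [-pi/2, pi/2)
    lon = (np.arange(w_tokens) + 0.5) / w_tokens * 2 * np.pi          # [0, 2*pi)
    return lat, lon

def rotary_encode_longitude(x: np.ndarray, lon: np.ndarray):
    """Rotate feature pairs by the token's azimuth (simplified, hypothetical RoPE variant).

    x:   (w_tokens, dim) features for one row of ERP tokens, dim even.
    lon: (w_tokens,) longitudes in radians.
    Integer harmonic frequencies make the rotation 2*pi-periodic, so tokens
    near lon = 0 and lon = 2*pi get nearly identical encodings, unlike a
    linear column index that breaks at the ERP seam.
    """
    d = x.shape[-1] // 2
    freqs = np.arange(1, d + 1)                # integer harmonics -> periodic in longitude
    theta = lon[:, None] * freqs[None, :]      # (w_tokens, d) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Example: encodings for the first and last columns of a 360-degree row
# come out nearly identical, reflecting continuity across the ERP seam.
lat, lon = erp_token_angles(h_tokens=16, w_tokens=32)
row = np.random.randn(32, 64)
encoded = rotary_encode_longitude(row, lon)
```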
Related papers
- ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments? [86.42854691331713]
We first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models. We further introduce Omni-CoT, a training-free method which significantly enhances MLLMs' comprehension ability in the omnidirectional environment.
arXiv Detail & Related papers (2025-10-13T15:51:47Z) - DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World [68.39362698871503]
We present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model.
arXiv Detail & Related papers (2025-06-30T17:51:25Z) - Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z) - Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method [8.039453341761538]
We introduce OmniVQA, the first dataset, and conduct the first benchmark for omnidirectional visual question answering. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering. We introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct.
arXiv Detail & Related papers (2025-05-20T10:55:26Z) - Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning? [66.88619941063048]
We ask: Are multimodal large language models (MLLMs) ready for omnidirectional spatial reasoning? OSR-Bench is the first benchmark specifically designed for this setting. It includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings.
arXiv Detail & Related papers (2025-05-17T08:48:40Z) - DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion [60.45000652592418]
We propose a novel text-driven panoramic generation framework, DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation.
We show that DiffPano can generate consistent, diverse panoramic images with given unseen text descriptions and camera poses.
arXiv Detail & Related papers (2024-10-31T17:57:02Z) - DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception [43.387076189063556]
High-quality image-text datasets offer diverse visual elements and thorough image descriptions.
Current caption engines fall short in providing complete and accurate annotations.
We propose Perceptual Fusion, using a low-budget but highly effective caption engine for complete and accurate image descriptions.
arXiv Detail & Related papers (2024-07-11T08:48:06Z) - SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension [62.40482764691584]
We introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs.
Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs.
We conduct a thorough evaluation involving 34 prominent MLLMs and emphasize the current limitations of MLLMs in text-rich visual comprehension.
arXiv Detail & Related papers (2024-04-25T17:39:35Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Bending Reality: Distortion-aware Transformers for Adapting to Panoramic
Semantic Segmentation [26.09267582056609]
Large quantities of expensive, pixel-wise annotations are crucial for success of robust panoramic segmentation models.
Distortions and the distinct image-feature distribution in 360-degree panoramas impede the transfer from the annotation-rich pinhole domain.
We learn object deformations and panoramic image distortions in Deformable Patch Embedding (DPE) and Deformable MLP (DMLP) components.
Finally, we tie together shared semantics in pinhole- and panoramic feature embeddings by generating multi-scale prototype features.
arXiv Detail & Related papers (2022-03-02T23:00:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.