Omnidirectional Spatial Modeling from Correlated Panoramas
- URL: http://arxiv.org/abs/2509.02164v1
- Date: Tue, 02 Sep 2025 10:14:55 GMT
- Title: Omnidirectional Spatial Modeling from Correlated Panoramas
- Authors: Xinshen Zhang, Tongxi Fu, Xu Zheng
- Abstract summary: Existing omnidirectional methods achieve scene understanding within a single frame while neglecting cross-frame correlated panoramas.
We introduce CFpano, the first benchmark dataset dedicated to cross-frame correlated panorama visual question answering.
We present \methodname, a multi-modal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust and consistent reasoning with cross-frame correlated panoramas.
- Score: 4.75637997496421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Omnidirectional scene understanding is vital for various downstream applications, such as embodied AI, autonomous driving, and immersive environments, yet remains challenging due to geometric distortion and complex spatial relations in 360° imagery. Existing omnidirectional methods achieve scene understanding within a single frame while neglecting cross-frame correlated panoramas. To bridge this gap, we introduce CFpano, the first benchmark dataset dedicated to cross-frame correlated panorama visual question answering in holistic 360° scenes. CFpano consists of over 2,700 images together with over 8,000 question-answer pairs, and the question types include both multiple-choice and open-ended VQA. Building upon our CFpano, we further present \methodname, a multi-modal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust and consistent reasoning with cross-frame correlated panoramas. Benchmark experiments with existing MLLMs are conducted on our CFpano. The experimental results demonstrate that \methodname achieves state-of-the-art performance across both multiple-choice and open-ended VQA tasks, outperforming strong baselines on all major reasoning categories (+5.37% in overall performance). Our analyses validate the effectiveness of GRPO and establish a new benchmark for panoramic scene understanding.
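The GRPO fine-tuning described above is built on group-relative advantages: for each question, a group of responses is sampled and each response's reward is normalized against the group statistics, so no learned value critic is needed. A minimal sketch follows; the multiple-choice exact-match reward here is a hypothetical stand-in, not the paper's tailored reward functions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style critic-free advantage estimate:
    A_i = (r_i - mean(r)) / (std(r) + eps), computed within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def mc_reward(predicted_option, gold_option):
    """Hypothetical multiple-choice reward: 1.0 if correct, else 0.0."""
    return 1.0 if predicted_option == gold_option else 0.0

# Example: a group of 4 sampled answers to one VQA question whose answer is "B".
group = [mc_reward(p, "B") for p in ["B", "A", "B", "C"]]
advantages = group_relative_advantages(group)
```

Responses that beat the group mean get a positive advantage and are reinforced; below-average responses are penalized, which is what drives consistency within each sampled group.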
Related papers
- CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval [54.15776146365823]
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text.
We propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components.
arXiv Detail & Related papers (2026-01-07T09:21:38Z) - Multi-label Classification with Panoptic Context Aggregation Networks [61.82285737410154]
This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts.
PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism.
Experiments on the NUS-WIDE, PASCAL VOC 2007, and MS-COCO benchmarks demonstrate that PanCAN consistently achieves competitive results.
arXiv Detail & Related papers (2025-12-29T14:16:21Z) - Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion [13.39263294343983]
This paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC).
HSACC achieves robust cross-view fusion through a dual-level semantic space design.
Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets.
arXiv Detail & Related papers (2025-10-14T02:58:10Z) - DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training [76.82789568988557]
DiT360 is a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation.
Our method achieves better boundary consistency and image fidelity across eleven quantitative metrics.
arXiv Detail & Related papers (2025-10-13T17:59:15Z) - Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method [8.039453341761538]
We introduce OmniVQA, the first dataset and benchmark for omnidirectional visual question answering.
Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering.
We introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct.
arXiv Detail & Related papers (2025-05-20T10:55:26Z) - Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis [9.561100210295699]
Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs.
DASCO is a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees.
Experiments on two benchmark datasets demonstrate that DASCO achieves state-of-the-art performance in MABSA.
arXiv Detail & Related papers (2025-04-15T16:05:09Z) - PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs [10.970010947605289]
Panoramic Image Generation (PIG) aims to create coherent images of arbitrary lengths.
We propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm.
arXiv Detail & Related papers (2024-11-24T15:06:57Z) - More than the Sum of Its Parts: Ensembling Backbone Networks for Few-Shot Segmentation [49.090592800481616]
We study whether fusing features from different backbones can improve the ability of few-shot segmentation (FSS) models to capture richer visual features.
We propose and compare two ensembling techniques: Independent Voting and Feature Fusion.
Our approach outperforms the original single-backbone PANet across standard benchmarks even in challenging one-shot learning scenarios.
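The two ensembling strategies named above can be contrasted in a small sketch; the per-pixel label format, feature shapes, and fusion-by-concatenation rule are illustrative assumptions, not the paper's exact design.

```python
from collections import Counter

def independent_voting(per_backbone_masks):
    """Each backbone predicts a class label per pixel; the ensemble
    takes a majority vote position by position."""
    fused = []
    for pixel_votes in zip(*per_backbone_masks):
        fused.append(Counter(pixel_votes).most_common(1)[0][0])
    return fused

def feature_fusion(per_backbone_features):
    """Concatenate per-pixel feature vectors from all backbones into a
    single richer representation, to be fused before the prediction head."""
    return [sum(feats, []) for feats in zip(*per_backbone_features)]

# Three backbones, each predicting a 4-pixel mask (0 = background, 1 = foreground).
masks = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]]
voted = independent_voting(masks)

# Two backbones, each producing a 1-d feature per pixel for 2 pixels.
feats = feature_fusion([[[0.1], [0.2]], [[0.3], [0.4]]])
```

Voting combines decisions late and is robust to one weak backbone, while feature fusion combines representations early so a downstream classifier can exploit cross-backbone correlations.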
arXiv Detail & Related papers (2024-02-09T18:01:15Z) - 360 Layout Estimation via Orthogonal Planes Disentanglement and Multi-view Geometric Consistency Perception [56.84921040837699]
Existing panoramic layout estimation solutions tend to recover room boundaries from a vertically compressed sequence, yielding imprecise results.
We propose an orthogonal plane disentanglement network (termed DOPNet) to distinguish ambiguous semantics.
We also present an unsupervised adaptation technique tailored for horizon-depth and ratio representations.
Our solution outperforms other SoTA models on both monocular layout estimation and multi-view layout estimation tasks.
arXiv Detail & Related papers (2023-12-26T12:16:03Z) - Multi-Spectral Image Stitching via Spatial Graph Reasoning [52.27796682972484]
We propose a spatial graph reasoning based multi-spectral image stitching method.
We embed multi-scale complementary features from the same view position into a set of nodes.
By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features.
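The long-range coherence over graph nodes described above is commonly realized as self-attention across node embeddings. A generic sketch under that assumption (the paper's actual graph construction and channel-dimension reasoning are not shown):

```python
import math

def attention_over_nodes(nodes):
    """Scaled dot-product self-attention over a set of node embeddings,
    letting every node aggregate long-range context from all others."""
    d = len(nodes[0])
    # Pairwise similarity scores between every query node and key node.
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in nodes]
              for q in nodes]
    out = []
    for row in scores:
        # Numerically stable softmax over each node's scores.
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Each output is a convex combination of all node embeddings.
        out.append([sum(w * v[i] for w, v in zip(weights, nodes))
                    for i in range(d)])
    return out

# Four nodes, each a 2-d embedding of multi-scale features at one view position.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
updated = attention_over_nodes(nodes)
```

Because each updated embedding is a softmax-weighted average of all nodes, spatially distant but similar regions can directly exchange information, which is the intuition behind using graph reasoning for stitching.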
arXiv Detail & Related papers (2023-07-31T15:04:52Z) - Graph-CoVis: GNN-based Multi-view Panorama Global Pose Estimation [11.8322612639007]
Graph-CoVis is a novel Graph Neural Network based architecture that jointly learns the co-visible structure and global motion.
We show that our model performs competitively to state-of-the-art approaches.
arXiv Detail & Related papers (2023-04-26T00:04:50Z) - Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks.
Experiments on real-world datasets prove the significant effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z) - Capturing Omni-Range Context for Omnidirectional Segmentation [29.738065412097598]
We introduce Efficient Concurrent Attention Networks (ECANets) to bridge the gap in terms of FoV and structural distribution between the imaging domains.
We upgrade model training by leveraging multi-source and omni-supervised learning, taking advantage of both densely labeled and unlabeled data.
Our novel model, training regimen and multisource prediction fusion elevate the performance (mIoU) to new state-of-the-art results.
arXiv Detail & Related papers (2021-03-09T19:46:09Z) - Panoramic Panoptic Segmentation: Towards Complete Surrounding Understanding via Unsupervised Contrastive Learning [97.37544023666833]
We introduce panoramic panoptic segmentation as the most holistic scene understanding.
A complete surrounding understanding provides a maximum of information to the agent.
We propose a framework which allows model training on standard pinhole images and transfers the learned features to a different domain.
arXiv Detail & Related papers (2021-03-01T09:37:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.