Related papers: Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images

Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images

URL: http://arxiv.org/abs/2511.07222v1
Date: Mon, 10 Nov 2025 15:44:48 GMT
Title: Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images
Authors: JiaKui Hu, Shanshan Zhao, Qing-Guo Chen, Xuerui Qiu, Jialun Liu, Zhao Xu, Weihua Luo, Kaifu Zhang, Yanye Lu,
Abstract summary: OmniView extends the unified understanding and generation of 3D scenes based on multi-view images.<n>It jointly models scene understanding, novel view synthesis, geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks.
Score: 40.459573512775556
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents Omni-View, which extends the unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that "generation facilitates understanding". Consisting of understanding model, texture module, and geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation.

Related papers

Unified Semantic Transformer for 3D Scene Understanding [55.415468022487005]
We introduce UNITE, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model.<n>Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry.<n>We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models.
arXiv Detail & Related papers (2025-12-16T12:49:35Z)
PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing [51.56943889042673]
PercHead is a method for single-image 3D head reconstruction and semantic 3D editing.<n>We develop a unified base model for reconstructing view-consistent 3D heads from a single input image.<n>We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI.
arXiv Detail & Related papers (2025-11-04T17:59:15Z)
OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving [21.143038784114154]
Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding.<n>We propose a novel human-like framework called OmniScene to integrate multi-view and temporal perception for holistic 4D scene understanding.<n>Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
arXiv Detail & Related papers (2025-09-24T10:28:06Z)
Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [11.222744122842023]
We introduce a plug-and-play module that implicitly incorporates 3D geometry features into Vision-Language-Action (VLA) models.<n>Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z)
Video Perception Models for 3D Scene Synthesis [109.5543506037003]
VIPScene is a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models.<n>VIPScene seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene.
arXiv Detail & Related papers (2025-06-25T16:40:17Z)
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning.<n>VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z)
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis [63.169364481672915]
We propose textbfViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images. Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames.
arXiv Detail & Related papers (2024-09-03T16:53:19Z)
VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding [47.58359136198136]
VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models. It seamlessly integrates various SOTA vision models and brings the automation in the selection of SOTA vision models. It identifies the suitable 3D mesh creation algorithms corresponding to 2D depth maps analysis, generates optimal results based on diverse multimodal inputs.
arXiv Detail & Related papers (2024-03-14T16:13:00Z)
AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation [27.163052958878776]
This paper targets on learning-based novel view synthesis from a single or limited 2D images without the pose supervision. We construct an end-to-end trainable conditional variational framework to disentangle the unsupervisely learned relative-pose/rotation and implicit global 3D representation. Our system can achieve implicitly 3D understanding without explicitly 3D reconstruction.
arXiv Detail & Related papers (2020-07-13T18:51:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.