Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
- URL: http://arxiv.org/abs/2507.22052v1
- Date: Tue, 29 Jul 2025 17:55:58 GMT
- Title: Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
- Authors: Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi
- Abstract summary: Ov3R is a framework for semantic 3D reconstruction from RGB video streams. CLIP3R predicts dense point maps from overlapping clips while embedding object-level semantics. 2D-3D OVS lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
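The core idea of the 2D-3D OVS module is to pair each predicted 3D point with a 2D vision-language feature, then answer open-vocabulary queries by comparing point features against text embeddings. The sketch below illustrates that lifting-and-querying step in minimal form; the function names, shapes, and use of plain cosine similarity are illustrative assumptions, not the paper's actual architecture (which learns fused spatial-geometric-semantic descriptors).

```python
import numpy as np

def lift_features(point_map, feat_map):
    """Pair each 3D point in a dense point map (H, W, 3) with its
    per-pixel feature vector (H, W, D), e.g. a CLIP-derived feature."""
    h, w, _ = point_map.shape
    points = point_map.reshape(-1, 3)            # (N, 3) with N = H * W
    feats = feat_map.reshape(h * w, -1)          # (N, D)
    # L2-normalize so dot products below are cosine similarities.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return points, feats

def label_points(point_feats, text_embs):
    """Open-vocabulary labeling: assign each 3D point the index of the
    text embedding (num_labels, D) with the highest cosine similarity."""
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = point_feats @ text_embs.T             # (N, num_labels)
    return sims.argmax(axis=1)                   # (N,) label indices
```

In this simplified view, segmentation quality hinges entirely on how well the per-point features are fused; Ov3R's contribution is precisely that the reconstruction itself is CLIP-informed, so geometry and semantics stay aligned rather than being stitched together after the fact.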
Related papers
- RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation [10.067978300536486]
We develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates.
arXiv Detail & Related papers (2025-05-21T11:07:25Z)
- PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM [105.01907579424362]
PanoSLAM is the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from RGB-D video.
arXiv Detail & Related papers (2024-12-31T08:58:10Z)
- Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z)
- MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors [11.118490283303407]
We propose a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D.
Our method produces accurate semantics and geometry in both 3D and 2D space.
arXiv Detail & Related papers (2024-09-21T05:12:13Z)
- ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art performance in semantic scene completion on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- 3D Shape Reconstruction from 2D Images with Disentangled Attribute Flow [61.62796058294777]
Reconstructing 3D shape from a single 2D image is a challenging task.
Most previous methods still struggle to extract semantic attributes for the 3D reconstruction task.
We propose 3DAttriFlow to disentangle and extract semantic attributes through different semantic levels in the input images.
arXiv Detail & Related papers (2022-03-29T02:03:31Z)
- Semantic Dense Reconstruction with Consistent Scene Segments [33.0310121044956]
We propose a method for dense semantic 3D scene reconstruction from an RGB-D sequence to solve high-level scene understanding tasks.
First, each RGB-D pair is consistently segmented into 2D semantic maps based on a camera tracking backbone.
A dense 3D mesh model of an unknown environment is incrementally generated from the input RGB-D sequence.
arXiv Detail & Related papers (2021-09-30T03:01:17Z)