Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene
Representation
- URL: http://arxiv.org/abs/2310.03923v1
- Date: Thu, 5 Oct 2023 21:57:36 GMT
- Title: Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene
Representation
- Authors: Kashu Yamazaki, Taisei Hanyu, Khoa Vo, Thang Pham, Minh Tran,
Gianfranco Doretto, Anh Nguyen, Ngan Le
- Abstract summary: Open-Fusion is a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation.
It harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension.
It delivers outstanding annotation-free open-vocabulary 3D segmentation without requiring additional 3D training.
- Score: 13.770613689032503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Precise 3D environmental mapping is pivotal in robotics. Existing methods
often rely on predefined concepts during training or are time-intensive when
generating semantic maps. This paper presents Open-Fusion, a groundbreaking
approach for real-time open-vocabulary 3D mapping and queryable scene
representation using RGB-D data. Open-Fusion harnesses the power of a
pre-trained vision-language foundation model (VLFM) for open-set semantic
comprehension and employs the Truncated Signed Distance Function (TSDF) for
swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based
embeddings and their associated confidence maps. These are then integrated with
3D knowledge from TSDF using an enhanced Hungarian-based feature-matching
mechanism. Notably, Open-Fusion delivers outstanding annotation-free
open-vocabulary 3D segmentation without requiring additional 3D training.
Benchmark tests on the ScanNet dataset against leading zero-shot methods
highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the
strengths of region-based VLFM and TSDF, facilitating real-time 3D scene
comprehension that includes object concepts and open-world semantics. We
encourage the readers to view the demos on our project page:
https://uark-aicv.github.io/OpenFusion
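As a rough illustration of the pipeline sketched in the abstract, the following snippet is a minimal sketch under assumptions, not the authors' implementation: the function names, the confidence-weighted fusion rule, and the random placeholder embeddings are all invented for illustration. It matches per-frame region embeddings from a VLFM against persistent 3D segment embeddings with the Hungarian algorithm and ranks segments against an open-vocabulary text query by cosine similarity.

```python
# Minimal sketch (not the authors' code) of the integration idea described above:
# per-frame region embeddings from a VLFM are associated with persistent 3D
# segments via Hungarian matching on cosine distance, fused with confidence
# weighting, and later ranked against an open-vocabulary text query.
import numpy as np
from scipy.optimize import linear_sum_assignment


def l2_normalize(x, eps=1e-8):
    # Normalize rows so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def match_regions_to_segments(region_emb, segment_emb):
    """Hungarian assignment between current-frame regions and existing 3D segments.

    region_emb:  (R, D) VLFM embeddings of regions detected in the current frame.
    segment_emb: (S, D) running embeddings of 3D segments already in the map.
    Returns matched (region_indices, segment_indices).
    """
    cost = 1.0 - l2_normalize(region_emb) @ l2_normalize(segment_emb).T
    return linear_sum_assignment(cost)


def fuse_embedding(seg_emb, seg_conf, reg_emb, reg_conf):
    """Confidence-weighted running average (an assumed update rule, not
    necessarily the paper's exact formulation)."""
    total = seg_conf + reg_conf
    fused = (seg_conf * seg_emb + reg_conf * reg_emb) / max(total, 1e-8)
    return l2_normalize(fused), total


def query_segments(segment_emb, text_emb):
    """Cosine similarity of every 3D segment to an open-vocabulary text embedding."""
    return l2_normalize(segment_emb) @ l2_normalize(text_emb)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(5, 512))    # placeholder VLFM region embeddings
    segments = rng.normal(size=(8, 512))   # placeholder 3D segment embeddings
    rows, cols = match_regions_to_segments(regions, segments)
    fused, conf = fuse_embedding(segments[cols[0]], 0.6, regions[rows[0]], 0.8)
    scores = query_segments(segments, rng.normal(size=512))  # placeholder text query
    print(list(zip(rows.tolist(), cols.tolist())), int(scores.argmax()))
```

A real system would replace the placeholder arrays with VLFM region embeddings, per-region confidences, and a text encoder's query embedding, and would attach the fused segment embeddings to the TSDF-reconstructed geometry.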
Related papers
- OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding [17.524454394142477]
We present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding.
We also introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds.
Our method outperforms existing methods in 3D object understanding and scene reconstruction quality, and we showcase its effectiveness in language-guided scene interaction.
arXiv Detail & Related papers (2025-08-02T02:22:36Z)
- OpenFusion++: An Open-vocabulary Real-time Scene Understanding System [4.470499157873342]
We present OpenFusion++, a TSDF-based real-time 3D semantic-geometric reconstruction system.
Our approach refines 3D point clouds by fusing confidence maps from foundation models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual-path encoding framework.
Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.
arXiv Detail & Related papers (2025-04-27T14:46:43Z)
- Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D [68.23391872643268]
LOCATE 3D is a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp".
It operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.
arXiv Detail & Related papers (2025-04-19T02:51:24Z)
- VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding [57.04804711488706]
3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding.
We present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images.
arXiv Detail & Related papers (2024-10-17T17:59:55Z)
- OpenSU3D: Open World 3D Scene Understanding using Foundation Models [2.1262749936758216]
We present a novel, scalable approach for constructing open set, instance-level 3D scene representations.
Existing methods require pre-constructed 3D scenes and face scalability issues due to per-point feature vector learning.
We evaluate our proposed approach on multiple scenes from ScanNet and Replica datasets demonstrating zero-shot generalization capabilities.
arXiv Detail & Related papers (2024-07-19T13:01:12Z)
- OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding [54.981605111365056]
This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding.
Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing.
arXiv Detail & Related papers (2024-06-04T07:42:33Z)
- Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding [41.96929575241655]
We introduce OV-SAM3D, a training-free method for open-vocabulary 3D scene understanding.
This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene.
Empirical evaluations on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments.
arXiv Detail & Related papers (2024-05-24T14:07:57Z)
- OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views [90.71215823587875]
We propose OpenNeRF which naturally operates on posed images and directly encodes the VLM features within the NeRF.
Our work shows that using pixel-wise VLM features results in an overall less complex architecture without the need for additional DINO regularization.
For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.
arXiv Detail & Related papers (2024-04-04T17:59:08Z)
- Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships [15.513180297629546]
We present Open3DSG, an alternative approach to learn 3D scene graph prediction in an open world without requiring labeled scene graph data.
We co-embed the features from a 3D scene graph prediction backbone with the feature space of powerful open world 2D vision language foundation models.
arXiv Detail & Related papers (2024-02-19T16:15:03Z)
- ConceptFusion: Open-set Multimodal 3D Mapping [91.23054486724402]
ConceptFusion is a scene representation that is fundamentally open-set.
It enables reasoning beyond a closed set of concepts and is inherently multimodal.
We evaluate ConceptFusion on a number of real-world datasets.
arXiv Detail & Related papers (2023-02-14T18:40:26Z)
- Diffusion-SDF: Text-to-Shape via Voxelized Diffusion [90.85011923436593]
We propose a new generative 3D modeling framework called Diffusion-SDF for the challenging task of text-to-shape synthesis.
We show that Diffusion-SDF generates both higher quality and more diversified 3D shapes that conform well to given text descriptions.
arXiv Detail & Related papers (2022-12-06T19:46:47Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
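The OpenScene entry above describes dense per-point features co-embedded with text and image pixels in CLIP space, which is what makes zero-shot, open-vocabulary queries possible. The sketch below is an assumption-based illustration, not OpenScene's released code: the random arrays stand in for real per-point features and for CLIP text embeddings of prompts such as "a chair". It simply labels each point by its nearest text prompt under cosine similarity.

```python
# Assumption-based sketch (not OpenScene's released code): per-point features
# living in CLIP space are labeled zero-shot by the closest text prompt.
import numpy as np


def normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def label_points(point_feats, text_feats, labels):
    """point_feats: (N, D) per-point features co-embedded with CLIP.
    text_feats:  (K, D) CLIP text embeddings of K candidate prompts.
    Returns one label per point plus the full (N, K) similarity matrix."""
    sims = normalize(point_feats) @ normalize(text_feats).T
    best = sims.argmax(axis=1)
    return [labels[i] for i in best], sims


if __name__ == "__main__":
    # Placeholder arrays; a real pipeline would use features predicted by the 3D
    # model and CLIP text embeddings of prompts such as "a chair in a room".
    rng = np.random.default_rng(1)
    point_feats = rng.normal(size=(1000, 512))
    labels = ["chair", "table", "sofa", "floor"]
    text_feats = rng.normal(size=(len(labels), 512))
    names, sims = label_points(point_feats, text_feats, labels)
    print(names[:5], sims.shape)
```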
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.