ConceptFusion: Open-set Multimodal 3D Mapping
- URL: http://arxiv.org/abs/2302.07241v3
- Date: Mon, 23 Oct 2023 14:56:15 GMT
- Title: ConceptFusion: Open-set Multimodal 3D Mapping
- Authors: Krishna Murthy Jatavallabhula and Alihusein Kuwajerwala and Qiao Gu
and Mohd Omama and Tao Chen and Alaa Maalouf and Shuang Li and Ganesh Iyer
and Soroush Saryazdi and Nikhil Keetha and Ayush Tewari and Joshua B.
Tenenbaum and Celso Miguel de Melo and Madhava Krishna and Liam Paull and
Florian Shkurti and Antonio Torralba
- Abstract summary: ConceptFusion is a scene representation that is fundamentally open-set.
It enables reasoning beyond a closed set of concepts and is inherently multimodal.
We evaluate ConceptFusion on a number of real-world datasets.
- Score: 91.23054486724402
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Building 3D maps of the environment is central to robot navigation, planning,
and interaction with objects in a scene. Most existing approaches that
integrate semantic concepts with 3D maps largely remain confined to the
closed-set setting: they can only reason about a finite set of concepts,
pre-defined at training time. Further, these maps can only be queried using
class labels, or in recent work, using text prompts.
We address both these issues with ConceptFusion, a scene representation that
is (i) fundamentally open-set, enabling reasoning beyond a closed set of
concepts, and (ii) inherently multimodal, enabling a diverse range of possible
queries to the 3D map, from language, to images, to audio, to 3D geometry, all
working in concert. ConceptFusion leverages the open-set capabilities of
today's foundation models pre-trained on internet-scale data to reason about
concepts across modalities such as natural language, images, and audio. We
demonstrate that pixel-aligned open-set features can be fused into 3D maps via
traditional SLAM and multi-view fusion approaches. This enables effective
zero-shot spatial reasoning without any additional training or finetuning,
and retains long-tailed concepts better than supervised approaches,
outperforming them by a margin of more than 40% in 3D IoU. We extensively evaluate
ConceptFusion on a number of real-world datasets, simulated home environments,
a real-world tabletop manipulation task, and an autonomous driving platform. We
showcase new avenues for blending foundation models with 3D open-set multimodal
mapping.
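To make the fusion step concrete, here is a minimal numpy sketch of the general idea, not the authors' implementation: per-pixel features (e.g., pixel-aligned CLIP features) are back-projected using depth and camera pose, then averaged across views into a global map. The function names, the voxel-grid association, and the running-mean fusion rule are illustrative assumptions; the actual pipeline builds on traditional SLAM and multi-view fusion machinery.

```python
# Hedged sketch of multi-view feature fusion; all names here are illustrative.
import numpy as np

def backproject(depth, K):
    """Lift a depth image (H, W) into camera-frame 3D points (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def fuse_frame(map_pts, map_feats, map_counts, depth, feats, K, T, voxel=0.05):
    """Accumulate pixel-aligned features into a voxel-keyed global map.

    map_pts / map_feats / map_counts are dicts keyed by voxel index;
    feats is an (H, W, D) pixel-aligned feature image for this frame.
    """
    pts_cam = backproject(depth, K)
    pts_world = pts_cam @ T[:3, :3].T + T[:3, 3]   # apply camera-to-world pose
    f = feats.reshape(-1, feats.shape[-1])
    valid = depth.reshape(-1) > 0                  # drop pixels with no depth
    pts_world, f = pts_world[valid], f[valid]
    keys = np.floor(pts_world / voxel).astype(np.int64)
    for key, p, fv in zip(map(tuple, keys), pts_world, f):
        if key not in map_feats:
            map_pts[key] = p
            map_feats[key] = fv.astype(np.float64)
            map_counts[key] = 1
        else:
            map_counts[key] += 1
            # Incremental mean of features over all views seeing this voxel.
            map_feats[key] += (fv - map_feats[key]) / map_counts[key]
```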
For more information, visit our project page https://concept-fusion.github.io
or watch our 5-minute explainer video
https://www.youtube.com/watch?v=rkXgws8fiDs
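Multimodal querying of such a map then reduces to nearest-neighbor search in the shared embedding space. A hedged sketch, assuming the voxel-keyed `map_feats` from the snippet above; `embed_text`, `embed_image`, and `embed_audio` are hypothetical stand-ins for encoders (such as CLIP or an audio-text model) that embed into the same space:

```python
# Hedged sketch of open-set querying via cosine similarity.
import numpy as np

def query_map(map_feats, query_vec, top_k=100):
    """Rank fused map points by cosine similarity to a query embedding."""
    keys = list(map_feats.keys())
    F = np.stack([map_feats[k] for k in keys])            # (N, D)
    F /= np.linalg.norm(F, axis=1, keepdims=True) + 1e-8  # unit-normalize
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    sims = F @ q                                          # (N,) cosine scores
    order = np.argsort(-sims)[:top_k]
    return [keys[i] for i in order], sims[order]

# Any modality whose encoder shares the map's embedding space can drive the
# same call, e.g.:
#   hits, scores = query_map(map_feats, embed_text("where did I leave my mug?"))
#   hits, scores = query_map(map_feats, embed_image(photo_of_mug))
```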
Related papers
- OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation [30.76201018651464]
Traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a model for a single task with supervision.
We propose OpenOcc, a novel framework unifying 3D scene reconstruction and open-vocabulary understanding with neural radiance fields.
We show that our approach achieves competitive performance in 3D scene understanding tasks, especially for small and long-tail objects.
arXiv Detail & Related papers (2024-03-18T13:53:48Z)
- Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation [13.770613689032503]
Open-Fusion is an approach for real-time open-vocabulary 3D mapping and queryable scene representation.
It harnesses a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension.
It delivers annotation-free open-vocabulary 3D segmentation without requiring additional 3D training.
arXiv Detail & Related papers (2023-10-05T21:57:36Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output into 3D through multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing an object in a 3D scene referred to by a natural-language description.
We propose a dense 3D grounding network featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
Vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision.
arXiv Detail & Related papers (2023-03-08T17:30:58Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
- Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes [50.317223783035075]
We present a new framework to reconstruct holistic 3D indoor scenes from single-view images.
We propose an instance-aligned implicit function (InstPIFu) for detailed object reconstruction.
Our code and model will be made publicly available.
arXiv Detail & Related papers (2022-07-18T14:54:57Z)
- Disentangling 3D Prototypical Networks For Few-Shot Concept Learning [29.02523358573336]
We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene.
Our networks incorporate architectural biases that reflect the image-formation process, the 3D geometry of the world scene, and the shape-style interplay.
arXiv Detail & Related papers (2020-11-06T14:08:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.