ConceptFusion: Open-set Multimodal 3D Mapping
- URL: http://arxiv.org/abs/2302.07241v3
- Date: Mon, 23 Oct 2023 14:56:15 GMT
- Title: ConceptFusion: Open-set Multimodal 3D Mapping
- Authors: Krishna Murthy Jatavallabhula and Alihusein Kuwajerwala and Qiao Gu
and Mohd Omama and Tao Chen and Alaa Maalouf and Shuang Li and Ganesh Iyer
and Soroush Saryazdi and Nikhil Keetha and Ayush Tewari and Joshua B.
Tenenbaum and Celso Miguel de Melo and Madhava Krishna and Liam Paull and
Florian Shkurti and Antonio Torralba
- Abstract summary: ConceptFusion is a scene representation that is fundamentally open-set.
It enables reasoning beyond a closed set of concepts and is inherently multimodal.
We evaluate ConceptFusion on a number of real-world datasets.
- Score: 91.23054486724402
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Building 3D maps of the environment is central to robot navigation, planning,
and interaction with objects in a scene. Most existing approaches that
integrate semantic concepts with 3D maps largely remain confined to the
closed-set setting: they can only reason about a finite set of concepts,
pre-defined at training time. Further, these maps can only be queried using
class labels, or in recent work, using text prompts.
We address both these issues with ConceptFusion, a scene representation that
is (i) fundamentally open-set, enabling reasoning beyond a closed set of
concepts, and (ii) inherently multimodal, enabling a diverse range of possible
queries to the 3D map, from language, to images, to audio, to 3D geometry, all
working in concert. ConceptFusion leverages the open-set capabilities of
today's foundation models pre-trained on internet-scale data to reason about
concepts across modalities such as natural language, images, and audio. We
demonstrate that pixel-aligned open-set features can be fused into 3D maps via
traditional SLAM and multi-view fusion approaches. This enables effective
zero-shot spatial reasoning without any additional training or finetuning,
and retains long-tailed concepts better than supervised approaches,
outperforming them by a margin of more than 40% in 3D IoU. We extensively evaluate
ConceptFusion on a number of real-world datasets, simulated home environments,
a real-world tabletop manipulation task, and an autonomous driving platform. We
showcase new avenues for blending foundation models with 3D open-set multimodal
mapping.
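To make the fusion step concrete, here is a minimal numpy sketch of the general idea, not the authors' implementation: per-pixel features (e.g., pixel-aligned CLIP features) are back-projected using depth and camera pose, then averaged across views into a global map. The function names, the voxel-grid association, and the running-mean fusion rule are illustrative assumptions; the actual pipeline builds on traditional SLAM and multi-view fusion machinery.

```python
# Hedged sketch of multi-view feature fusion; all names here are illustrative.
import numpy as np

def backproject(depth, K):
    """Lift a depth image (H, W) into camera-frame 3D points (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def fuse_frame(map_pts, map_feats, map_counts, depth, feats, K, T, voxel=0.05):
    """Accumulate pixel-aligned features into a voxel-keyed global map.

    map_pts / map_feats / map_counts are dicts keyed by voxel index;
    feats is an (H, W, D) pixel-aligned feature image for this frame.
    """
    pts_cam = backproject(depth, K)
    pts_world = pts_cam @ T[:3, :3].T + T[:3, 3]   # apply camera-to-world pose
    f = feats.reshape(-1, feats.shape[-1])
    valid = depth.reshape(-1) > 0                  # drop pixels with no depth
    pts_world, f = pts_world[valid], f[valid]
    keys = np.floor(pts_world / voxel).astype(np.int64)
    for key, p, fv in zip(map(tuple, keys), pts_world, f):
        if key not in map_feats:
            map_pts[key] = p
            map_feats[key] = fv.astype(np.float64)
            map_counts[key] = 1
        else:
            map_counts[key] += 1
            # Incremental mean of features over all views seeing this voxel.
            map_feats[key] += (fv - map_feats[key]) / map_counts[key]
```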
For more information, visit our project page https://concept-fusion.github.io
or watch our 5-minute explainer video
https://www.youtube.com/watch?v=rkXgws8fiDs
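Multimodal querying of such a map then reduces to nearest-neighbor search in the shared embedding space. A hedged sketch, assuming the voxel-keyed `map_feats` from the snippet above; `embed_text`, `embed_image`, and `embed_audio` are hypothetical stand-ins for encoders (such as CLIP or an audio-text model) that embed into the same space:

```python
# Hedged sketch of open-set querying via cosine similarity.
import numpy as np

def query_map(map_feats, query_vec, top_k=100):
    """Rank fused map points by cosine similarity to a query embedding."""
    keys = list(map_feats.keys())
    F = np.stack([map_feats[k] for k in keys])            # (N, D)
    F /= np.linalg.norm(F, axis=1, keepdims=True) + 1e-8  # unit-normalize
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    sims = F @ q                                          # (N,) cosine scores
    order = np.argsort(-sims)[:top_k]
    return [keys[i] for i in order], sims[order]

# Any modality whose encoder shares the map's embedding space can drive the
# same call, e.g.:
#   hits, scores = query_map(map_feats, embed_text("where did I leave my mug?"))
#   hits, scores = query_map(map_feats, embed_image(photo_of_mug))
```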
Related papers
- OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation [30.76201018651464]
Traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a model for a single task with supervision.
We propose OpenOcc, a novel framework unifying 3D scene reconstruction and open-vocabulary understanding with neural radiance fields.
We show that our approach achieves competitive performance in 3D scene understanding tasks, especially for small and long-tail objects.
arXiv Detail & Related papers (2024-03-18T13:53:48Z)
- Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation [13.770613689032503]
Open-Fusion is an approach for real-time open-vocabulary 3D mapping and queryable scene representation.
It harnesses a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension.
It delivers annotation-free open-vocabulary 3D segmentation without requiring additional 3D training.
arXiv Detail & Related papers (2023-10-05T21:57:36Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output into 3D through multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing an object in a 3D scene referred to by a natural-language description.
We propose a dense 3D grounding network featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
Vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision.
arXiv Detail & Related papers (2023-03-08T17:30:58Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
- Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes [50.317223783035075]
We present a new framework to reconstruct holistic 3D indoor scenes from single-view images.
We propose an instance-aligned implicit function (InstPIFu) for detailed object reconstruction.
Our code and model will be made publicly available.
arXiv Detail & Related papers (2022-07-18T14:54:57Z)
- Disentangling 3D Prototypical Networks For Few-Shot Concept Learning [29.02523358573336]
We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene.
Our networks incorporate architectural biases that reflect the image-formation process, the 3D geometry of the world scene, and the shape-style interplay.
arXiv Detail & Related papers (2020-11-06T14:08:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.