Semantic Abstraction: Open-World 3D Scene Understanding from 2D
Vision-Language Models
- URL: http://arxiv.org/abs/2207.11514v1
- Date: Sat, 23 Jul 2022 13:10:25 GMT
- Title: Semantic Abstraction: Open-World 3D Scene Understanding from 2D
Vision-Language Models
- Authors: Huy Ha, Shuran Song
- Abstract summary: We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs.
We propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models with new 3D spatial capabilities.
We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks.
- Score: 17.606199768716532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study open-world 3D scene understanding, a family of tasks that require
agents to reason about their 3D environment with an open-set vocabulary and
out-of-domain visual inputs - a critical skill for robots to operate in the
unstructured 3D world. Towards this end, we propose Semantic Abstraction
(SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D
spatial capabilities, while maintaining their zero-shot robustness. We achieve
this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial
and geometric reasoning skills on top of those abstractions in a
semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two
open-world 3D scene understanding tasks: 1) completing partially observed
objects and 2) localizing hidden objects from language descriptions.
Experiments show that SemAbs can generalize to novel vocabulary,
materials/lighting, classes, and domains (i.e., real-world scans) from training
on limited 3D synthetic data. Code and data will be available at
https://semantic-abstraction.cs.columbia.edu/
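The abstract's central mechanism, a CLIP relevancy map that abstracts semantics away before any 3D reasoning, can be illustrated with a short sketch. The code below is not the paper's multi-scale relevancy extractor; it is a minimal stand-in that scores non-overlapping image crops against a text prompt with an off-the-shelf open_clip model and assembles the similarities into a coarse 2D heatmap. The checkpoint name, grid size, image path, and prompt are illustrative assumptions.

```python
# Coarse CLIP relevancy map: score image crops against a text prompt.
# A sketch only; the paper's extractor is multi-scale and more involved.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")  # assumed checkpoint
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def relevancy_map(image: Image.Image, prompt: str, grid: int = 7) -> torch.Tensor:
    """Return a (grid x grid) map of crop-to-text cosine similarities."""
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer([prompt]))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        w, h = image.size
        cw, ch = w // grid, h // grid
        scores = torch.zeros(grid, grid)
        for i in range(grid):        # rows
            for j in range(grid):    # columns
                crop = image.crop((j * cw, i * ch, (j + 1) * cw, (i + 1) * ch))
                img_feat = model.encode_image(preprocess(crop).unsqueeze(0))
                img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
                scores[i, j] = (img_feat @ text_feat.T).item()
    return scores

# Hypothetical usage: "scene.png" and the prompt are placeholders.
heatmap = relevancy_map(Image.open("scene.png").convert("RGB"), "a photo of a red mug")
```

In SemAbs, per-label relevancy maps of this kind are the only semantic signal the 3D completion and localization modules receive, which is what allows those modules to be trained in a semantic-agnostic manner.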
Related papers
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models [20.277479473218513]
We introduce a new task: Zero-Shot 3D Reasoning, for part search and localization within objects.
We design a simple baseline method, Reasoning3D, with the capability to understand and execute complex commands.
We show that Reasoning3D can effectively localize and highlight parts of 3D objects based on implicit textual queries.
arXiv Detail & Related papers (2024-05-29T17:56:07Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images.
The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks (a toy query over such a voxel map is sketched after this list).
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing an object in a 3D scene referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-07-27T17:59:14Z)
- Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation [44.58709274218105]
This work bridges the 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models.
We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn a transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
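As referenced in the POP-3D entry above, a dense voxel map of language-aligned embeddings supports open-vocabulary queries by comparing every voxel feature against a text embedding. The sketch below is a toy illustration, not the POP-3D implementation: random tensors stand in for the outputs of a 2D-3D encoder with occupancy and 3D-language heads, and the grid shape, threshold, and open_clip text encoder are assumptions.

```python
# Toy open-vocabulary query over a voxel grid of language-aligned features.
# Shapes, the threshold, and the random "predictions" are assumptions.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Stand-ins for the outputs of an occupancy head and a 3D-language head.
X, Y, Z, D = 32, 32, 16, 512                      # D matches the ViT-B-32 text dim
voxel_feats = torch.nn.functional.normalize(torch.randn(X, Y, Z, D), dim=-1)
occupancy = torch.rand(X, Y, Z)                   # per-voxel occupancy probability

def query(prompt: str, occ_thresh: float = 0.5) -> torch.Tensor:
    """Per-voxel relevancy of `prompt`, masked to occupied voxels."""
    with torch.no_grad():
        text = model.encode_text(tokenizer([prompt]))
        text = torch.nn.functional.normalize(text, dim=-1)[0]   # (D,)
    sim = torch.einsum("xyzd,d->xyz", voxel_feats, text)        # cosine similarity
    return sim * (occupancy > occ_thresh)

heat = query("a wooden chair")            # (32, 32, 16) relevancy volume
top_flat = heat.flatten().argmax()        # flat index of the most relevant voxel
```

Because both the voxel features and the text embedding are L2-normalized, the score is a cosine similarity, so querying a new word only requires encoding new text rather than retraining the 3D network.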