3D Concept Grounding on Neural Fields
- URL: http://arxiv.org/abs/2207.06403v1
- Date: Wed, 13 Jul 2022 17:59:33 GMT
- Title: 3D Concept Grounding on Neural Fields
- Authors: Yining Hong, Yilun Du, Chunru Lin, Joshua B. Tenenbaum, Chuang Gan
- Abstract summary: Existing visual reasoning approaches typically utilize supervised methods to extract 2D segmentation masks on which concepts are grounded.
Humans are capable of grounding concepts on the underlying 3D representation of images.
We propose to leverage the continuous, differentiable nature of neural fields to segment and learn concepts.
- Score: 99.33215488324238
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this paper, we address the challenging problem of 3D concept grounding
(i.e., segmenting and learning visual concepts) by looking at RGBD images and
reasoning about paired questions and answers. Existing visual reasoning
approaches typically utilize supervised methods to extract 2D segmentation
masks on which concepts are grounded. In contrast, humans are capable of
grounding concepts on the underlying 3D representation of images. However,
traditionally inferred 3D representations (e.g., point clouds, voxel grids, and
meshes) cannot capture continuous 3D features flexibly, thus making it
challenging to ground concepts to 3D regions based on the language description
of the object being referred to. To address both issues, we propose to leverage
the continuous, differentiable nature of neural fields to segment and learn
concepts. Specifically, each 3D coordinate in a scene is represented as a
high-dimensional descriptor. Concept grounding can then be performed by
computing the similarity between the descriptor vector of a 3D coordinate and
the vector embedding of a language concept, which enables segmentation and
concept learning to be performed jointly on neural fields in a differentiable
fashion. As a result, both 3D semantic and instance segmentations can emerge
directly from question answering supervision using a set of defined neural
operators on top of neural fields (e.g., filtering and counting). Experimental
results show that our proposed framework outperforms
unsupervised/language-mediated segmentation models on semantic and instance
segmentation tasks, and also outperforms existing models on challenging
3D-aware visual reasoning tasks. Furthermore, our framework can generalize well
to unseen shape categories and real scans.
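The similarity-based grounding described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `concept_mask` and `soft_count`, the `threshold` and `temperature` parameters, the use of cosine similarity with a sigmoid, the max-per-instance counting rule, and the toy 2-dimensional descriptors are all assumptions made for illustration. It only shows how a filtering operator and a counting operator could be built from per-point descriptors and a concept embedding.

```python
import numpy as np

def concept_mask(descriptors, concept, threshold=0.5, temperature=0.1):
    """Soft mask in (0, 1) per point from cosine similarity to a concept embedding.

    descriptors: (N, D) array, one descriptor per 3D coordinate.
    concept: (D,) array, the embedding of a language concept.
    threshold, temperature: hypothetical hyperparameters for this sketch.
    """
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    c = concept / np.linalg.norm(concept)
    similarity = d @ c
    # Sigmoid keeps the "filter" operator smooth instead of a hard cut-off.
    return 1.0 / (1.0 + np.exp(-(similarity - threshold) / temperature))

def soft_count(mask, instance_ids):
    """Approximate count: sum over instances of the highest mask value in each."""
    return sum(mask[instance_ids == i].max() for i in np.unique(instance_ids))

# Toy scene (assumed data): four points, two resembling "leg", two resembling "seat".
descriptors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
leg = np.array([1.0, 0.0])
instance_ids = np.array([0, 1, 2, 2])  # hypothetical instance assignment per point

mask = concept_mask(descriptors, leg)
print(mask > 0.5)                       # which points are grounded to "leg"
print(soft_count(mask, instance_ids))   # soft estimate of the number of "leg" instances
```

Because every step is a smooth function of the descriptors and the concept embedding, the same computation written in an automatic-differentiation framework would allow a question-answering loss to propagate gradients back to both.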
Related papers
- Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We bridge the research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes.
The task allows producing 3D segmentation masks and detailed textual explanations as enriched by 3D spatial relations among objects.
In addition, we design MORE3D, a simple yet effective method that enables multi-object 3D reasoning segmentation with user questions and textual outputs.
arXiv Detail & Related papers (2024-11-21T08:22:45Z)
- Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models [20.277479473218513]
We introduce a new task: Zero-Shot 3D Reasoning for parts searching and localization for objects.
We design a simple baseline method, Reasoning3D, with the capability to understand and execute complex commands.
We show that Reasoning3D can effectively localize and highlight parts of 3D objects based on implicit textual queries.
arXiv Detail & Related papers (2024-05-29T17:56:07Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a description in natural language.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- 3D Concept Learning and Reasoning from Multi-View Images [96.3088005719963]
We introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA).
This dataset consists of approximately 5k scenes, 600k images, paired with 50k questions.
We propose a novel 3D concept learning and reasoning framework that seamlessly combines neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
arXiv Detail & Related papers (2023-03-20T17:59:49Z)
- Graphics Capsule: Learning Hierarchical 3D Face Representations from 2D Images [82.5266467869448]
We propose an Inverse Graphics Capsule Network (IGC-Net) to learn the hierarchical 3D face representations from large-scale unlabeled images.
IGC-Net first decomposes the objects into a set of semantic-consistent part-level descriptions and then assembles them into object-level descriptions to build the hierarchy.
arXiv Detail & Related papers (2023-03-20T06:32:55Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Learning to Reconstruct and Segment 3D Objects [4.709764624933227]
We aim to understand scenes and the objects within them by learning general and robust representations using deep neural networks.
This thesis makes three core contributions, ranging from object-level 3D shape estimation from single or multiple views to scene-level semantic understanding.
arXiv Detail & Related papers (2020-10-19T15:09:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.