3D Concept Learning and Reasoning from Multi-View Images
- URL: http://arxiv.org/abs/2303.11327v1
- Date: Mon, 20 Mar 2023 17:59:49 GMT
- Title: 3D Concept Learning and Reasoning from Multi-View Images
- Authors: Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B. Tenenbaum,
Chuang Gan
- Abstract summary: We introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA)
This dataset consists of approximately 5k scenes and 600k images, paired with 50k questions.
We propose a novel 3D concept learning and reasoning framework that seamlessly combines neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
- Score: 96.3088005719963
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Humans are able to accurately reason in 3D by gathering multi-view
observations of the surrounding world. Inspired by this insight, we introduce a
new large-scale benchmark for 3D multi-view visual question answering
(3DMV-VQA). This dataset is collected by an embodied agent actively moving and
capturing RGB images in an environment using the Habitat simulator. In total,
it consists of approximately 5k scenes and 600k images, paired with 50k questions.
We evaluate various state-of-the-art models for visual reasoning on our
benchmark and find that they all perform poorly. We suggest that a principled
approach for 3D reasoning from multi-view images should be to infer a compact
3D representation of the world from the multi-view images, which is further
grounded on open-vocabulary semantic concepts, and then to execute reasoning on
these 3D representations. As the first step towards this approach, we propose a
novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly
combines these components via neural fields, 2D pre-trained vision-language
models, and neural reasoning operators. Experimental results suggest that our
framework outperforms baseline models by a large margin, but the challenge
remains largely unsolved. We further perform an in-depth analysis of the
challenges and highlight potential future directions.
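The abstract sketches a three-stage pipeline: reconstruct a compact 3D representation from the multi-view images with neural fields, ground open-vocabulary concepts on that representation via a 2D pre-trained vision-language model, and answer questions by executing neural reasoning operators over the grounded representation. The toy Python sketch below only illustrates that control flow; it is not the authors' 3D-CLR implementation, and every function name, threshold, and the random stand-in data are hypothetical.

```python
import numpy as np

def reconstruct_point_features(multi_view_images):
    """Stand-in for the neural-field stage: 3D-CLR would optimize a compact
    3D representation from posed multi-view RGB images. Here we just return
    toy 3D points with random per-point feature vectors."""
    rng = np.random.default_rng(0)
    points = rng.uniform(-1.0, 1.0, size=(500, 3))   # xyz coordinates
    features = rng.normal(size=(500, 64))            # per-point features
    return points, features

def ground_concepts(features, vocabulary):
    """Stand-in for open-vocabulary grounding: the paper lifts 2D
    vision-language features into the 3D field and compares them with text
    embeddings. Here, random 'text' embeddings are scored against the
    per-point features and thresholded into per-concept masks."""
    rng = np.random.default_rng(1)
    masks = {}
    for concept in vocabulary:
        text_emb = rng.normal(size=features.shape[1])
        sims = features @ text_emb / (
            np.linalg.norm(features, axis=1) * np.linalg.norm(text_emb) + 1e-8)
        masks[concept] = sims > sims.mean() + sims.std()   # crude threshold
    return masks

# Toy "neural reasoning operators": learned modules in the paper, replaced by
# simple set and geometry operations over the grounded 3D masks here.
def op_filter(masks, concept):
    return masks[concept]

def op_count(mask, points, merge_radius=0.3):
    """Count clusters of selected points as a stand-in for instance counting."""
    clusters = []
    for p in points[mask]:
        for cluster in clusters:
            if np.linalg.norm(p - np.mean(cluster, axis=0)) < merge_radius:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return len(clusters)

if __name__ == "__main__":
    images = None  # posed multi-view RGB images would go here
    points, feats = reconstruct_point_features(images)
    masks = ground_concepts(feats, vocabulary=["chair", "table"])
    # "How many chairs are there?"  ->  filter('chair')  ->  count
    print(op_count(op_filter(masks, "chair"), points))
```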
Related papers
- VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding [47.58359136198136]
VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models.
It seamlessly integrates various SOTA vision models and automates the selection of the appropriate model for each task.
It identifies suitable 3D mesh creation algorithms corresponding to 2D depth map analysis and generates optimal results based on diverse multimodal inputs.
arXiv Detail & Related papers (2024-03-14T16:13:00Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose Multi-CLIP, a novel 3D vision-language pre-training method that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- RoSI: Recovering 3D Shape Interiors from Few Articulation Images [20.430308190444737]
We present a learning framework that recovers the shape interiors of existing 3D models, given only their exteriors, from multi-view and multi-articulation images.
Our neural architecture is trained in a category-agnostic manner and consists of a motion-aware multi-view analysis phase.
In addition, our method predicts part articulations and is able to realize and even extrapolate the captured motions on the target 3D object.
arXiv Detail & Related papers (2023-04-13T08:45:26Z)
- 3D Concept Grounding on Neural Fields [99.33215488324238]
Existing visual reasoning approaches typically utilize supervised methods to extract 2D segmentation masks on which concepts are grounded.
In contrast, humans are capable of grounding concepts on the underlying 3D representation of images.
We propose to leverage the continuous, differentiable nature of neural fields to segment and learn concepts.
arXiv Detail & Related papers (2022-07-13T17:59:33Z)
- Learning Ego 3D Representation as Ray Tracing [42.400505280851114]
We present a novel end-to-end architecture for ego 3D representation learning from unconstrained camera views.
Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation.
We show that our model outperforms all state-of-the-art alternatives significantly.
arXiv Detail & Related papers (2022-06-08T17:55:50Z)
- DRaCoN -- Differentiable Rasterization Conditioned Neural Radiance Fields for Articulated Avatars [92.37436369781692]
We present DRaCoN, a framework for learning full-body volumetric avatars.
It exploits the advantages of both 2D and 3D neural rendering techniques.
Experiments on the challenging ZJU-MoCap and Human3.6M datasets indicate that DRaCoN outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-03-29T17:59:15Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.