3D-Aware Visual Question Answering about Parts, Poses and Occlusions
- URL: http://arxiv.org/abs/2310.17914v1
- Date: Fri, 27 Oct 2023 06:15:30 GMT
- Title: 3D-Aware Visual Question Answering about Parts, Poses and Occlusions
- Authors: Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski, Alan Yuille
- Abstract summary: We introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes.
We propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition.
Our experimental results show that PO3D-VQA significantly outperforms existing methods, yet a considerable performance gap remains compared to 2D VQA benchmarks.
- Score: 20.83938624671415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid progress in visual question answering (VQA), existing datasets
and models mainly focus on testing reasoning in 2D. However, it is important
that VQA models also understand the 3D structure of visual scenes, for example
to support tasks like navigation or manipulation. This includes an
understanding of objects' 3D poses, their parts, and occlusions. In this work,
we introduce the task of 3D-aware VQA, which focuses on challenging questions
that require compositional reasoning over the 3D structure of visual scenes.
We address 3D-aware VQA from both the dataset and the model perspective. First,
we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains
questions about object parts, their 3D poses, and occlusions. Second, we
propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas:
probabilistic neural symbolic program execution for reasoning and deep neural
networks with 3D generative representations of objects for robust visual
recognition. Our experimental results show that PO3D-VQA significantly
outperforms existing methods, but a considerable performance gap persists
compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an
important open research area.
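To make the neuro-symbolic idea above more concrete, the following is a minimal, self-contained sketch of probabilistic symbolic program execution over a 3D-aware scene representation (object categories, 3D poses, part visibilities). Executing a question such as "Which direction is the bus facing, and how many of its parts are occluded?" then amounts to chaining soft operators so that uncertainty from the vision module propagates into the answer. All data structures, operator names, and thresholds below are illustrative assumptions, not PO3D-VQA's actual interface or implementation.

```python
# Toy sketch of probabilistic symbolic program execution over a 3D-aware
# scene representation. Hypothetical structures for illustration only;
# not PO3D-VQA's actual code.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SceneObject:
    category: str
    prob: float                # detection confidence from the vision module
    azimuth_deg: float         # estimated 3D pose (azimuth only, for brevity)
    parts: Dict[str, float] = field(default_factory=dict)  # part -> visibility prob


def filter_category(scene: List[SceneObject], attention: List[float],
                    category: str) -> List[float]:
    """Soft-select objects of a category; attention stays probabilistic."""
    return [a * (o.prob if o.category == category else 0.0)
            for a, o in zip(attention, scene)]


def query_pose(scene: List[SceneObject], attention: List[float]) -> str:
    """Answer a pose question with the expected azimuth under the attention."""
    total = sum(attention)
    if total == 0.0:
        return "unknown"
    azimuth = sum(a * o.azimuth_deg for a, o in zip(attention, scene)) / total
    return "facing left" if 90.0 <= azimuth < 270.0 else "facing right"


def count_occluded_parts(scene: List[SceneObject], attention: List[float],
                         threshold: float = 0.5) -> float:
    """Expected number of occluded parts (visibility below threshold)."""
    return sum(a * sum(1 for v in o.parts.values() if v < threshold)
               for a, o in zip(attention, scene))


if __name__ == "__main__":
    scene = [
        SceneObject("bus", 0.95, azimuth_deg=180.0,
                    parts={"front wheel": 0.9, "back wheel": 0.2, "door": 0.8}),
        SceneObject("car", 0.88, azimuth_deg=20.0,
                    parts={"front wheel": 0.95, "back wheel": 0.9}),
    ]
    attn = [1.0] * len(scene)                   # attend to every detected object
    attn = filter_category(scene, attn, "bus")  # program step: filter("bus")
    print(query_pose(scene, attn))              # program step: query_pose()
    print(count_occluded_parts(scene, attn))    # program step: count_occluded()
```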
Related papers
- Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes [65.22070581594426]
"Implicit-Zoo" is a large-scale dataset requiring thousands of GPU training days to facilitate research and development in this field.
We showcase two immediate benefits as it enables to: (1) learn token locations for transformer models; (2) directly regress 3D cameras poses of 2D images with respect to NeRF models.
This in turn leads to an improved performance in all three task of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
arXiv Detail & Related papers (2024-06-25T10:20:44Z)
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
- Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA [6.697298321551588]
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and the limited diversity of visual content hamper generalization to novel scenes and 3D concepts.
We propose a question-conditional 2D view selection procedure that pinpoints semantically relevant 2D inputs for crucial visual clues.
We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure.
arXiv Detail & Related papers (2024-02-24T23:31:34Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries [68.75400888770793]
We formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos.
Specifically, our approach achieves an overall success rate of up to 87.12%, which sets a new state-of-the-art result in the VQ3D task.
arXiv Detail & Related papers (2022-12-14T01:28:12Z)
- SQA3D: Situated Question Answering in 3D Scenes [86.0205305318308]
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).
Given a scene context, SQA3D requires the tested agent to first understand its situation in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation.
Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations.
arXiv Detail & Related papers (2022-10-14T02:52:26Z)
- Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline [35.717047755880536]
The 3D visual question answering (VQA) task is less explored and is more susceptible to language priors and co-reference ambiguity.
We collect a new 3D VQA dataset with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations.
We propose a new 3D VQA framework to effectively predict the completely visually grounded and explainable answer.
arXiv Detail & Related papers (2022-09-24T15:09:02Z)
- Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds.
We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
arXiv Detail & Related papers (2022-03-25T00:20:31Z)
- 3D Question Answering [22.203927159777123]
We present the first attempt at extending Visual Question Answering (VQA) to the 3D domain.
We propose a novel transformer-based 3DQA framework, "3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information.
To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset, "ScanQA".
arXiv Detail & Related papers (2021-12-15T18:59:59Z)
- CoCoNets: Continuous Contrastive 3D Scene Representations [21.906643302668716]
This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos.
We show the resulting 3D visual feature representations effectively scale across objects and scenes, imagine information occluded or missing from the input viewpoints, track objects over time, align semantically related objects in 3D, and improve 3D object detection.
arXiv Detail & Related papers (2021-04-08T15:50:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.