3D Question Answering
- URL: http://arxiv.org/abs/2112.08359v1
- Date: Wed, 15 Dec 2021 18:59:59 GMT
- Title: 3D Question Answering
- Authors: Shuquan Ye and Dongdong Chen and Songfang Han and Jing Liao
- Abstract summary: We present the first attempt at extending Visual Question Answering (VQA) to the 3D domain.
We propose a novel transformer-based 3DQA framework, "3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information.
To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset, "ScanQA".
- Score: 22.203927159777123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) has witnessed tremendous progress in recent
years. However, most efforts only focus on the 2D image question answering
tasks. In this paper, we present the first attempt at extending VQA to the 3D
domain, which can facilitate artificial intelligence's perception of 3D
real-world scenarios. Different from image based VQA, 3D Question Answering
(3DQA) takes the color point cloud as input and requires both appearance and 3D
geometry comprehension ability to answer the 3D-related questions. To this end,
we propose a novel transformer-based 3DQA framework, "3DQA-TR", which
consists of two encoders for exploiting the appearance and geometry
information, respectively. The multi-modal information of appearance, geometry,
and the linguistic question can finally attend to each other via a
3D-Linguistic Bert to predict the target answers. To verify the effectiveness
of our proposed 3DQA framework, we further develop the first 3DQA dataset
"ScanQA", which builds on the ScanNet dataset and contains ~6K
questions, ~30K answers for 806 scenes. Extensive experiments on this
dataset demonstrate the obvious superiority of our proposed 3DQA framework over
existing VQA frameworks, and the effectiveness of our major designs. Our code
and dataset will be made publicly available to facilitate the research in this
direction.
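
The abstract describes 3DQA-TR only at a high level: an appearance encoder, a geometry encoder, and a 3D-Linguistic BERT in which appearance, geometry, and the question attend to each other before answer prediction. The following is a minimal, hypothetical PyTorch sketch of such a two-encoder fusion design; the per-object feature inputs, all dimensions, and the use of a generic transformer encoder in place of the paper's 3D-Linguistic BERT are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a two-encoder 3DQA model in the spirit of
# 3DQA-TR: an appearance encoder, a geometry encoder, and a joint transformer
# over appearance, geometry, and question tokens. All choices are illustrative.
import torch
import torch.nn as nn


class TwoEncoder3DQA(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=1000, d_model=256):
        super().__init__()
        # Appearance encoder: per-object appearance/RGB features -> tokens.
        self.appearance_encoder = nn.Sequential(
            nn.Linear(128, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # Geometry encoder: per-object 3D geometry features (e.g. center, size) -> tokens.
        self.geometry_encoder = nn.Sequential(
            nn.Linear(6, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # Question tokens -> embeddings (stand-in for a pretrained language model).
        self.word_embedding = nn.Embedding(vocab_size, d_model)
        # Joint transformer lets the three modalities attend to each other
        # (stand-in for the paper's 3D-Linguistic BERT).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.answer_head = nn.Linear(d_model, num_answers)

    def forward(self, appearance_feats, geometry_feats, question_ids):
        # appearance_feats: (B, N_obj, 128), geometry_feats: (B, N_obj, 6),
        # question_ids: (B, L) token ids.
        app_tok = self.appearance_encoder(appearance_feats)
        geo_tok = self.geometry_encoder(geometry_feats)
        q_tok = self.word_embedding(question_ids)
        tokens = torch.cat([app_tok, geo_tok, q_tok], dim=1)
        fused = self.fusion(tokens)
        # Pool the fused sequence and classify over a fixed answer vocabulary.
        return self.answer_head(fused.mean(dim=1))


# Example usage with random tensors.
model = TwoEncoder3DQA()
logits = model(torch.randn(2, 32, 128), torch.randn(2, 32, 6),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```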
Related papers
- Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA [6.697298321551588]
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts.
We propose a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues.
We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure.
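
The question-conditional 2D view selection is only named in this summary, not specified. One plausible, minimal reading (an assumption, not the authors' procedure) is to rank candidate 2D views of the scene by the similarity between a question embedding and per-view image embeddings and keep the top-k; the sketch below assumes precomputed embeddings and an arbitrary `top_k`.

```python
# Hypothetical sketch of question-conditional 2D view selection: score each
# candidate view against the question and keep the top-k most relevant ones.
# The similarity measure and embedding sources are illustrative assumptions.
import torch
import torch.nn.functional as F


def select_views(question_emb: torch.Tensor,
                 view_embs: torch.Tensor,
                 top_k: int = 4) -> torch.Tensor:
    """question_emb: (D,) text embedding; view_embs: (V, D), one embedding per
    candidate 2D view. Returns indices of the top_k most relevant views."""
    # Cosine similarity between the question and every candidate view.
    sims = F.cosine_similarity(question_emb.unsqueeze(0), view_embs, dim=-1)
    return sims.topk(k=min(top_k, view_embs.shape[0])).indices


# Example with random embeddings standing in for real text/image encoders.
q = torch.randn(512)
views = torch.randn(20, 512)
print(select_views(q, views))  # indices of the 4 highest-scoring views
```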
arXiv Detail & Related papers (2024-02-24T23:31:34Z)
- 3D-Aware Visual Question Answering about Parts, Poses and Occlusions [20.83938624671415]
We introduce the task of 3D-aware VQA, which focuses on challenging questions that require a compositional reasoning over the 3D structure of visual scenes.
We propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition.
Our experimental results show our model PO3D-VQA outperforms existing methods significantly, but we still observe a significant performance gap compared to 2D VQA benchmarks.
arXiv Detail & Related papers (2023-10-27T06:15:30Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
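
Multi-CLIP is summarized here only as contrastive vision-language pre-training for 3D scene representations. As an illustration of that family of objectives (not the paper's actual loss), the sketch below aligns paired 3D scene embeddings and text embeddings with a symmetric, CLIP-style InfoNCE loss; the temperature and embedding sizes are arbitrary assumptions.

```python
# Generic CLIP-style contrastive objective aligning 3D scene embeddings with
# text embeddings; shown only as an illustration of this family of losses.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(scene_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """scene_emb, text_emb: (B, D) paired embeddings (row i matches row i)."""
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(scene_emb.shape[0])        # matching pairs on the diagonal
    # Symmetric cross-entropy: scene-to-text and text-to-scene directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example usage with random embeddings.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```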
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
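
The summary above says questions are produced from scene graphs with manually designed templates. The toy sketch below illustrates that general recipe; the graph schema, object categories, relation names, and the template string are all invented for the example and are not taken from NuScenes-QA.

```python
# Toy illustration of template-based question generation from a scene graph.
# The graph schema and the template are invented for this example only.
from itertools import product

# A tiny scene graph: objects with categories, plus spatial relations.
objects = {0: "car", 1: "pedestrian", 2: "traffic cone"}
relations = [(0, "to the left of", 1), (2, "behind", 0)]

TEMPLATE = "Is there a {subj} {rel} the {obj}?"


def generate_questions(objects, relations):
    """Yield (question, answer) pairs: 'yes' for relations present in the
    graph, 'no' for object pairs that lack that relation."""
    present = {(s, r, o) for s, r, o in relations}
    rel_types = {r for _, r, _ in relations}
    for (s, o), r in product(product(objects, objects), rel_types):
        if s == o:
            continue
        question = TEMPLATE.format(subj=objects[s], rel=r, obj=objects[o])
        answer = "yes" if (s, r, o) in present else "no"
        yield question, answer


for q, a in list(generate_questions(objects, relations))[:4]:
    print(q, "->", a)
```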
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries [68.75400888770793]
We formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos.
Specifically, our approach achieves an overall success rate of up to 87.12%, which sets a new state-of-the-art result in the VQ3D task.
arXiv Detail & Related papers (2022-12-14T01:28:12Z)
- SQA3D: Situated Question Answering in 3D Scenes [86.0205305318308]
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).
Given a scene context, SQA3D requires the tested agent to first understand its situation in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation.
Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations.
arXiv Detail & Related papers (2022-10-14T02:52:26Z)
- Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline [35.717047755880536]
The 3D visual question answering (VQA) task is less explored and is more susceptible to language priors and co-reference ambiguity.
We collect a new 3D VQA dataset with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations.
We propose a new 3D VQA framework that effectively predicts completely visually grounded and explainable answers.
arXiv Detail & Related papers (2022-09-24T15:09:02Z)
- Towards 3D VR-Sketch to 3D Shape Retrieval [128.47604316459905]
We study the use of 3D sketches as an input modality and advocate a VR scenario in which retrieval is conducted.
As a first stab at this new 3D VR-sketch to 3D shape retrieval problem, we make four contributions.
arXiv Detail & Related papers (2022-09-20T22:04:31Z)
- Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation [33.91844305449863]
We propose CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes.
We develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions.
A more challenging setup is proposed to remove the confounding bias and adjust the context from a common-sense layout.
arXiv Detail & Related papers (2021-12-22T06:43:21Z)
- ScanQA: 3D Question Answering for Spatial Scene Understanding [7.136295795446433]
We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA).
In the 3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D indoor scan and answer the given textual questions about the 3D scene.
Our new ScanQA dataset contains over 41K question-answer pairs from the 800 indoor scenes drawn from the ScanNet dataset.
arXiv Detail & Related papers (2021-12-20T12:30:55Z)
- FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to adapt a general 2D detector to work on this 3D task.
In this technical report, we study this problem with a practice built on a fully convolutional single-stage detector.
Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
arXiv Detail & Related papers (2021-04-22T09:35:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.