ScanQA: 3D Question Answering for Spatial Scene Understanding
- URL: http://arxiv.org/abs/2112.10482v1
- Date: Mon, 20 Dec 2021 12:30:55 GMT
- Title: ScanQA: 3D Question Answering for Spatial Scene Understanding
- Authors: Daichi Azuma, Taiki Miyanishi, Shuhei Kurita and Motoki Kawanabe
- Abstract summary: We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA)
In the 3D-QA task, models receive visual information from an entire 3D scene, given as a rich RGB-D indoor scan, and answer textual questions about that scene.
Our new ScanQA dataset contains over 41K question-answer pairs from 800 indoor scenes drawn from the ScanNet dataset.
- Score: 7.136295795446433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new 3D spatial understanding task of 3D Question Answering
(3D-QA). In the 3D-QA task, models receive visual information from an entire
3D scene, given as a rich RGB-D indoor scan, and answer textual questions
about that scene. Unlike 2D visual question answering (VQA), conventional
2D-QA models struggle with the spatial understanding of object alignment and
directions, and fail to localize the objects referred to by textual questions
in 3D-QA. We propose a baseline model for 3D-QA, the ScanQA model, which
learns a fused descriptor from 3D object proposals and encoded sentence
embeddings. This learned descriptor correlates the language expressions with
the underlying geometric features of the 3D scan and facilitates the
regression of 3D bounding boxes to localize the objects described in textual
questions. We collected human-edited question-answer pairs with
free-form answers that are grounded to 3D objects in each 3D scene. Our new
ScanQA dataset contains over 41K question-answer pairs from 800 indoor
scenes drawn from the ScanNet dataset. To the best of our knowledge, ScanQA is
the first large-scale effort to perform object-grounded question-answering in
3D environments.
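The abstract describes the ScanQA baseline only at a high level, so below is a minimal PyTorch sketch of that idea: per-proposal 3D features are fused with an encoded question, and the fused descriptor drives both an answer classifier and a 3D box regressor. Every module, dimension, and head here (the GRU question encoder, the attention gate, the 6-value box parameterization) is an illustrative assumption, not the authors' implementation.
```python
# Minimal sketch (not the authors' code): fuse 3D object-proposal features
# with a question embedding, then predict an answer class and regress a
# 3D bounding box for the object the question refers to.
import torch
import torch.nn as nn


class ScanQABaselineSketch(nn.Module):
    def __init__(self, prop_dim=256, lang_dim=256, hidden=256, num_answers=1000):
        super().__init__()
        # Question encoder: word embeddings -> GRU sentence representation.
        self.word_embed = nn.Embedding(10000, lang_dim)           # vocab size assumed
        self.gru = nn.GRU(lang_dim, hidden, batch_first=True)
        # Fusion: each 3D proposal is combined with the question via a small MLP.
        self.fuse = nn.Sequential(
            nn.Linear(prop_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.attn = nn.Linear(hidden, 1)                           # proposal relevance score
        # Output heads: free-form answer classification and 3D box regression.
        self.answer_head = nn.Linear(hidden, num_answers)
        self.box_head = nn.Linear(hidden, 6)                       # center (3) + size (3)

    def forward(self, proposal_feats, question_tokens):
        # proposal_feats: (B, K, prop_dim) features of K 3D object proposals
        # question_tokens: (B, T) token ids of the question
        _, q = self.gru(self.word_embed(question_tokens))          # q: (1, B, hidden)
        q = q.squeeze(0).unsqueeze(1).expand(-1, proposal_feats.size(1), -1)
        fused = self.fuse(torch.cat([proposal_feats, q], dim=-1))  # (B, K, hidden)
        weights = torch.softmax(self.attn(fused), dim=1)           # which proposal answers the question
        descriptor = (weights * fused).sum(dim=1)                  # (B, hidden) fused descriptor
        return self.answer_head(descriptor), self.box_head(descriptor), weights


# Toy usage with random inputs.
model = ScanQABaselineSketch()
answers, boxes, attn = model(torch.randn(2, 32, 256), torch.randint(0, 10000, (2, 12)))
print(answers.shape, boxes.shape)  # torch.Size([2, 1000]) torch.Size([2, 6])
```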
Related papers
- General Geometry-aware Weakly Supervised 3D Object Detection [62.26729317523975]
A unified framework is developed for learning 3D object detectors from RGB images and associated 2D boxes.
Experiments on KITTI and SUN-RGBD datasets demonstrate that our method yields surprisingly high-quality 3D bounding boxes with only 2D annotation.
arXiv Detail & Related papers (2024-07-18T17:52:08Z)
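As a rough illustration of how 2D boxes can supervise 3D detection as summarized above, the sketch below projects the corners of a predicted 3D box into the image and penalizes the gap to the annotated 2D box; the pinhole intrinsics, axis-aligned box parameterization, and L1 loss are assumptions, not the paper's method.
```python
# Sketch of the projection-consistency idea behind weakly supervised 3D
# detection from 2D boxes: project the 8 corners of a predicted 3D box into
# the image and compare the enclosing rectangle with the annotated 2D box.
import torch


def box_corners(center, size):
    # center, size: (3,) axis-aligned 3D box in camera coordinates (meters).
    offsets = torch.tensor([[x, y, z] for x in (-.5, .5)
                            for y in (-.5, .5) for z in (-.5, .5)])
    return center + offsets * size                       # (8, 3)


def project(points, K):
    # Pinhole projection of camera-frame points with intrinsics K (3x3).
    uvw = points @ K.T                                    # (8, 3)
    return uvw[:, :2] / uvw[:, 2:3]                       # (8, 2) pixel coordinates


def projection_loss(center, size, K, gt_box2d):
    # gt_box2d: (4,) annotated 2D box as (x_min, y_min, x_max, y_max).
    uv = project(box_corners(center, size), K)
    pred = torch.cat([uv.min(dim=0).values, uv.max(dim=0).values])
    return torch.abs(pred - gt_box2d).mean()              # L1 gap to the 2D annotation


# Toy usage: a 3D box 5 m in front of the camera, supervised only by a 2D box.
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
center = torch.tensor([0.0, 0.0, 5.0], requires_grad=True)
size = torch.tensor([1.5, 1.5, 3.0], requires_grad=True)
loss = projection_loss(center, size, K, torch.tensor([250., 180., 390., 300.]))
loss.backward()                                           # gradients flow to the 3D box
print(float(loss))
```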
- Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA [6.697298321551588]
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts.
We propose a question-conditional 2D view selection procedure that pinpoints semantically relevant 2D inputs for crucial visual clues.
We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure.
arXiv Detail & Related papers (2024-02-24T23:31:34Z)
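A minimal sketch of the question-conditional view selection and two-branch fusion summarized above: candidate 2D views are scored against the question embedding, the top-k views are kept, and separate Transformer branches encode the 2D and 3D features before a joint answer head. The bilinear scorer, branch depths, and answer vocabulary are assumptions.
```python
# Sketch (assumed architecture, not the paper's code) of question-conditional
# 2D view selection: score each candidate view against the question, keep the
# top-k views, and fuse their features with a 3D point-feature branch.
import torch
import torch.nn as nn


class ViewSelectFuseSketch(nn.Module):
    def __init__(self, dim=256, top_k=3):
        super().__init__()
        self.top_k = top_k
        self.score = nn.Bilinear(dim, dim, 1)        # question-view relevance
        self.branch_2d = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=1)
        self.branch_3d = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=1)
        self.head = nn.Linear(2 * dim, 1000)         # answer classifier (vocab assumed)

    def forward(self, view_feats, point_feats, q):
        # view_feats: (B, V, dim) per-view 2D features; point_feats: (B, N, dim);
        # q: (B, dim) question embedding.
        scores = self.score(view_feats, q.unsqueeze(1).expand_as(view_feats)).squeeze(-1)
        top = scores.topk(self.top_k, dim=1).indices                       # (B, k)
        picked = torch.gather(view_feats, 1,
                              top.unsqueeze(-1).expand(-1, -1, view_feats.size(-1)))
        f2d = self.branch_2d(picked).mean(dim=1)      # pooled 2D branch
        f3d = self.branch_3d(point_feats).mean(dim=1)
        return self.head(torch.cat([f2d, f3d], dim=-1))


model = ViewSelectFuseSketch()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 64, 256), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 1000])
```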
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
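The object-identifier idea above can be illustrated with a toy prompt builder that tags every detected object with a dedicated identifier token; the tag format, object fields, and prompt layout below are hypothetical, not Chat-Scene's actual interface.
```python
# Sketch of the object-identifier idea: give every detected 3D object a
# dedicated identifier token (e.g. "<OBJ007>") so a language model can refer
# to scene objects by name. The tag format and fields are assumptions.
from dataclasses import dataclass


@dataclass
class SceneObject:
    label: str                  # detector class, e.g. "chair"
    center: tuple               # (x, y, z) in scene coordinates


def build_object_prompt(objects, question):
    """Interleave identifier tokens with object descriptions, then append the question."""
    lines = []
    for i, obj in enumerate(objects):
        ident = f"<OBJ{i:03d}>"                      # object-level identifier token
        x, y, z = obj.center
        lines.append(f"{ident} {obj.label} at ({x:.1f}, {y:.1f}, {z:.1f})")
    return "Scene objects:\n" + "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"


objects = [SceneObject("sofa", (1.2, 0.4, 0.0)), SceneObject("table", (2.0, 1.1, 0.0))]
print(build_object_prompt(objects, "What is next to the sofa?"))
# A grounded reply can then cite identifiers, e.g. "The table <OBJ001> is next to it."
```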
- 3D-Aware Visual Question Answering about Parts, Poses and Occlusions [20.83938624671415]
We introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes.
We propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition.
Our experimental results show that PO3D-VQA outperforms existing methods significantly, but a considerable performance gap remains compared to 2D VQA benchmarks.
arXiv Detail & Related papers (2023-10-27T06:15:30Z)
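To make the program-execution half of such a neuro-symbolic pipeline concrete, here is a toy executor that runs a hand-written filter/count program over a structured scene; the attributes, operations, and scene format are invented for illustration, and the probabilistic 3D recognition stage that would produce this scene is omitted.
```python
# Sketch of the symbolic program-execution half of a neuro-symbolic VQA
# pipeline: a question is parsed into a small program (here written by hand)
# that is executed over a structured scene description.

scene = [
    {"category": "car", "pose": "front", "occluded": False},
    {"category": "car", "pose": "side", "occluded": True},
    {"category": "bus", "pose": "front", "occluded": False},
]


def run_program(program, objects):
    """Execute a list of (op, arg) steps over the object set."""
    result = objects
    for op, arg in program:
        if op == "filter":                        # keep objects matching attribute == value
            key, value = arg
            result = [o for o in result if o[key] == value]
        elif op == "count":
            return len(result)
        elif op == "exists":
            return len(result) > 0
    return result


# "How many cars are seen from the front?" expressed as a three-step program.
program = [("filter", ("category", "car")), ("filter", ("pose", "front")), ("count", None)]
print(run_program(program, scene))  # 1
```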
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
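The alignment objective behind this kind of pre-training can be sketched with a standard CLIP-style symmetric InfoNCE loss between scene and text embeddings; this is the generic formulation, not necessarily Multi-CLIP's exact recipe.
```python
# Generic CLIP-style contrastive objective, as a sketch of aligning 3D scene
# (point cloud) embeddings with paired text embeddings. The symmetric InfoNCE
# loss below is standard; the encoders are stand-ins.
import torch
import torch.nn.functional as F


def clip_style_loss(scene_emb, text_emb, temperature=0.07):
    # scene_emb, text_emb: (B, D) embeddings of matched scene/text pairs.
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(scene_emb.size(0))             # i-th scene matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage with random embeddings standing in for point-cloud / text encoders.
print(float(clip_style_loss(torch.randn(8, 256), torch.randn(8, 256))))
```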
- EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries [68.75400888770793]
We formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos.
Specifically, our approach achieves an overall success rate of up to 87.12%, which sets a new state-of-the-art result in the VQ3D task.
arXiv Detail & Related papers (2022-12-14T01:28:12Z)
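The multiview-geometry step of such a pipeline can be illustrated with textbook linear (DLT) triangulation: given 2D detections of the same object in several frames with known camera matrices, solve for its 3D location. The toy cameras below are assumptions; the 2D object-retrieval stage is not shown.
```python
# Sketch of the multiview-geometry step: triangulate one 3D point from its 2D
# detections in several frames with known camera projection matrices, using
# the standard linear (DLT) formulation.
import torch


def triangulate(proj_mats, points_2d):
    # proj_mats: (N, 3, 4) camera projection matrices; points_2d: (N, 2) pixels.
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])           # each view contributes two
        rows.append(v * P[2] - P[1])           # linear constraints on X
    A = torch.stack(rows)                      # (2N, 4)
    _, _, Vh = torch.linalg.svd(A)
    X = Vh[-1]                                 # null-space vector (homogeneous)
    return X[:3] / X[3]


# Toy usage: two cameras observing the point (0, 0, 4); f=500, principal point (320, 240).
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
P0 = K @ torch.cat([torch.eye(3), torch.zeros(3, 1)], dim=1)                    # camera at origin
P1 = K @ torch.cat([torch.eye(3), torch.tensor([[-1.], [0.], [0.]])], dim=1)    # shifted 1 m
X = torch.tensor([0., 0., 4., 1.])
uv0 = (P0 @ X)[:2] / (P0 @ X)[2]
uv1 = (P1 @ X)[:2] / (P1 @ X)[2]
print(triangulate(torch.stack([P0, P1]), torch.stack([uv0, uv1])))              # ~[0, 0, 4]
```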
- Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth.
Our method represents a 3D scene as a latent vector, from which we can progressively decode to a sequence of objects characterized by their class categories.
Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
arXiv Detail & Related papers (2022-11-25T15:03:32Z)
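A minimal sketch of progressively decoding a scene latent into a sequence of object categories, here with a GRU cell that stops at an end token; the sizes, vocabulary, and stop criterion are assumptions, and the 2D-supervised training of the latent is not shown.
```python
# Sketch of decoding a scene-level latent vector into a sequence of objects:
# a GRU conditioned on the latent emits one object class per step until it
# predicts an "end" token.
import torch
import torch.nn as nn

NUM_CLASSES, END = 20, 0        # class 0 acts as the end-of-scene token


class ScenePriorDecoderSketch(nn.Module):
    def __init__(self, latent_dim=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, hidden)
        self.init_h = nn.Linear(latent_dim, hidden)    # latent -> initial GRU state
        self.gru = nn.GRUCell(hidden, hidden)
        self.cls = nn.Linear(hidden, NUM_CLASSES)

    @torch.no_grad()
    def decode(self, z, max_objects=10):
        h = torch.tanh(self.init_h(z))                 # (1, hidden)
        token = torch.tensor([END])
        objects = []
        for _ in range(max_objects):
            h = self.gru(self.embed(token), h)
            token = self.cls(h).argmax(dim=-1)         # next object class
            if token.item() == END:
                break
            objects.append(token.item())
        return objects


decoder = ScenePriorDecoderSketch()
print(decoder.decode(torch.randn(1, 128)))             # e.g. [7, 13, 2]
```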
- SQA3D: Situated Question Answering in 3D Scenes [86.0205305318308]
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).
Given a scene context, SQA3D requires the tested agent to first understand its situation in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation.
Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations.
arXiv Detail & Related papers (2022-10-14T02:52:26Z)
- Neural Correspondence Field for Object Pose Estimation [67.96767010122633]
We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image.
Unlike classical correspondence-based methods which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum.
arXiv Detail & Related papers (2022-07-30T01:48:23Z)
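The query-point formulation above can be sketched as follows: sample 3D query points in the camera frustum and let an MLP, conditioned on an image feature, regress a 3D object-frame coordinate and a foreground score per query. The frustum sampler, feature conditioning, and output parameterization are simplifications assumed for illustration.
```python
# Sketch of the query-point idea: instead of predicting object coordinates
# per image pixel, sample 3D query points inside the camera frustum and let
# an MLP regress a 3D object-frame coordinate (plus a foreground score) for
# each query. The image-feature conditioning is reduced to a placeholder.
import torch
import torch.nn as nn


def sample_frustum_points(n, fov_deg=60.0, near=0.2, far=2.0):
    # Uniformly sample depths, then pick x/y within the field of view at that depth.
    z = near + (far - near) * torch.rand(n)
    half = torch.tan(torch.deg2rad(torch.tensor(fov_deg / 2)))
    xy = (torch.rand(n, 2) * 2 - 1) * (z.unsqueeze(1) * half)
    return torch.cat([xy, z.unsqueeze(1)], dim=1)           # (n, 3) camera-frame points


class CorrespondenceFieldSketch(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                           # 3D object coords + fg logit

    def forward(self, query_points, image_feat):
        # image_feat: (feat_dim,) global feature standing in for per-query conditioning.
        feat = image_feat.expand(query_points.size(0), -1)
        out = self.mlp(torch.cat([query_points, feat], dim=1))
        return out[:, :3], out[:, 3]                        # coords, foreground score


queries = sample_frustum_points(1024)
coords, fg = CorrespondenceFieldSketch()(queries, torch.randn(64))
print(coords.shape, fg.shape)  # torch.Size([1024, 3]) torch.Size([1024])
```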
- Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation [33.91844305449863]
We propose CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes.
We develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions.
A more challenging setup is proposed to remove the confounding bias and adjust the context from a common-sense layout.
arXiv Detail & Related papers (2021-12-22T06:43:21Z)
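A toy version of a scene-graph-driven question engine like the one summarized above: nodes carry object attributes, edges carry spatial relations, and question-answer pairs are instantiated from simple templates. The schema and templates below are made up; the actual CLEVR3D engine is far richer.
```python
# Sketch of a template-driven question engine over a 3D scene graph: nodes
# hold object attributes, edges hold spatial relations, and questions are
# instantiated from simple templates.
nodes = {
    1: {"category": "chair", "color": "brown"},
    2: {"category": "table", "color": "white"},
    3: {"category": "chair", "color": "black"},
}
edges = [(1, "next to", 2), (3, "behind", 2)]      # (subject, relation, object)


def generate_questions(nodes, edges):
    qa = []
    # Attribute template: ask for the color of a uniquely named category.
    for obj in nodes.values():
        same = [o for o in nodes.values() if o["category"] == obj["category"]]
        if len(same) == 1:
            qa.append((f"What color is the {obj['category']}?", obj["color"]))
    # Relation template: count objects in a given relation to an anchor.
    for _, rel, tgt in edges:
        count = sum(1 for s, r, t in edges if r == rel and t == tgt)
        qa.append((f"How many things are {rel} the {nodes[tgt]['category']}?", str(count)))
    return qa


for question, answer in generate_questions(nodes, edges):
    print(question, "->", answer)
```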
- 3D Question Answering [22.203927159777123]
We present the first attempt at extending Visual Question Answering (VQA) to the 3D domain.
We propose a novel transformer-based 3DQA framework "3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information.
To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset "ScanQA".
arXiv Detail & Related papers (2021-12-15T18:59:59Z)
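The two-encoder idea mentioned above can be sketched as separate Transformer encoders over per-object appearance and geometry features, pooled and combined with the question embedding for answer prediction; dimensions, pooling, and the answer head are assumptions rather than the paper's design.
```python
# Sketch of the two-encoder idea: one transformer encoder over per-object
# appearance features and another over per-object geometry features, pooled
# and combined with a question embedding to predict an answer.
import torch
import torch.nn as nn


class TwoEncoder3DQASketch(nn.Module):
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.appearance_enc = encoder()      # e.g. per-object color/texture features
        self.geometry_enc = encoder()        # e.g. per-object point/box features
        self.head = nn.Linear(3 * dim, num_answers)

    def forward(self, appearance, geometry, question):
        # appearance, geometry: (B, K, dim) per-object features; question: (B, dim).
        a = self.appearance_enc(appearance).mean(dim=1)
        g = self.geometry_enc(geometry).mean(dim=1)
        return self.head(torch.cat([a, g, question], dim=-1))


model = TwoEncoder3DQASketch()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 16, 256), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 1000])
```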