Comprehensive Visual Question Answering on Point Clouds through
Compositional Scene Manipulation
- URL: http://arxiv.org/abs/2112.11691v3
- Date: Mon, 22 May 2023 02:55:52 GMT
- Title: Comprehensive Visual Question Answering on Point Clouds through
Compositional Scene Manipulation
- Authors: Xu Yan, Zhihao Yuan, Yuhao Du, Yinghong Liao, Yao Guo, Zhen Li,
Shuguang Cui
- Abstract summary: We propose the CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes.
We develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions.
A more challenging setup is proposed to remove the confounding bias and adjust the context from a common-sense layout.
- Score: 33.91844305449863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering on 3D Point Cloud (VQA-3D) is an emerging yet
challenging field that aims at answering various types of textual questions
given an entire point cloud scene. To tackle this problem, we propose the
CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771
3D scenes. Specifically, we develop a question engine leveraging 3D scene graph
structures to generate diverse reasoning questions covering objects' attributes
(i.e., size, color, and material) and their spatial relationships. In this
manner, we initially generate 44K questions from
1,333 real-world scenes. Moreover, a more challenging setup is proposed to
remove the confounding bias and adjust the context from a common-sense layout.
Such a setup requires the network to achieve comprehensive visual understanding
when the 3D scene is different from the general co-occurrence context (e.g.,
chairs usually co-occurring with tables). To this end, we further introduce the
compositional scene manipulation strategy and generate 127K questions from
7,438 augmented 3D scenes, which can improve VQA-3D models for real-world
comprehension. Built upon the proposed dataset, we benchmark several baseline
VQA-3D models; experimental results verify that CLEVR3D can significantly boost
other 3D scene understanding tasks. Our code and dataset will be made
publicly available at https://github.com/yanx27/CLEVR3D.
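The abstract describes the question engine and the compositional scene manipulation only at a high level. Below is a minimal, hypothetical sketch of both ideas over a toy scene graph; the class and function names (Obj, SceneGraph, swap_object) and the question templates are assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch of a scene-graph question engine and a compositional
# object swap; names and templates are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Obj:
    name: str       # e.g. "chair"
    size: str       # e.g. "small"
    color: str      # e.g. "black"
    material: str   # e.g. "metal"

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)    # id -> Obj
    relations: list = field(default_factory=list)  # (subject_id, relation, object_id)

def attribute_question(g: SceneGraph, obj_id: int, attr: str):
    """Template a question about one object's attribute (size/color/material)."""
    o = g.objects[obj_id]
    return f"What is the {attr} of the {o.name}?", getattr(o, attr)

def relation_question(g: SceneGraph, rel_idx: int):
    """Template a question about a spatial relation between two objects."""
    subj_id, rel, obj_id = g.relations[rel_idx]
    subj, obj = g.objects[subj_id], g.objects[obj_id]
    return f"What is {rel} the {obj.color} {obj.name}?", subj.name

def swap_object(g: SceneGraph, obj_id: int, replacement: Obj) -> SceneGraph:
    """Compositional manipulation: replace an object but keep its relations,
    so the scene departs from the usual co-occurrence context."""
    g.objects[obj_id] = replacement
    return g

if __name__ == "__main__":
    g = SceneGraph(
        objects={0: Obj("table", "large", "brown", "wood"),
                 1: Obj("chair", "small", "black", "metal")},
        relations=[(1, "next to", 0)],
    )
    print(attribute_question(g, 1, "color"))   # ('What is the color of the chair?', 'black')
    print(relation_question(g, 0))             # ('What is next to the brown table?', 'chair')
    swap_object(g, 0, Obj("bathtub", "large", "white", "ceramic"))
    print(relation_question(g, 0))             # ('What is next to the white bathtub?', 'chair')
```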
Related papers
- Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We bridge the research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes.
The task allows producing 3D segmentation masks and detailed textual explanations enriched by 3D spatial relations among objects.
In addition, we design MORE3D, a simple yet effective method that enables multi-object 3D reasoning segmentation with user questions and textual outputs.
arXiv Detail & Related papers (2024-11-21T08:22:45Z)
- Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
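The assessment system is only named in this summary; the following is a rough sketch under the assumption of simple string-normalized matching against the predefined ground-truth answer (the actual Space3D-Bench grader may work differently, e.g. with an LLM judge).

```python
# Assumed sketch of grading a natural-language answer against a ground truth.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def grade(response: str, ground_truth: str) -> bool:
    """Accept the response if the normalized ground truth appears in it."""
    return normalize(ground_truth) in normalize(response)

print(grade("There are three chairs in the room.", "three"))  # True
```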
arXiv Detail & Related papers (2024-08-29T16:05:22Z)
- Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA [6.697298321551588]
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts.
We propose a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues.
We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure.
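Under one plausible reading, a question-conditional view selection step scores candidate 2D views by the similarity of their embeddings to the question embedding and keeps the top-k; the sketch below assumes precomputed embeddings and is not the paper's actual module.

```python
# Assumed sketch of question-conditional 2D view selection; illustrative only.
import numpy as np

def select_views(question_emb: np.ndarray, view_embs: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k views whose embeddings best match the question.

    question_emb: (d,) embedding of the question text.
    view_embs:    (n_views, d) embeddings of candidate 2D views of the scene.
    """
    q = question_emb / np.linalg.norm(question_emb)
    v = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    scores = v @ q                        # cosine similarity per view
    return np.argsort(scores)[::-1][:k]   # indices of the top-k views

rng = np.random.default_rng(0)
print(select_views(rng.normal(size=64), rng.normal(size=(12, 64)), k=3))
```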
arXiv Detail & Related papers (2024-02-24T23:31:34Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
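One plausible (assumed) reading of object identifiers is that each detected object is exposed to the language model under an explicit ID it can refer back to; the toy prompt format below is illustrative, not Chat-Scene's actual interface.

```python
# Toy sketch of object-identifier prompting; not Chat-Scene's API.
def build_prompt(objects: list, question: str) -> str:
    """List each detected object with an explicit identifier, then append the
    question, so answers can refer back to objects by ID (e.g. "<OBJ1>")."""
    lines = [f"<OBJ{i}> {o['label']} at {o['center']}" for i, o in enumerate(objects)]
    return "\n".join(lines) + f"\nQuestion: {question}"

objs = [{"label": "sofa", "center": (1.2, 0.4, 0.0)},
        {"label": "lamp", "center": (2.0, 1.1, 0.0)}]
print(build_prompt(objs, "Which object is next to the sofa?"))
```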
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- 3D-Aware Visual Question Answering about Parts, Poses and Occlusions [20.83938624671415]
We introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes.
We propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition.
Our experimental results show our model PO3D-VQA outperforms existing methods significantly, but we still observe a significant performance gap compared to 2D VQA benchmarks.
arXiv Detail & Related papers (2023-10-27T06:15:30Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose a SurroundOcc method to predict the 3D occupancy with multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries [68.75400888770793]
We formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos.
Specifically, our approach achieves an overall success rate of up to 87.12%, which sets a new state-of-the-art result in the VQ3D task.
arXiv Detail & Related papers (2022-12-14T01:28:12Z)
- SQA3D: Situated Question Answering in 3D Scenes [86.0205305318308]
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).
Given a scene context, SQA3D requires the tested agent to first understand its situation in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation.
Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations.
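Structurally, each SQA3D item pairs a situation description with a question and answer grounded in a ScanNet scene; the field names in the sketch below are assumed for illustration and need not match the dataset's actual schema.

```python
# Assumed shape of one situated-QA record; field names are illustrative.
from dataclasses import dataclass

@dataclass
class SituatedQA:
    scene_id: str   # ScanNet scene the question is grounded in
    situation: str  # text describing where the agent is and what it faces
    question: str   # question asked from that situation
    answer: str     # ground-truth answer

example = SituatedQA(
    scene_id="scene0000_00",
    situation="I am sitting on the sofa, facing the TV.",
    question="What is on my left?",
    answer="a floor lamp",
)
```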
arXiv Detail & Related papers (2022-10-14T02:52:26Z)
- ScanQA: 3D Question Answering for Spatial Scene Understanding [7.136295795446433]
We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA).
In the 3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D indoor scan and answer the given textual questions about the 3D scene.
Our new ScanQA dataset contains over 41K question-answer pairs from the 800 indoor scenes drawn from the ScanNet dataset.
arXiv Detail & Related papers (2021-12-20T12:30:55Z)
- 3D Question Answering [22.203927159777123]
We present the first attempt at extending Visual Question Answering (VQA) to the 3D domain.
We propose a novel transformer-based 3DQA framework "3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information.
To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset "ScanQA".
arXiv Detail & Related papers (2021-12-15T18:59:59Z)