SQA3D: Situated Question Answering in 3D Scenes
- URL: http://arxiv.org/abs/2210.07474v5
- Date: Wed, 12 Apr 2023 20:05:41 GMT
- Title: SQA3D: Situated Question Answering in 3D Scenes
- Authors: Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang,
Song-Chun Zhu, Siyuan Huang
- Abstract summary: We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).
Given a scene context, SQA3D requires the tested agent to first understand its situation in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation.
Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations.
- Score: 86.0205305318308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new task to benchmark scene understanding of embodied agents:
Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g.,
3D scan), SQA3D requires the tested agent to first understand its situation
(position, orientation, etc.) in the 3D scene as described by text, then reason
about its surrounding environment and answer a question under that situation.
Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k
unique situations, along with 20.4k descriptions and 33.4k diverse reasoning
questions for these situations. These questions examine a wide spectrum of
reasoning capabilities for an intelligent agent, ranging from spatial relation
comprehension to commonsense understanding, navigation, and multi-hop
reasoning. SQA3D poses a significant challenge to current multi-modal models,
especially 3D reasoning models. We evaluate various state-of-the-art approaches
and find that the best one only achieves an overall score of 47.20%, while
amateur human participants can reach 90.06%. We believe SQA3D could facilitate
future embodied AI research with stronger situation understanding and reasoning
capability.
Related papers
- Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We bridge the research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes.
The task allows producing 3D segmentation masks and detailed textual explanations as enriched by 3D spatial relations among objects.
In addition, we design MORE3D, a simple yet effective method that enables multi-object 3D reasoning segmentation with user questions and textual outputs.
arXiv Detail & Related papers (2024-11-21T08:22:45Z)
- Situational Awareness Matters in 3D Vision Language Reasoning [30.113617846516398]
SIG3D is an end-to-end Situation-Grounded model for 3D vision language reasoning.
We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator.
Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering.
arXiv Detail & Related papers (2024-06-11T17:59:45Z)
- Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework for 3D scene understanding.
We propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding.
A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints.
arXiv Detail & Related papers (2024-03-18T14:47:03Z)
- EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI [88.03089807278188]
EmbodiedScan is a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding.
It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, and 160k 3D-oriented boxes spanning over 760 categories.
Building upon this database, we introduce a baseline framework named Embodied Perceptron.
It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities.
arXiv Detail & Related papers (2023-12-26T18:59:11Z)
- 3D-Aware Visual Question Answering about Parts, Poses and Occlusions [20.83938624671415]
We introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes.
We propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition.
Our experimental results show our model PO3D-VQA outperforms existing methods significantly, but we still observe a significant performance gap compared to 2D VQA benchmarks.
arXiv Detail & Related papers (2023-10-27T06:15:30Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- 3D Concept Learning and Reasoning from Multi-View Images [96.3088005719963]
We introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA).
This dataset consists of approximately 5k scenes, 600k images, paired with 50k questions.
We propose a novel 3D concept learning and reasoning framework that seamlessly combines neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
arXiv Detail & Related papers (2023-03-20T17:59:49Z)
- Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation [33.91844305449863]
We propose CLEVR3D, a large-scale VQA-3D dataset consisting of 171K questions from 8,771 3D scenes.
We develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions.
A more challenging setup is proposed to remove the confounding bias and adjust the context from a common-sense layout.
arXiv Detail & Related papers (2021-12-22T06:43:21Z)
- ScanQA: 3D Question Answering for Spatial Scene Understanding [7.136295795446433]
We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA).
In the 3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D indoor scan and answer the given textual questions about the 3D scene.
Our new ScanQA dataset contains over 41K question-answer pairs from the 800 indoor scenes drawn from the ScanNet dataset.
arXiv Detail & Related papers (2021-12-20T12:30:55Z)
- 3D Question Answering [22.203927159777123]
We present the first attempt at extending Visual Question Answering (VQA) to the 3D domain.
We propose a novel transformer-based 3DQA framework, "3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information.
To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset, "ScanQA".
arXiv Detail & Related papers (2021-12-15T18:59:59Z)