Towards Explainable 3D Grounded Visual Question Answering: A New
Benchmark and Strong Baseline
- URL: http://arxiv.org/abs/2209.12028v1
- Date: Sat, 24 Sep 2022 15:09:02 GMT
- Title: Towards Explainable 3D Grounded Visual Question Answering: A New
Benchmark and Strong Baseline
- Authors: Lichen Zhao, Daigang Cai, Jing Zhang, Lu Sheng, Dong Xu, Rui Zheng,
Yinjie Zhao, Lipeng Wang and Xibo Fan
- Abstract summary: The 3D visual question answering (VQA) task is less explored and more susceptible to language priors and co-reference ambiguity.
We collect a new 3D VQA dataset with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations.
We propose a new 3D VQA framework to effectively predict the completely visually grounded and explainable answer.
- Score: 35.717047755880536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, 3D vision-and-language tasks have attracted increasing research
interest. Compared to other vision-and-language tasks, the 3D visual question
answering (VQA) task is less explored and more susceptible to language priors
and co-reference ambiguity. Meanwhile, the few recently proposed 3D VQA
datasets do not support the 3D VQA task well due to their limited scale and
annotation methods. In this work, we formally define and address a 3D grounded
VQA task by collecting a new 3D VQA dataset, referred to as FE-3DGQA, with
diverse and relatively free-form question-answer pairs, as well as dense and
completely grounded bounding box annotations. To achieve more explainable
answers, we label the objects appearing in the complex QA pairs with different
semantic types, including answer-grounded objects (both those appearing and
those not appearing in the questions) and contextual objects for the
answer-grounded objects. We also propose a new 3D VQA framework to effectively
predict the completely visually grounded and explainable answer. Extensive
experiments verify that our newly collected benchmark dataset can be used to
evaluate various 3D VQA methods from different aspects, and that our proposed
framework achieves state-of-the-art performance on the new benchmark. Both the
newly collected dataset and our code will be publicly available at
http://github.com/zlccccc/3DGQA.
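The abstract does not spell out the evaluation protocol, so the sketch below is a hypothetical illustration (not the authors' implementation) of how a grounded 3D VQA benchmark could jointly score answer correctness and answer grounding: exact-match answer accuracy paired with an axis-aligned 3D IoU check against annotated answer-grounded boxes. All function names, data fields, and the 0.25 IoU threshold are assumptions.

```python
# Hypothetical grounded 3D VQA scoring: answer accuracy plus grounding accuracy
# via axis-aligned 3D IoU. Names, fields, and thresholds are illustrative only,
# not the FE-3DGQA authors' protocol.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float, float, float]  # (xmin, ymin, zmin, xmax, ymax, zmax)


def _volume(box: Box) -> float:
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    return inter / (_volume(a) + _volume(b) - inter)


def evaluate(predictions: List[Dict], annotations: List[Dict],
             iou_thresh: float = 0.25) -> Dict[str, float]:
    """Report exact-match answer accuracy and the fraction of answer-grounded
    boxes localized above the IoU threshold (both are assumed metrics)."""
    correct_answers = 0
    grounded_hits = 0
    for pred, gt in zip(predictions, annotations):
        correct_answers += int(pred["answer"] == gt["answer"])
        # Compare the predicted answer-grounded box with the annotated one.
        grounded_hits += int(box_iou_3d(pred["answer_box"], gt["answer_box"]) >= iou_thresh)
    n = len(annotations)
    return {"answer_accuracy": correct_answers / n,
            "grounding_acc@0.25": grounded_hits / n}
```

Reporting answer accuracy and grounding accuracy separately, as sketched here, is one way a benchmark could expose whether a model answers correctly for the right (visually grounded) reasons.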
Related papers
- Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering [28.717312557697376]
3D Scene Question Answering represents an interdisciplinary task that integrates 3D visual perception and natural language processing.
Recent advances in large multimodal models have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA.
This paper presents the first comprehensive survey of 3D SQA, systematically reviewing datasets, methodologies, and evaluation metrics.
arXiv Detail & Related papers (2025-02-01T07:01:33Z) - AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly encounter a shortage of text-3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z) - Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
arXiv Detail & Related papers (2024-08-29T16:05:22Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations to date, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z) - 3D-Aware Visual Question Answering about Parts, Poses and Occlusions [20.83938624671415]
We introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes.
We propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition.
Our experimental results show that PO3D-VQA significantly outperforms existing methods, though a substantial performance gap compared to 2D VQA benchmarks remains.
arXiv Detail & Related papers (2023-10-27T06:15:30Z) - Multi-CLIP: Contrastive Vision-Language Pre-training for Question
Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for
Grounding Relative Directions via Multi-Task Learning [16.538887534958555]
We introduce GRiD-A-3D, a novel diagnostic visual question-answering dataset based on abstract objects.
Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions.
We demonstrate that within a few epochs, the subtasks required to reason over relative directions are learned in the order in which relative directions are intuitively processed.
arXiv Detail & Related papers (2022-07-06T12:31:49Z) - 3D Question Answering [22.203927159777123]
We present the first attempt at extending Visual Question Answering (VQA) to the 3D domain.
We propose a novel transformer-based 3DQA framework, "3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information.
To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset, "ScanQA".
arXiv Detail & Related papers (2021-12-15T18:59:59Z)