Towards Explainable 3D Grounded Visual Question Answering: A New
Benchmark and Strong Baseline
- URL: http://arxiv.org/abs/2209.12028v1
- Date: Sat, 24 Sep 2022 15:09:02 GMT
- Title: Towards Explainable 3D Grounded Visual Question Answering: A New
Benchmark and Strong Baseline
- Authors: Lichen Zhao, Daigang Cai, Jing Zhang, Lu Sheng, Dong Xu, Rui Zheng,
Yinjie Zhao, Lipeng Wang and Xibo Fan
- Abstract summary: The 3D visual question answering (VQA) task is less explored and more susceptible to language priors and co-reference ambiguity.
We collect a new 3D VQA dataset with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations.
We propose a new 3D VQA framework to effectively predict the completely visually grounded and explainable answer.
- Score: 35.717047755880536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, 3D vision-and-language tasks have attracted increasing research
interest. Compared to other vision-and-language tasks, the 3D visual question
answering (VQA) task is less explored and more susceptible to language priors
and co-reference ambiguity. Meanwhile, the few recently proposed 3D VQA
datasets do not support the 3D VQA task well due to their limited scale and
annotation methods. In this work, we formally define and address a 3D grounded
VQA task by collecting a new 3D VQA dataset, referred to as FE-3DGQA, with
diverse and relatively free-form question-answer pairs, as well as dense and
completely grounded bounding box annotations. To achieve more explainable
answers, we label the objects appearing in the complex QA pairs with different
semantic types, including answer-grounded objects (both those appearing and
those not appearing in the questions) and contextual objects for the
answer-grounded objects. We also propose a new 3D VQA framework to effectively
predict the completely visually grounded and explainable answer. Extensive
experiments verify that our newly collected benchmark dataset can be used to
evaluate various 3D VQA methods from different aspects, and that our proposed
framework achieves state-of-the-art performance on the new benchmark. Both the
newly collected dataset and our code will be publicly available at
http://github.com/zlccccc/3DGQA.
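The abstract does not spell out the evaluation protocol, so the sketch below is a hypothetical illustration (not the authors' implementation) of how a grounded 3D VQA benchmark could jointly score answer correctness and answer grounding: exact-match answer accuracy paired with an axis-aligned 3D IoU check against annotated answer-grounded boxes. All function names, data fields, and the 0.25 IoU threshold are assumptions.

```python
# Hypothetical grounded 3D VQA scoring: answer accuracy plus grounding accuracy
# via axis-aligned 3D IoU. Names, fields, and thresholds are illustrative only,
# not the FE-3DGQA authors' protocol.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float, float, float]  # (xmin, ymin, zmin, xmax, ymax, zmax)


def _volume(box: Box) -> float:
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    return inter / (_volume(a) + _volume(b) - inter)


def evaluate(predictions: List[Dict], annotations: List[Dict],
             iou_thresh: float = 0.25) -> Dict[str, float]:
    """Report exact-match answer accuracy and the fraction of answer-grounded
    boxes localized above the IoU threshold (both are assumed metrics)."""
    correct_answers = 0
    grounded_hits = 0
    for pred, gt in zip(predictions, annotations):
        correct_answers += int(pred["answer"] == gt["answer"])
        # Compare the predicted answer-grounded box with the annotated one.
        grounded_hits += int(box_iou_3d(pred["answer_box"], gt["answer_box"]) >= iou_thresh)
    n = len(annotations)
    return {"answer_accuracy": correct_answers / n,
            "grounding_acc@0.25": grounded_hits / n}
```

Reporting answer accuracy and grounding accuracy separately, as sketched here, is one way a benchmark could expose whether a model answers correctly for the right (visually grounded) reasons.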
Related papers
- Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering [28.717312557697376]
3D Scene Question Answering represents an interdisciplinary task that integrates 3D visual perception and natural language processing.
Recent advances in large multimodal models have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA.
This paper presents the first comprehensive survey of 3D SQA, systematically reviewing datasets, methodologies, and evaluation metrics.
arXiv Detail & Related papers (2025-02-01T07:01:33Z) - AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly encounter a shortage of text-3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z) - Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
arXiv Detail & Related papers (2024-08-29T16:05:22Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations to date, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z) - 3D-Aware Visual Question Answering about Parts, Poses and Occlusions [20.83938624671415]
We introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes.
We propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition.
Our experimental results show that PO3D-VQA significantly outperforms existing methods, though a substantial performance gap compared to 2D VQA benchmarks remains.
arXiv Detail & Related papers (2023-10-27T06:15:30Z) - Multi-CLIP: Contrastive Vision-Language Pre-training for Question
Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for
Grounding Relative Directions via Multi-Task Learning [16.538887534958555]
We introduce GRiD-A-3D, a novel diagnostic visual question-answering dataset based on abstract objects.
Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions.
We demonstrate that within a few epochs, the subtasks required to reason over relative directions are learned in the order in which relative directions are intuitively processed.
arXiv Detail & Related papers (2022-07-06T12:31:49Z) - 3D Question Answering [22.203927159777123]
We present the first attempt at extending Visual Question Answering (VQA) to the 3D domain.
We propose a novel transformer-based 3DQA framework, "3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information.
To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset, "ScanQA".
arXiv Detail & Related papers (2021-12-15T18:59:59Z)