Space3D-Bench: Spatial 3D Question Answering Benchmark
- URL: http://arxiv.org/abs/2408.16662v3
- Date: Sun, 15 Sep 2024 15:51:00 GMT
- Title: Space3D-Bench: Spatial 3D Question Answering Benchmark
- Authors: Emilia Szymanska, Mihai Dusmanu, Jan-Willem Buurlage, Mahdi Rad, Marc Pollefeys
- Abstract summary: We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
- Score: 49.259397521459114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Answering questions about the spatial properties of the environment poses challenges for existing language and vision foundation models due to a lack of understanding of the 3D world, notably in terms of relationships between objects. To push the field forward, multiple 3D Q&A datasets were proposed which, overall, provide a variety of questions, but they individually focus on particular aspects of 3D reasoning or are limited in terms of data modalities. To address this, we present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset, which offers a variety of data modalities: point clouds, posed RGB-D images, navigation meshes and 3D object detections. To ensure that the questions cover a wide range of 3D objectives, we propose an indoor spatial question taxonomy inspired by geographic information systems and use it to balance the dataset accordingly. Moreover, we provide an assessment system that grades natural language responses based on predefined ground-truth answers, leveraging a Vision Language Model's comprehension of both text and images to compare the responses with ground-truth textual information or relevant visual data. Finally, we introduce a baseline called RAG3D-Chat, integrating the world understanding of foundation models with rich context retrieval and achieving an accuracy of 67% on the proposed dataset.
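As a rough illustration of the assessment system described in the abstract, the sketch below shows how a VLM-based grader might compare a free-form response against a predefined ground-truth answer and, optionally, relevant visual data. The `query_vlm` wrapper, the prompt wording, and the yes/no verdict format are assumptions made for illustration, not the paper's exact protocol.

```python
# Minimal sketch of VLM-based answer grading, under assumed prompt and API conventions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAItem:
    question: str
    ground_truth: str                      # predefined ground-truth answer (text)
    reference_image: Optional[str] = None  # path to relevant visual data, if any


def query_vlm(prompt: str, image_path: Optional[str] = None) -> str:
    """Hypothetical stand-in for any vision-language model client; replace with a real API call."""
    raise NotImplementedError


def grade_response(item: QAItem, response: str) -> bool:
    """Ask the VLM whether a free-form response conveys the same answer as the
    ground truth (optionally consulting the reference image), returning pass/fail."""
    prompt = (
        f"Question: {item.question}\n"
        f"Ground-truth answer: {item.ground_truth}\n"
        f"Candidate response: {response}\n"
        "Does the candidate response convey the same answer as the ground truth? "
        "Reply with 'yes' or 'no'."
    )
    verdict = query_vlm(prompt, image_path=item.reference_image)
    return verdict.strip().lower().startswith("yes")

# Accuracy over the benchmark would then be the fraction of responses graded 'yes'.
```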
Related papers
- Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering [28.717312557697376]
3D Scene Question Answering represents an interdisciplinary task that integrates 3D visual perception and natural language processing.
Recent advances in large multimodal modelling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA.
This paper presents the first comprehensive survey of 3D SQA, systematically reviewing datasets, methodologies, and evaluation metrics.
arXiv Detail & Related papers (2025-02-01T07:01:33Z) - AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly encounter a shortage of text-3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z) - GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding [53.42728468191711]
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions.
We propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding.
arXiv Detail & Related papers (2024-11-29T11:23:15Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline [35.717047755880536]
The 3D visual question answering (VQA) task is less explored and is more susceptible to language priors and co-reference ambiguity.
We collect a new 3D VQA dataset with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations.
We propose a new 3D VQA framework to effectively predict the completely visually grounded and explainable answer.
arXiv Detail & Related papers (2022-09-24T15:09:02Z) - Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering [23.083935053799145]
Text-based Visual Question Answering (TextVQA) aims to produce correct answers to questions about images that contain multiple scene texts.
We introduce 3D geometric information into a human-like spatial reasoning process to capture key objects' contextual knowledge.
Our method achieves state-of-the-art performance on TextVQA and ST-VQA datasets.
arXiv Detail & Related papers (2022-09-21T12:49:14Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.