Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering
- URL: http://arxiv.org/abs/2209.10326v2
- Date: Thu, 15 Jun 2023 02:38:25 GMT
- Title: Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering
- Authors: Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen
- Abstract summary: Text-based Visual Question Answering (TextVQA) aims to produce correct answers to questions about images that contain multiple scene texts.
We introduce 3D geometric information into a human-like spatial reasoning process to capture key objects' contextual knowledge.
Our method achieves state-of-the-art performance on TextVQA and ST-VQA datasets.
- Score: 23.083935053799145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based Visual Question Answering (TextVQA) aims to produce correct
answers to questions about images that contain multiple scene texts. In most
cases, the texts naturally attach to the surface of the objects. Therefore,
spatial reasoning between texts and objects is crucial in TextVQA. However,
existing approaches are constrained within 2D spatial information learned from
the input images and rely on transformer-based architectures to reason
implicitly during the fusion process. Under this setting, these 2D spatial
reasoning approaches cannot distinguish the fine-grained spatial relations
between visual objects and scene texts on the same image plane, thereby
impairing the interpretability and performance of TextVQA models. In this
paper, we introduce 3D geometric information into a human-like spatial
reasoning process to capture the contextual knowledge of key objects
step-by-step. To enhance the model's understanding of 3D spatial relationships,
we specifically (i) propose a relation prediction module for accurately
locating the region of interest of critical objects, and (ii) design a
depth-aware attention calibration module that calibrates the OCR tokens'
attention according to the critical objects. Extensive experiments show that our
method achieves state-of-the-art performance on TextVQA and ST-VQA datasets.
More encouragingly, our model surpasses others by clear margins of 5.7% and
12.1% on questions that involve spatial reasoning in the TextVQA and ST-VQA
validation splits. We also verify the generalizability of our model on the
text-based image captioning task.
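The abstract's module (ii), depth-aware attention calibration, is the most concrete technical idea here. The Python sketch below illustrates one plausible reading of it, assuming per-region depth estimates are available: OCR tokens whose estimated depth is close to a critical object's depth keep more attention mass. The function name, tensor shapes, and the exponential re-weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): re-weighting OCR-token attention by
# depth proximity to a critical object, in the spirit of the paper's
# depth-aware attention calibration module. All names and the exponential
# calibration formula are assumptions for illustration.
import torch
import torch.nn.functional as F

def depth_aware_calibration(ocr_attn, ocr_depth, obj_depth, temperature=1.0):
    """Calibrate attention over N OCR tokens using depth cues.

    ocr_attn : (B, N) raw attention scores over OCR tokens
    ocr_depth: (B, N) estimated depth of each OCR token's region
    obj_depth: (B, 1) estimated depth of the critical object's region
    """
    # Tokens whose depth is close to the critical object get a weight near 1;
    # distant tokens are damped before the final softmax.
    depth_gap = (ocr_depth - obj_depth).abs()
    calibration = torch.exp(-depth_gap / temperature)
    return F.softmax(ocr_attn * calibration, dim=-1)

# Toy usage with random scores and placeholder depths.
B, N = 2, 5
attn = torch.randn(B, N)
ocr_d = torch.rand(B, N) * 10.0   # pseudo depth of OCR regions
obj_d = torch.rand(B, 1) * 10.0   # pseudo depth of the critical object
print(depth_aware_calibration(attn, ocr_d, obj_d).shape)  # torch.Size([2, 5])
```

In the paper's pipeline, the critical object would presumably come from the relation prediction module of step (i); here the depths are random placeholders so the sketch stays runnable.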
Related papers
- Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
arXiv Detail & Related papers (2024-08-29T16:05:22Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our proposed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
- SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning [10.810615375345511]
This paper proposes a benchmark for spatial reasoning on natural language text.
We design grammar and reasoning rules to automatically generate a spatial description of visual scenes and corresponding QA pairs.
Experiments show that further pretraining LMs on these automatically generated data significantly improves LMs' capability on spatial understanding.
arXiv Detail & Related papers (2021-04-12T21:37:18Z)
- Spatially Aware Multimodal Transformers for TextVQA [61.01618988620582]
We study the TextVQA task, i.e., reasoning about text in images to answer a question.
Existing approaches are limited in their use of spatial relations.
We propose a novel spatially aware self-attention layer (a generic sketch of this idea follows the list).
arXiv Detail & Related papers (2020-07-23T17:20:55Z)
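The last entry above proposes a spatially aware self-attention layer. A common way to realize that general idea is to add a learned bias to the attention logits based on discrete pairwise spatial relations between regions; the sketch below shows this generic pattern, not the paper's exact layer (the relation vocabulary, shapes, and bias injection point are assumptions).

```python
# Generic spatially aware self-attention sketch (illustrative only; not the
# exact layer from "Spatially Aware Multimodal Transformers for TextVQA").
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialBiasSelfAttention(nn.Module):
    def __init__(self, dim, num_relations=12):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        # One learned scalar bias per discrete spatial relation (e.g. left-of,
        # above, overlaps); relation ids are assumed precomputed from boxes.
        self.rel_bias = nn.Embedding(num_relations, 1)
        self.scale = dim ** -0.5

    def forward(self, x, rel_ids):
        # x: (B, N, D) region/OCR features; rel_ids: (B, N, N) relation indices
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, N, N) logits
        attn = attn + self.rel_bias(rel_ids).squeeze(-1)     # add spatial bias
        return F.softmax(attn, dim=-1) @ v

# Toy usage.
B, N, D = 2, 6, 32
layer = SpatialBiasSelfAttention(D)
feats = torch.randn(B, N, D)
rels = torch.randint(0, 12, (B, N, N))
print(layer(feats, rels).shape)  # torch.Size([2, 6, 32])
```

Conditioning the bias on a small vocabulary of relation types keeps the extra parameter count negligible while making the attention pattern explicitly spatial.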