Syntax Tree Constrained Graph Network for Visual Question Answering
- URL: http://arxiv.org/abs/2309.09179v1
- Date: Sun, 17 Sep 2023 07:03:54 GMT
- Title: Syntax Tree Constrained Graph Network for Visual Question Answering
- Authors: Xiangrui Su, Qi Zhang, Chongyang Shi, Jiachang Liu, and Liang Hu
- Abstract summary: Visual Question Answering (VQA) aims to automatically answer natural language questions related to given image content.
We propose a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax tree.
We then design a message-passing mechanism for phrase-aware visual entities and capture entity features according to a given visual context.
- Score: 14.059645822205718
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question Answering (VQA) aims to automatically answer natural language
questions related to given image content. Existing VQA methods integrate vision
modeling and language understanding to explore the deep semantics of the
question. However, these methods ignore the significant syntax information of
the question, which plays a vital role in understanding the essential semantics
of the question and guiding the visual feature refinement. To fill the gap, we
suggested a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based
on entity message passing and syntax tree. This model is able to extract a
syntax tree from questions and obtain more precise syntax information.
Specifically, we parse questions and obtain the question syntax tree using the
Stanford syntax parsing tool. From the word level and phrase level, syntactic
phrase features and question features are extracted using a hierarchical tree
convolutional network. We then design a message-passing mechanism for
phrase-aware visual entities and capture entity features according to a given
visual context. Extensive experiments on VQA2.0 datasets demonstrate the
superiority of our proposed model.
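The pipeline described above — parse the question into a syntax tree, aggregate features bottom-up over that tree, then pass phrase-aware messages to visual entities — can be illustrated with a minimal sketch. This is a toy stand-in only: the paper's actual components are neural (a hierarchical tree convolutional network and a learned message-passing mechanism), and the tree shape, word vectors, and entity features below are hypothetical examples, not from the paper.

```python
# Toy sketch of the STCGN pipeline: tree-structured feature aggregation
# followed by one round of phrase-aware message passing. All values and
# the parse of "What color is the cat" are assumed for illustration.

# A syntax tree as (label, children); leaves are words.
tree = ("ROOT", [
    ("NP", [("what", []), ("color", [])]),
    ("VP", [("is", []), ("NP", [("the", []), ("cat", [])])]),
])

WORD_VECS = {  # hypothetical 2-d word embeddings
    "what": [1.0, 0.0], "color": [0.0, 1.0], "is": [0.5, 0.5],
    "the": [0.1, 0.1], "cat": [0.9, 0.2],
}

def tree_features(node):
    """Bottom-up aggregation: a leaf returns its word vector; an internal
    node averages its children (a stand-in for tree convolution)."""
    label, children = node
    if not children:
        return WORD_VECS[label]
    child_vecs = [tree_features(c) for c in children]
    dim = len(child_vecs[0])
    return [sum(v[i] for v in child_vecs) / len(child_vecs)
            for i in range(dim)]

def message_passing(phrase_vec, entity_vecs):
    """One round of phrase-aware message passing: visual entities are
    re-weighted by dot-product similarity to the phrase feature and
    combined into a single context vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(phrase_vec, e) for e in entity_vecs]
    total = sum(scores) or 1.0
    weights = [s / total for s in scores]
    dim = len(entity_vecs[0])
    return [sum(w * e[i] for w, e in zip(weights, entity_vecs))
            for i in range(dim)]

question_vec = tree_features(tree)
entities = [[0.8, 0.3], [0.2, 0.9]]  # two toy visual-entity features
context = message_passing(question_vec, entities)
print(question_vec, context)
```

In the real model the averaging above is replaced by learned tree convolutions, and the similarity weighting by attention-style message functions; the sketch only shows where syntax enters the computation.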
Related papers
- Integrating Large Language Models with Graph-based Reasoning for Conversational Question Answering [58.17090503446995]
We focus on a conversational question answering task which combines the challenges of understanding questions in context and reasoning over evidence gathered from heterogeneous sources like text, knowledge graphs, tables, and infoboxes.
Our method utilizes a graph structured representation to aggregate information about a question and its context.
arXiv Detail & Related papers (2024-06-14T13:28:03Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- Semantic Parsing for Conversational Question Answering over Knowledge Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to execution results thereof.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
- Text-Aware Dual Routing Network for Visual Question Answering [11.015339851906287]
Existing approaches often fail in cases that require reading and understanding text in images to answer questions.
We propose a Text-Aware Dual Routing Network (TDR) which simultaneously handles the VQA cases with and without understanding text information in the input images.
In the branch that involves text understanding, we incorporate the Optical Character Recognition (OCR) features into the model to help understand the text in the images.
arXiv Detail & Related papers (2022-11-17T02:02:11Z)
- Discourse Analysis via Questions and Answers: Parsing Dependency Structures of Questions Under Discussion [57.43781399856913]
This work adopts the linguistic framework of Questions Under Discussion (QUD) for discourse analysis.
We characterize relationships between sentences as free-form questions, in contrast to exhaustive fine-grained questions.
We develop the first-of-its-kind QUD parser that derives a dependency structure of questions over full documents.
arXiv Detail & Related papers (2022-10-12T03:53:12Z)
- Exploiting Rich Syntax for Better Knowledge Base Question Answering [13.890818931081405]
We propose an approach to learn syntax-based representations for Knowledge Base Question Answering.
First, we encode path-based syntax by considering the shortest dependency paths between keywords.
Then, we propose two encoding strategies to model the information of whole syntactic trees to obtain tree-based syntax.
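The path-based encoding this summary mentions starts from the shortest dependency path between two keywords. A minimal sketch of that step is a breadth-first search over the undirected dependency graph; the example sentence and its (head, dependent) edges below are an assumed toy parse, not data from the paper.

```python
from collections import deque

# Hypothetical dependency edges for "Which river flows through Paris":
# (head, dependent) pairs from an assumed toy parse.
edges = [("flows", "river"), ("river", "Which"),
         ("flows", "through"), ("through", "Paris")]

def shortest_dependency_path(edges, src, dst):
    """BFS over the undirected dependency graph; returns the word
    sequence on the shortest path between two keywords, or None."""
    adj = {}
    for head, dep in edges:
        adj.setdefault(head, []).append(dep)
        adj.setdefault(dep, []).append(head)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_dependency_path(edges, "river", "Paris")
print(path)  # ['river', 'flows', 'through', 'Paris']
```

In the paper's setting this word sequence would then be fed to a learned encoder; the BFS only shows how the path itself is obtained from a parse.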
arXiv Detail & Related papers (2021-07-16T14:59:05Z)
- Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)
- Augmenting Visual Question Answering with Semantic Frame Information in a Multitask Learning Approach [1.827510863075184]
We propose a multitask CNN-LSTM VQA model that learns to classify the answers as well as the semantic frame elements.
Our experiments show that semantic frame element classification helps the VQA system avoid inconsistent responses and improves performance.
arXiv Detail & Related papers (2020-01-31T06:31:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.