LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular
Supervision for Visual Question Answering
- URL: http://arxiv.org/abs/2011.10731v1
- Date: Sat, 21 Nov 2020 06:39:42 GMT
- Title: LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular
Supervision for Visual Question Answering
- Authors: Weixin Liang, Feiyang Niu, Aishwarya Reganti, Govind Thattai, Gokhan
Tur
- Abstract summary: We propose a transparent neural-symbolic reasoning framework for visual question answering.
It solves the problem step by step, as humans do, and provides a human-readable justification at each step.
Experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin.
- Score: 4.602329567377897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The predominant approach to visual question answering (VQA) relies on
encoding the image and question with a "black-box" neural encoder and decoding
a single token as the answer like "yes" or "no". Despite this approach's strong
quantitative results, it struggles to come up with intuitive, human-readable
forms of justification for the prediction process. To address this
insufficiency, we reformulate VQA as a full answer generation task, which
requires the model to justify its predictions in natural language. We propose
LRTA [Look, Read, Think, Answer], a transparent neural-symbolic reasoning
framework for visual question answering that solves the problem step by step,
as humans do, and provides a human-readable justification at each step.
Specifically, LRTA learns to first convert an image into a scene graph and
parse a question into multiple reasoning instructions. It then executes the
reasoning instructions one at a time by traversing the scene graph using a
recurrent neural-symbolic execution module. Finally, it generates a full answer
to the given question with natural language justifications. Our experiments on
the GQA dataset show that LRTA outperforms the state-of-the-art model by a large
margin (43.1% vs. 28.0%) on the full answer generation task. We also create a
perturbed GQA test set by removing linguistic cues (attributes and relations)
from the questions to analyze whether a model is merely making smart guesses
from superficial data correlations. We show that LRTA takes a step towards truly
understanding the question, whereas the state-of-the-art model tends to learn
superficial correlations from the training data.
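The abstract walks through a four-stage pipeline: Look (image to scene graph), Read (question to reasoning instructions), Think (recurrent neural-symbolic execution over the scene graph), and Answer (full-sentence generation with justification). The sketch below is a minimal, hypothetical Python rendering of that control flow; the data structures, instruction string format, and the `execute_step` helper are assumptions made for illustration and are not taken from the paper's implementation.
```python
# A minimal sketch of the Look-Read-Think-Answer flow described above.
# Module names, instruction formats, and helpers are illustrative assumptions,
# not the authors' code.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SceneGraphNode:
    name: str                                                 # object label, e.g. "cup"
    attributes: List[str] = field(default_factory=list)       # e.g. ["red", "small"]
    relations: Dict[str, str] = field(default_factory=dict)   # relation name -> neighbor


def execute_step(scene_graph: Dict[str, SceneGraphNode],
                 state: Optional[SceneGraphNode],
                 instruction: str):
    """One symbolic execution step over the scene graph (purely illustrative)."""
    op, _, arg = instruction.partition("(")
    arg = arg.rstrip(")")
    if op == "select":                         # start traversal at a named object
        return scene_graph.get(arg)
    if op == "relate" and state is not None:   # follow a relation edge
        return scene_graph.get(state.relations.get(arg, ""))
    if op == "query" and state is not None:    # read off the queried attribute(s)
        return state.attributes
    return state


def think(scene_graph: Dict[str, SceneGraphNode], instructions: List[str]):
    """'Think': execute the parsed instructions one at a time by traversing the
    scene graph, keeping a human-readable trace as the per-step justification.
    In LRTA this is a recurrent neural-symbolic module; only the symbolic
    traversal skeleton is shown here."""
    state, trace = None, []
    for instruction in instructions:
        state = execute_step(scene_graph, state, instruction)
        trace.append(f"{instruction} -> {state}")
    return state, trace


if __name__ == "__main__":
    # Toy scene graph standing in for the 'Look' stage, and hand-written
    # instructions standing in for the 'Read' stage.
    graph = {
        "table": SceneGraphNode("table", ["wooden"], {"holds": "cup"}),
        "cup": SceneGraphNode("cup", ["red", "small"], {"on": "table"}),
    }
    result, justification = think(graph, ["select(table)", "relate(holds)", "query(color)"])
    print(result)          # ['red', 'small']
    print(justification)   # step-by-step trace that 'Answer' would verbalize
```
Under this reading, the perturbed GQA test set removes exactly the attribute and relation cues that such instructions depend on, so a model relying on superficial question-answer correlations would be unaffected while a model that actually executes the reasoning steps would not.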
Related papers
- Large Vision-Language Models for Remote Sensing Visual Question Answering [0.0]
Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions.
Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions.
We propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process.
arXiv Detail & Related papers (2024-11-16T18:32:38Z)
- Integrating Large Language Models with Graph-based Reasoning for Conversational Question Answering [58.17090503446995]
We focus on a conversational question answering task which combines the challenges of understanding questions in context and reasoning over evidence gathered from heterogeneous sources like text, knowledge graphs, tables, and infoboxes.
Our method utilizes a graph structured representation to aggregate information about a question and its context.
arXiv Detail & Related papers (2024-06-14T13:28:03Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Weakly Supervised Visual Question Answer Generation [2.7605547688813172]
We present a weakly supervised method that synthetically generates question-answer pairs procedurally from visual information and captions.
We perform an exhaustive experimental analysis on the VQA dataset and find that our model significantly outperforms SOTA methods on BLEU scores.
arXiv Detail & Related papers (2023-06-11T08:46:42Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that the lectures and explanations in SQA improve question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks [18.13793282306575]
Natural language explanation (NLE) models aim at explaining the decision-making process of a black box system.
We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it.
We then address the problem of evaluating the explanations, which are often generic, data-biased, and can come in several forms.
arXiv Detail & Related papers (2022-03-09T22:57:15Z)
- Visual Question Answering based on Formal Logic [9.023122463034332]
In VQA, a series of questions is posed about a set of images, and the task at hand is to arrive at the answer.
We take a symbolic reasoning based approach using the framework of formal logic.
Our proposed method is highly interpretable and each step in the pipeline can be easily analyzed by a human.
arXiv Detail & Related papers (2021-11-08T19:43:53Z)
- Understanding Unnatural Questions Improves Reasoning over Text [54.235828149899625]
Complex question answering (CQA) over raw text is a challenging task.
Learning an effective CQA model requires large amounts of human-annotated data.
We address the challenge of learning a high-quality programmer (parser) by projecting natural human-generated questions into unnatural machine-generated questions.
arXiv Detail & Related papers (2020-10-19T10:22:16Z)
- Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.