Knowledge Detection by Relevant Question and Image Attributes in Visual
Question Answering
- URL: http://arxiv.org/abs/2306.04938v1
- Date: Thu, 8 Jun 2023 05:08:32 GMT
- Title: Knowledge Detection by Relevant Question and Image Attributes in Visual
Question Answering
- Authors: Param Ahir, Dr. Hiteishi Diwanji
- Abstract summary: Visual question answering (VQA) is a multidisciplinary research problem pursued through the practices of natural language processing and computer vision.
Our proposed method takes image attributes and question features as input for the knowledge derivation module and retrieves only question-relevant knowledge about image objects, which can provide accurate answers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering (VQA) is a multidisciplinary research problem
pursued through the practices of natural language processing and computer vision.
Visual question answering automatically answers natural language questions
according to the content of an image. Some test questions require external
knowledge to derive a solution. Such knowledge-based VQA uses various methods
to retrieve features of the image and text, and combines them to generate the
answer. To generate knowledge-based answers, either question-dependent or
image-dependent knowledge retrieval methods are used. If knowledge about all
the objects in the image is derived, then not all of that knowledge is relevant
to the question. On the other side, using only question-related knowledge may
lead to incorrect answers and an overtrained model that answers questions
irrelevant to the image. Our proposed method takes image attributes and
question features as input for the knowledge derivation module and retrieves
only question-relevant knowledge about image objects, which can provide
accurate answers.
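
The retrieval idea in the abstract can be pictured as a small filtering pipeline: detect image attributes, extract question features, and keep only the knowledge facts that touch both. The Python sketch below is a minimal illustration under assumed, hypothetical names (Fact, extract_image_attributes, encode_question, derive_relevant_knowledge); it uses simple keyword overlap as a stand-in for the paper's learned knowledge derivation module and does not reproduce the actual model.

from dataclasses import dataclass


@dataclass
class Fact:
    subject: str  # image object the fact is about
    text: str     # natural-language knowledge statement


def extract_image_attributes(detected_objects):
    # Stand-in for an object/attribute detector: just return the labels as a set.
    return set(detected_objects)


def encode_question(question):
    # Rough proxy for question features: lowercased content words.
    stopwords = {"what", "which", "is", "the", "a", "an", "of", "in", "on", "this", "that"}
    return {w.strip("?.,") for w in question.lower().split()} - stopwords


def derive_relevant_knowledge(detected_objects, question, knowledge_base):
    # Keep only facts that concern a detected object AND overlap the question terms.
    attributes = extract_image_attributes(detected_objects)
    question_terms = encode_question(question)
    relevant = []
    for fact in knowledge_base:
        if fact.subject in attributes:
            fact_terms = set(fact.text.lower().split())
            if question_terms & fact_terms:  # question-relevant knowledge only
                relevant.append(fact)
    return relevant


if __name__ == "__main__":
    kb = [
        Fact("banana", "a banana is a fruit rich in potassium"),
        Fact("banana", "bananas grow in tropical climates"),
        Fact("table", "a table is a piece of furniture"),
    ]
    facts = derive_relevant_knowledge(
        detected_objects=["banana", "table"],
        question="Which fruit in the image is rich in potassium?",
        knowledge_base=kb,
    )
    for fact in facts:
        print(fact.text)  # only the potassium fact passes both filters

In a full system the keyword overlap would be replaced by learned question and attribute features, but the control flow (attribute gating followed by question-relevance filtering) mirrors the two-sided retrieval the abstract argues for.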
Related papers
- Language Guided Visual Question Answering: Elevate Your Multimodal
Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z) - ChiQA: A Large Scale Image-based Real-World Question Answering Dataset
for Multi-Modal Understanding [42.5118058527339]
ChiQA contains more than 40K questions and more than 200K question-images pairs.
ChiQA requires a deep understanding of both language and vision, including grounding, comparisons, and reading.
We evaluate several state-of-the-art visual-language models such as ALBEF, demonstrating that there is still a large room for improvements on ChiQA.
arXiv Detail & Related papers (2022-08-05T07:55:28Z) - K-VQG: Knowledge-aware Visual Question Generation for Common-sense
Acquisition [64.55573343404572]
We present a novel knowledge-aware VQG dataset called K-VQG.
This is the first large, humanly annotated dataset in which questions regarding images are tied to structured knowledge.
We also develop a new VQG model that can encode and use knowledge as the target for a question.
arXiv Detail & Related papers (2022-03-15T13:38:10Z) - Can Open Domain Question Answering Systems Answer Visual Knowledge
Questions? [7.442099405543527]
We observe that many visual questions, which contain deictic referential phrases referring to entities in the image, can be rewritten as "non-grounded" questions.
This allows for the reuse of existing text-based Open Domain Question Answering (QA) Systems for visual question answering.
We propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions.
arXiv Detail & Related papers (2022-02-09T06:47:40Z) - An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [51.639880603821446]
We propose PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions for knowledge-based VQA.
We first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner (an illustrative prompt-construction sketch follows this list).
By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset.
arXiv Detail & Related papers (2021-09-10T17:51:06Z) - Multi-Modal Answer Validation for Knowledge-Based VQA [44.80209704315099]
We propose Multi-modal Answer Validation using External knowledge (MAVEx)
The idea is to validate a set of promising answer candidates based on answer-specific knowledge retrieval.
Our experiments with OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-03-23T00:49:36Z) - KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain
Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z) - Generating Natural Questions from Images for Multimodal Assistants [4.930442416763205]
We present an approach for generating diverse and meaningful questions that consider the image content and image metadata.
We evaluate our approach using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr.
arXiv Detail & Related papers (2020-11-17T19:12:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.