Visual Question Answering Using Semantic Information from Image
Descriptions
- URL: http://arxiv.org/abs/2004.10966v2
- Date: Sat, 3 Apr 2021 18:09:22 GMT
- Title: Visual Question Answering Using Semantic Information from Image
Descriptions
- Authors: Tasmia Tasrin, Md Sultan Al Nahian and Brent Harrison
- Abstract summary: We propose a deep neural architecture that uses an attention mechanism which utilizes region based image features, the natural language question asked, and semantic knowledge extracted from the regions of an image to produce open-ended answers for questions asked in a visual question answering (VQA) task.
- Score: 2.6519061087638014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a deep neural architecture that uses an attention
mechanism which utilizes region based image features, the natural language
question asked, and semantic knowledge extracted from the regions of an image
to produce open-ended answers for questions asked in a visual question
answering (VQA) task. The combination of both region based features and region
based textual information about the image bolsters a model to more accurately
respond to questions and potentially do so with less required training data. We
evaluate our proposed architecture on a VQA task against a strong baseline and
show that our method achieves excellent results on this task.
Related papers
- Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images [1.6932802756478726]
Visual Question Answering for Remote Sensing (RSVQA) is a task that aims at answering natural language questions about the content of a remote sensing image.
We propose to embed an attention mechanism guided by segmentation into a RSVQA pipeline.
We provide a new VQA dataset that exploits very high-resolution RGB orthophotos annotated with 16 segmentation classes and question/answer pairs.
arXiv Detail & Related papers (2024-07-11T16:59:32Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task as it requires understanding, reasoning, and inferring about the vision and the language content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z) - Question-Answer Cross Language Image Matching for Weakly Supervised
Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS)
arXiv Detail & Related papers (2024-01-18T10:55:13Z) - Language Guided Visual Question Answering: Elevate Your Multimodal
Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z) - From Pixels to Objects: Cubic Visual Attention for Visual Question
Answering [132.95819467484517]
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing question to target different visual areas that are related to the answer.
We propose a Cubic Visual Attention (CVA) model by successfully applying a novel channel and spatial attention on object regions to improve VQA task.
Experimental results show that our proposed method significantly outperforms the state-of-the-arts.
arXiv Detail & Related papers (2022-06-04T07:03:18Z) - REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual
Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA)
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z) - MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media
Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z) - Coarse-to-Fine Reasoning for Visual Question Answering [18.535633096397397]
We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
We then propose a new reasoning framework to effectively jointly learn these features and predicates in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-10-06T06:29:52Z) - CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP)
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.