Augmenting Visual Question Answering with Semantic Frame Information in a Multitask Learning Approach
- URL: http://arxiv.org/abs/2001.11673v1
- Date: Fri, 31 Jan 2020 06:31:39 GMT
- Title: Augmenting Visual Question Answering with Semantic Frame Information in a Multitask Learning Approach
- Authors: Mehrdad Alizadeh, Barbara Di Eugenio
- Abstract summary: We propose a multitask CNN-LSTM VQA model that learns to classify the answers as well as the semantic frame elements.
Our experiments show that semantic frame element classification helps the VQA system avoid inconsistent responses and improves performance.
- Score: 1.827510863075184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question Answering (VQA) concerns providing answers to natural language questions about images. Several deep neural network approaches have been proposed to model the task in an end-to-end fashion. While the task is grounded in visual processing, the language understanding component becomes crucial when the question focuses on events described by verbs. Our hypothesis is that models should be aware of verb semantics, as expressed via semantic role labels, argument types, and/or frame elements. Unfortunately, no VQA dataset exists that includes verb semantic information. Our first contribution is a new VQA dataset (imSituVQA) that we built by taking advantage of the imSitu annotations. The imSitu dataset consists of images manually labeled with semantic frame elements, mostly taken from FrameNet. Second, we propose a multitask CNN-LSTM VQA model that learns to classify the answers as well as the semantic frame elements. Our experiments show that semantic frame element classification helps the VQA system avoid inconsistent responses and improves performance.
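To make the proposed architecture concrete, below is a minimal PyTorch sketch of a multitask CNN-LSTM VQA model: a shared question/image representation feeding two classifier heads, one for answers and one for semantic frame elements. The layer sizes, the element-wise fusion, and the equal loss weighting are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskVQA(nn.Module):
    """Shared CNN-LSTM encoder with two task heads (answers, frame elements)."""

    def __init__(self, vocab_size, num_answers, num_frame_elements,
                 embed_dim=300, hidden_dim=512, img_feat_dim=2048):
        super().__init__()
        # Question encoder: word embeddings + LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Image features are assumed pre-extracted by a CNN (e.g. a ResNet pooling layer).
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Two task-specific heads over the same fused representation.
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        self.frame_head = nn.Linear(hidden_dim, num_frame_elements)

    def forward(self, question_ids, img_feats):
        _, (h, _) = self.lstm(self.embed(question_ids))
        fused = torch.tanh(self.img_proj(img_feats)) * h[-1]  # element-wise fusion
        return self.answer_head(fused), self.frame_head(fused)

# Multitask training signal: sum of the two cross-entropy losses.
model = MultitaskVQA(vocab_size=10000, num_answers=1000, num_frame_elements=200)
questions = torch.randint(0, 10000, (4, 12))   # batch of tokenized questions
images = torch.randn(4, 2048)                  # batch of CNN image features
ans_logits, frame_logits = model(questions, images)
loss = F.cross_entropy(ans_logits, torch.randint(0, 1000, (4,))) \
     + F.cross_entropy(frame_logits, torch.randint(0, 200, (4,)))
```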
Related papers
- Spoken question answering for visual queries [14.834200714168546]
This work aims to create a system that enables user interaction through both speech and images.
The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images.
arXiv Detail & Related papers (2025-05-29T10:06:48Z)
- SADL: An Effective In-Context Learning Method for Compositional Visual QA [22.0603596548686]
Large vision-language models (LVLMs) offer a novel capability for performing in-context learning (ICL) in Visual QA.
This paper introduces SADL, a new visual-linguistic prompting framework for the task.
arXiv Detail & Related papers (2024-07-02T06:41:39Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately (see the sketch below).
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
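As a rough illustration of the language-guidance idea in the entry above, here is a toy prompt builder that prefixes a question with captions, rationales, or scene-graph strings before it is passed to a QA model. The template and field names are assumptions for illustration, not the paper's actual prompts.

```python
def build_lg_prompt(question, caption=None, rationale=None, scene_graph=None):
    """Prefix a VQA question with optional language guidance."""
    guidance = []
    if caption:
        guidance.append(f"Image caption: {caption}")
    if rationale:
        guidance.append(f"Rationale: {rationale}")
    if scene_graph:
        guidance.append(f"Scene graph: {scene_graph}")
    return "\n".join(guidance + [f"Question: {question}", "Answer:"])

print(build_lg_prompt(
    "What is the man about to do?",
    caption="A man holds a frisbee in a park.",
    rationale="An extended arm holding a frisbee usually precedes a throw.",
))
```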
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process (sketched below).
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
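A minimal sketch of the construct / prune / rank flow named in the GATHER entry above, over a toy knowledge graph. The graph, the hop-count pruning rule, and the lexical-overlap scorer are illustrative stand-ins for the paper's learned ranker.

```python
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Paris", "France", relation="capital_of")
kg.add_edge("France", "Europe", relation="located_in")
kg.add_edge("Paris", "Seine", relation="on_river")

def candidate_paths(graph, start, max_hops=2):
    """Graph construction step: enumerate simple paths from the question entity."""
    for node in graph.nodes:
        if node != start:
            yield from nx.all_simple_paths(graph, start, node, cutoff=max_hops)

def score(path, question_tokens):
    """Stand-in for a learned path ranker: lexical overlap with the question."""
    return len({node.lower() for node in path} & question_tokens)

question = {"is", "paris", "in", "europe"}
paths = [p for p in candidate_paths(kg, "Paris") if len(p) <= 3]  # pruning step
best = max(paths, key=lambda p: score(p, question))
print(best)  # top-ranked inference path: ['Paris', 'France', 'Europe']
```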
- Syntax Tree Constrained Graph Network for Visual Question Answering [14.059645822205718]
Visual Question Answering (VQA) aims to automatically answer natural language questions related to given image content.
We propose a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax trees.
We design a message-passing mechanism for phrase-aware visual entities that captures entity features according to a given visual context (sketched below).
arXiv Detail & Related papers (2023-09-17T07:03:54Z)
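The following is a minimal sketch of one round of phrase-aware entity message passing, in the spirit of the STCGN entry above. The feature sizes, the toy adjacency, and the phrase-similarity weighting are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

def message_passing_step(entity_feats, adjacency, phrase_feat):
    """One update: entities aggregate neighbor features weighted by phrase relevance.

    entity_feats: (N, D) visual entity features
    adjacency:    (N, N) 0/1 scene connectivity
    phrase_feat:  (D,)   encoding of one question phrase
    """
    relevance = torch.softmax(entity_feats @ phrase_feat, dim=0)        # (N,) sender weights
    weights = adjacency * relevance.unsqueeze(0)                        # phrase-aware edges
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return entity_feats + weights @ entity_feats                        # residual update

entities = torch.randn(5, 64)            # 5 detected visual entities
adj = (torch.rand(5, 5) > 0.5).float()   # toy scene-graph connectivity
phrase = torch.randn(64)                 # a parsed question phrase embedding
updated = message_passing_step(entities, adj, phrase)
```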
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in existing datasets focus mostly on the text present in the image.
Models trained on such data predict biased answers because they lack an understanding of the visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general-purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models that are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- Semantic Parsing for Conversational Question Answering over Knowledge Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to the execution results thereof (a toy example follows below).
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
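To make the SPARQL annotation format concrete, here is a toy example (using rdflib and an invented example namespace, not the actual SPICE data) of a user question paired with a SPARQL parse whose execution result serves as the system answer.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
kg = Graph()
kg.add((EX.Barcelona, EX.locatedIn, EX.Spain))

# Question: "Which country is Barcelona in?" -> annotated SPARQL parse:
parse = """
PREFIX ex: <http://example.org/>
SELECT ?country WHERE { ex:Barcelona ex:locatedIn ?country . }
"""
for row in kg.query(parse):
    print(row.country)   # the execution result is the system answer
```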
- Text-Aware Dual Routing Network for Visual Question Answering [11.015339851906287]
Existing approaches often fail in cases that require reading and understanding text in images to answer questions.
We propose a Text-Aware Dual Routing Network (TDR) that simultaneously handles VQA cases with and without text understanding in the input images (sketched below).
In the branch that involves text understanding, we incorporate Optical Character Recognition (OCR) features into the model to help understand the text in the images.
arXiv Detail & Related papers (2022-11-17T02:02:11Z)
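Below is a minimal sketch of the dual-routing idea in the TDR entry above: one branch answers from the fused visual-question features alone, a second branch also consumes OCR token features, and a learned gate mixes the two routes. The dimensions and the sigmoid gate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualRouting(nn.Module):
    """Two answer branches (with / without OCR features) mixed by a learned gate."""

    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.visual_branch = nn.Linear(dim, num_answers)
        self.text_branch = nn.Linear(dim * 2, num_answers)  # fused + OCR features
        self.gate = nn.Sequential(nn.Linear(dim * 2, 1), nn.Sigmoid())

    def forward(self, fused_vq, ocr_feats):
        both = torch.cat([fused_vq, ocr_feats], dim=-1)
        g = self.gate(both)                           # route weight in [0, 1]
        return g * self.text_branch(both) + (1 - g) * self.visual_branch(fused_vq)

model = DualRouting()
logits = model(torch.randn(2, 512), torch.randn(2, 512))  # (2, 1000) answer scores
```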
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need (sketched below).
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
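As a rough sketch of the reinforcement-learning idea in the CapWAP entry above: a sampled caption earns a reward when a downstream QA model answers the user's question correctly from it, and REINFORCE pushes the captioner toward such captions. The reward values and the toy log-probabilities are hypothetical stand-ins, not the paper's setup.

```python
import torch

def reinforce_loss(caption_log_probs, reward, baseline=0.0):
    """REINFORCE: scale the sampled caption's log-likelihood by its advantage.

    caption_log_probs: (T,) log p(w_t | context) of the sampled tokens
    reward: e.g. 1.0 if a QA model answers correctly from the caption, else 0.0
    """
    return -(reward - baseline) * caption_log_probs.sum()

# Toy log-probabilities for an 8-token sampled caption.
logits = torch.randn(8, 30, requires_grad=True)              # (T, vocab)
log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(8), 0]
loss = reinforce_loss(log_probs, reward=1.0, baseline=0.5)
loss.backward()   # gradients favor captions the QA model can answer from
```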
- Visual Question Answering with Prior Class Semantics [50.845003775809836]
We show how to exploit additional information pertaining to the semantics of candidate answers.
We extend the answer prediction process with a regression objective in a semantic space (sketched below).
Our method brings improvements in consistency and accuracy over a range of question types.
arXiv Detail & Related papers (2020-05-04T02:46:31Z)
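A minimal sketch of the combined objective the entry above describes: the usual answer-classification loss plus a regression term that pulls a predicted vector toward the correct answer's embedding in a semantic space. The cosine loss, the embedding source, and the weighting alpha are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semantic_vqa_loss(logits, pred_embed, target_idx, answer_embeds, alpha=0.5):
    """Classification + semantic regression.

    logits:        (B, A) answer scores
    pred_embed:    (B, D) predicted vector in the semantic answer space
    answer_embeds: (A, D) fixed embeddings (e.g. word vectors) of all answers
    """
    cls_loss = F.cross_entropy(logits, target_idx)
    target_embed = answer_embeds[target_idx]                       # (B, D)
    reg_loss = 1 - F.cosine_similarity(pred_embed, target_embed).mean()
    return cls_loss + alpha * reg_loss

B, A, D = 4, 1000, 300
loss = semantic_vqa_loss(torch.randn(B, A), torch.randn(B, D),
                         torch.randint(0, A, (B,)), torch.randn(A, D))
```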
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.