Found a Reason for me? Weakly-supervised Grounded Visual Question
Answering using Capsules
- URL: http://arxiv.org/abs/2105.04836v1
- Date: Tue, 11 May 2021 07:45:32 GMT
- Title: Found a Reason for me? Weakly-supervised Grounded Visual Question
Answering using Capsules
- Authors: Aisha Urooj Khan, Hilde Kuehne, Kevin Duarte, Chuang Gan, Niels Lobo,
Mubarak Shah
- Abstract summary: The problem of grounding VQA tasks has seen increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
- Score: 85.98177341704675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of grounding VQA tasks has seen increased attention in
the research community recently, with most attempts focusing on solving this
task by using pre-trained object detectors. However, pre-trained object
detectors require bounding box annotations for detecting relevant objects in
the vocabulary, which may not always be feasible for real-life large-scale
applications. In this paper, we focus on a more relaxed setting: the grounding
of relevant visual entities in a weakly supervised manner by training on the
VQA task alone. To address this problem, we propose a visual capsule module
with a query-based selection mechanism of capsule features that allows the
model to focus on relevant regions based on the textual cues about visual
information in the question. We show that integrating the proposed capsule
module into existing VQA systems significantly improves their performance on
the weakly supervised grounding task. Overall, we demonstrate the effectiveness
of our approach on two state-of-the-art VQA systems, stacked NMN and MAC, on
the CLEVR-Answers benchmark, our new evaluation set based on CLEVR scenes with
ground-truth bounding boxes for objects that are relevant for the correct
answer, as well as on GQA, a real-world VQA dataset with compositional
questions. We show that the systems with the proposed capsule module
consistently outperform the respective baseline systems in terms of answer
grounding, while achieving comparable performance on the VQA task.
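To make the query-based selection mechanism concrete, here is a minimal PyTorch sketch (not the authors' released code) of one way a question embedding could score a set of visual capsules and suppress the ones irrelevant to the question; the projection layer, the hard top-k selection, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryBasedCapsuleSelection(nn.Module):
    """Illustrative sketch: score visual capsules against a question
    embedding and keep only the top-k capsules (the rest are zeroed out)."""

    def __init__(self, capsule_dim: int, query_dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.query_proj = nn.Linear(query_dim, capsule_dim)  # map question into capsule space

    def forward(self, capsules: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # capsules: (B, N, D) capsule features; query: (B, Q) question embedding
        q = self.query_proj(query)                               # (B, D)
        scores = torch.einsum("bnd,bd->bn", capsules, q)         # per-capsule relevance to the query
        keep = scores.topk(self.k, dim=1).indices                # indices of the selected capsules
        mask = torch.zeros_like(scores).scatter_(1, keep, 1.0)   # binary selection mask
        return capsules * mask.unsqueeze(-1)                     # suppress non-selected capsules

# Toy usage: 8 capsules of dimension 64, question embedding of dimension 128.
capsules = torch.randn(2, 8, 64)
question = torch.randn(2, 128)
selected = QueryBasedCapsuleSelection(capsule_dim=64, query_dim=128)(capsules, question)
print(selected.shape)  # torch.Size([2, 8, 64])
```

The masked capsule tensor would then stand in for the full visual feature set fed to the downstream reasoning module (e.g. MAC or stacked NMN), which is roughly the integration point the abstract describes.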
Related papers
- Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection [82.65760006883248]
We introduce a new task named Change Detection Question Answering and Grounding (CDQAG).
CDQAG extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence.
We construct the first CDQAG benchmark dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks.
arXiv Detail & Related papers (2024-10-31T11:20:13Z)
- Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs [5.891295920078768]
We introduce an advanced approach for fine-grained object visual key field detection.
First, we use the segment anything model (SAM) to generate detailed spatial maps of objects in images.
Next, we use Vision Studio to extract semantic object descriptions.
Third, we employ GPT-4's common-sense knowledge to bridge the gap between an object's semantics and its spatial map.
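As a rough, hypothetical illustration of this three-step pipeline (not the paper's implementation), the sketch below uses the public segment-anything API for the mask-proposal step and leaves the description and LLM-matching steps as clearly labeled placeholder functions, since their exact interfaces are not given here.

```python
# Hypothetical pipeline sketch; describe_region and pick_key_region are
# placeholders standing in for the semantic-description service (e.g. Vision
# Studio) and the GPT-4 matching step, not real APIs from the paper.
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry


def describe_region(image: np.ndarray, mask: np.ndarray) -> str:
    """Placeholder: return a short semantic description of the masked object."""
    raise NotImplementedError


def pick_key_region(question: str, descriptions: list[str]) -> int:
    """Placeholder: ask an LLM which description best matches the question
    and return its index."""
    raise NotImplementedError


def localize_key_field(image: np.ndarray, question: str, checkpoint: str):
    # 1) SAM proposes detailed spatial maps (masks) for objects in the image.
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    masks = SamAutomaticMaskGenerator(sam).generate(image)
    # 2) Each proposed region gets a semantic object description.
    descriptions = [describe_region(image, m["segmentation"]) for m in masks]
    # 3) Common-sense matching links the question to the most relevant region.
    best = pick_key_region(question, descriptions)
    return masks[best]["bbox"]  # key field (x, y, w, h) for downstream VQA
```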
arXiv Detail & Related papers (2024-04-01T14:53:36Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z)
- Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization [119.23191388798921]
This paper deals with the problem of localizing objects in image and video datasets from visual exemplars.
We first identify grave implicit biases in current query-conditioned model design and visual query datasets.
We propose a novel transformer-based module that allows for object-proposal set context to be considered.
arXiv Detail & Related papers (2022-11-18T22:50:50Z)
- Visually Grounded VQA by Lattice-based Retrieval [24.298908211088072]
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions.
In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task.
Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question.
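As a toy illustration of the retrieval formulation (not the authors' system), the snippet below builds such a weighted, directed, acyclic "lattice" with networkx and reads out the maximum-weight path as the grounded answer; the node names and edge weights are invented for the example.

```python
import networkx as nx

# Toy lattice: nodes stand for scene-graph regions matched by referring
# expressions from the question; edge weights are made-up match scores.
lattice = nx.DiGraph()
lattice.add_edge("start", "region:red_cube", weight=0.9)
lattice.add_edge("start", "region:blue_ball", weight=0.2)
lattice.add_edge("region:red_cube", "answer:metal", weight=0.8)
lattice.add_edge("region:blue_ball", "answer:rubber", weight=0.3)

# Retrieval reads out the maximum-weight path through the DAG.
best_path = nx.dag_longest_path(lattice, weight="weight")
print(best_path)  # ['start', 'region:red_cube', 'answer:metal']
```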
arXiv Detail & Related papers (2022-11-15T12:12:08Z)
- Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task [12.74065821307626]
VQA is an ambitious task aiming to answer any image-related question.
It is hard to build such a system once and for all, since the needs of users are continuously updated.
We propose a real-data-free replay-based method tailored for CL on VQA, named Scene Graph as Prompt for Replay.
arXiv Detail & Related papers (2022-08-24T12:00:02Z)
- Weakly Supervised Grounding for VQA in Vision-Language Transformers [112.5344267669495]
This paper focuses on the problem of weakly supervised grounding in context of visual question answering in transformers.
The approach leverages capsules by grouping each visual token in the visual encoder.
We evaluate our approach on the challenging GQA as well as VQA-HAT dataset for VQA grounding.
arXiv Detail & Related papers (2022-07-05T22:06:03Z)
- From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data [27.160303686163164]
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representation.
There are questions with clearly different difficulty levels for each image in the RSVQA task.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
arXiv Detail & Related papers (2022-05-06T11:37:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.