Weakly Supervised Grounding for VQA in Vision-Language Transformers
- URL: http://arxiv.org/abs/2207.02334v1
- Date: Tue, 5 Jul 2022 22:06:03 GMT
- Title: Weakly Supervised Grounding for VQA in Vision-Language Transformers
- Authors: Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels Da Vitoria Lobo,
Mubarak Shah
- Abstract summary: This paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers.
The approach leverages capsules by grouping the visual tokens in the visual encoder and masking them with a text-guided selection module.
We evaluate our approach on the challenging GQA and VQA-HAT datasets for VQA grounding.
- Score: 112.5344267669495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers for visual-language representation learning have attracted a
lot of interest and shown tremendous performance on visual question answering
(VQA) and grounding. However, most systems that perform well on these tasks
still rely on pre-trained object detectors during training, which limits their
applicability to the object classes available to those detectors. To mitigate
this limitation, this paper focuses on the problem of weakly supervised
grounding in the context of visual question answering in transformers. The
approach leverages capsules by grouping each visual token in the visual
encoder and uses activations from the language self-attention layers as a
text-guided selection module to mask those capsules before they are forwarded
to the next layer. We evaluate our approach on the challenging GQA dataset as
well as the VQA-HAT dataset for VQA grounding. Our experiments show that while
removing the information of masked objects from standard transformer
architectures leads to a significant drop in performance, the integration of
capsules significantly improves the grounding ability of such systems and
provides new state-of-the-art results compared to other approaches in the field.
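The core mechanism described above is to group visual tokens into capsules and use the question representation, taken from the language self-attention activations, to mask capsules that are irrelevant to the text before the features enter the next transformer layer. Below is a minimal PyTorch sketch of that idea; the module name, the linear capsule grouping, the pooled language summary, and the softmax-based selection are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: text-guided capsule masking as described in the abstract.
# All shapes, layer choices, and names here are assumptions for illustration.
import torch
import torch.nn as nn


class TextGuidedCapsuleMask(nn.Module):
    def __init__(self, dim: int, num_capsules: int):
        super().__init__()
        # Group each visual token's features into `num_capsules` capsules.
        self.to_capsules = nn.Linear(dim, num_capsules * dim)
        # Project a pooled language self-attention summary to one score per
        # capsule (assumed projection; the paper may derive scores differently).
        self.text_to_scores = nn.Linear(dim, num_capsules)
        self.num_capsules = num_capsules
        self.dim = dim

    def forward(self, visual_tokens: torch.Tensor, lang_summary: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D) token features from the visual encoder
        # lang_summary:  (B, D) pooled activations from language self-attention
        B, N, D = visual_tokens.shape
        caps = self.to_capsules(visual_tokens).view(B, N, self.num_capsules, D)
        # Text-guided selection: soft weights over capsules, shared across tokens.
        sel = torch.softmax(self.text_to_scores(lang_summary), dim=-1)  # (B, K)
        # Down-weight capsules irrelevant to the question, then collapse back
        # to one feature per visual token before the next transformer layer.
        masked = caps * sel[:, None, :, None]
        return masked.sum(dim=2)  # (B, N, D)


# Toy usage
if __name__ == "__main__":
    module = TextGuidedCapsuleMask(dim=768, num_capsules=8)
    vis = torch.randn(2, 36, 768)   # 36 visual tokens
    lang = torch.randn(2, 768)      # pooled question representation
    print(module(vis, lang).shape)  # torch.Size([2, 36, 768])
```

In this sketch the selection weights act as a soft mask; a hard top-k mask over capsules would be a natural variant, but which form the paper uses is not specified in the abstract.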
Related papers
- Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection [82.65760006883248]
We introduce a new task named Change Detection Question Answering and Grounding (CDQAG).
CDQAG extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence.
We construct the first CDQAG benchmark dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks.
arXiv Detail & Related papers (2024-10-31T11:20:13Z) - Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z) - Learning from Visual Observation via Offline Pretrained State-to-Go
Transformer [29.548242447584194]
We propose a two-stage framework for learning from visual observation.
In the first stage, we pretrain State-to-Go Transformer offline to predict and differentiate latent transitions of demonstrations.
In the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks.
arXiv Detail & Related papers (2023-06-22T13:14:59Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z) - Found a Reason for me? Weakly-supervised Grounded Visual Question
Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has recently seen increased attention in the research community.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z) - Visual Grounding with Transformers [43.40192909920495]
Our approach is built on top of a transformer encoder-decoder and is independent of any pretrained detectors or word embedding models.
Our method outperforms state-of-the-art proposal-free approaches by a considerable margin on five benchmarks.
arXiv Detail & Related papers (2021-05-10T11:46:12Z)