Zero-shot Translation of Attention Patterns in VQA Models to Natural
Language
- URL: http://arxiv.org/abs/2311.05043v1
- Date: Wed, 8 Nov 2023 22:18:53 GMT
- Title: Zero-shot Translation of Attention Patterns in VQA Models to Natural
Language
- Authors: Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
- Abstract summary: ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
- Score: 65.94419474119162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Converting a model's internals to text can yield human-understandable
insights about the model. Inspired by the recent success of training-free
approaches for image captioning, we propose ZS-A2T, a zero-shot framework that
translates the transformer attention of a given model into natural language
without requiring any training. We consider this in the context of Visual
Question Answering (VQA). ZS-A2T builds on a pre-trained large language model
(LLM), which receives a task prompt, question, and predicted answer, as inputs.
The LLM is guided to select tokens which describe the regions in the input
image that the VQA model attended to. Crucially, we determine this similarity
by exploiting the text-image matching capabilities of the underlying VQA model.
Our framework does not require any training and allows the drop-in replacement
of different guiding sources (e.g. attribution instead of attention maps), or
language models. We evaluate this novel task on textual explanation datasets
for VQA, achieving state-of-the-art performance in the zero-shot setting on
GQA-REX and VQA-X. Our code is available at:
https://github.com/ExplainableML/ZS-A2T.
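In essence, the abstract describes a guided decoding loop: the LLM proposes candidate next tokens, and each candidate is re-scored with the VQA model's image-text matching ability so that the generated description stays grounded in the attended image regions. The following is a minimal Python sketch of such guided token selection under stated assumptions; all names (guided_decode, next_token_logits, itm_score, the toy stand-ins) are hypothetical placeholders, not the authors' implementation or API. In the real framework the guiding score comes from the VQA model and its attended image regions; here plain callables stand in for both the LLM and that guiding source.

```python
# Hypothetical sketch of LLM decoding guided by an image-text matching (ITM)
# score, in the spirit of ZS-A2T's training-free attention-to-text translation.
# All names are placeholders, not the authors' actual API.

from typing import Callable, Dict, List


def guided_decode(
    prompt: List[str],
    next_token_logits: Callable[[List[str]], Dict[str, float]],  # LLM: context -> {token: logit}
    itm_score: Callable[[str], float],                            # guiding source: text -> match score
    top_k: int = 5,
    alpha: float = 1.0,
    max_new_tokens: int = 20,
    eos: str = "</s>",
) -> List[str]:
    """Greedy decoding where the LLM's top-k proposals are re-ranked by how well
    the extended sentence matches the visual guiding signal."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # Keep only the LLM's top-k proposals to limit ITM evaluations.
        candidates = sorted(logits, key=logits.get, reverse=True)[:top_k]
        # Combine language-model preference with visual grounding.
        best = max(
            candidates,
            key=lambda t: logits[t] + alpha * itm_score(" ".join(tokens + [t])),
        )
        tokens.append(best)
        if best == eos:
            break
    return tokens


# Toy stand-ins so the sketch runs end-to-end.
if __name__ == "__main__":
    vocab = ["a", "dog", "cat", "on", "the", "grass", "</s>"]

    def fake_llm(ctx):
        # Discourage repetition; increasingly prefer stopping as the output grows.
        logits = {t: (-1.0 if t in ctx else 0.0) for t in vocab}
        logits["</s>"] = 0.5 * len(ctx) - 4.0
        return logits

    def fake_itm(text):
        # Pretend the VQA model attended to a dog on grass.
        return float(sum(w in text for w in ("dog", "grass")))

    out = guided_decode(
        ["Describe", "the", "attended", "regions:"], fake_llm, fake_itm, top_k=len(vocab)
    )
    print(" ".join(out))
```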
Related papers
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering
via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z) - ZYN: Zero-Shot Reward Models with Yes-No Questions for RLAIF [0.0]
We address the problem of directing the text generation of a language model towards a desired behavior.
We propose using another instruction-tuned language model as a critic reward model in a zero-shot way.
arXiv Detail & Related papers (2023-08-11T20:59:31Z) - Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose LAMOC, a language-model-guided captioning approach for knowledge-based visual question answering (VQA).
Our approach uses the captions generated by a captioning model as the context for an answer prediction model, which is a pre-trained language model (PLM); see the sketch after this entry.
arXiv Detail & Related papers (2023-05-26T15:04:20Z) - Self-Chained Image-Language Model for Video Localization and Question
- Self-Chained Image-Language Model for Video Localization and Question Answering [66.86740990630433]
We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and question answering on videos.
SeViLA consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2.
arXiv Detail & Related papers (2023-05-11T17:23:00Z) - Overcoming Language Priors in Visual Question Answering via
Distinguishing Superficially Similar Instances [17.637150597493463]
We propose a novel training framework that explicitly encourages the VQA model to distinguish between the superficially similar instances.
We exploit the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space.
Experimental results show that our method achieves the state-of-the-art performance on VQA-CP v2.
arXiv Detail & Related papers (2022-09-18T10:30:44Z) - MUST-VQA: MUltilingual Scene-text VQA [7.687215328455748]
We consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages.
We show the effectiveness of adapting multilingual language models into STVQA tasks.
arXiv Detail & Related papers (2022-09-14T15:37:56Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Image Captioning for Effective Use of Language Models in Knowledge-Based
Visual Question Answering [17.51860125438028]
We propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models.
Our results on a visual question answering task which requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models of comparable number of parameters.
arXiv Detail & Related papers (2021-09-15T14:11:29Z) - Learning from Lexical Perturbations for Consistent Visual Question
Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)