Zero-shot Translation of Attention Patterns in VQA Models to Natural
Language
- URL: http://arxiv.org/abs/2311.05043v1
- Date: Wed, 8 Nov 2023 22:18:53 GMT
- Title: Zero-shot Translation of Attention Patterns in VQA Models to Natural
Language
- Authors: Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
- Abstract summary: ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
- Score: 65.94419474119162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Converting a model's internals to text can yield human-understandable
insights about the model. Inspired by the recent success of training-free
approaches for image captioning, we propose ZS-A2T, a zero-shot framework that
translates the transformer attention of a given model into natural language
without requiring any training. We consider this in the context of Visual
Question Answering (VQA). ZS-A2T builds on a pre-trained large language model
(LLM), which receives a task prompt, question, and predicted answer, as inputs.
The LLM is guided to select tokens which describe the regions in the input
image that the VQA model attended to. Crucially, we determine this similarity
by exploiting the text-image matching capabilities of the underlying VQA model.
Our framework does not require any training and allows the drop-in replacement
of different guiding sources (e.g. attribution instead of attention maps), or
language models. We evaluate this novel task on textual explanation datasets
for VQA, achieving state-of-the-art performance in the zero-shot setting on
GQA-REX and VQA-X. Our code is available at:
https://github.com/ExplainableML/ZS-A2T.
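In essence, the abstract describes a guided decoding loop: the LLM proposes candidate next tokens, and each candidate is re-scored with the VQA model's image-text matching ability so that the generated description stays grounded in the attended image regions. The following is a minimal Python sketch of such guided token selection under stated assumptions; all names (guided_decode, next_token_logits, itm_score, the toy stand-ins) are hypothetical placeholders, not the authors' implementation or API. In the real framework the guiding score comes from the VQA model and its attended image regions; here plain callables stand in for both the LLM and that guiding source.

```python
# Hypothetical sketch of LLM decoding guided by an image-text matching (ITM)
# score, in the spirit of ZS-A2T's training-free attention-to-text translation.
# All names are placeholders, not the authors' actual API.

from typing import Callable, Dict, List


def guided_decode(
    prompt: List[str],
    next_token_logits: Callable[[List[str]], Dict[str, float]],  # LLM: context -> {token: logit}
    itm_score: Callable[[str], float],                            # guiding source: text -> match score
    top_k: int = 5,
    alpha: float = 1.0,
    max_new_tokens: int = 20,
    eos: str = "</s>",
) -> List[str]:
    """Greedy decoding where the LLM's top-k proposals are re-ranked by how well
    the extended sentence matches the visual guiding signal."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # Keep only the LLM's top-k proposals to limit ITM evaluations.
        candidates = sorted(logits, key=logits.get, reverse=True)[:top_k]
        # Combine language-model preference with visual grounding.
        best = max(
            candidates,
            key=lambda t: logits[t] + alpha * itm_score(" ".join(tokens + [t])),
        )
        tokens.append(best)
        if best == eos:
            break
    return tokens


# Toy stand-ins so the sketch runs end-to-end.
if __name__ == "__main__":
    vocab = ["a", "dog", "cat", "on", "the", "grass", "</s>"]

    def fake_llm(ctx):
        # Discourage repetition; increasingly prefer stopping as the output grows.
        logits = {t: (-1.0 if t in ctx else 0.0) for t in vocab}
        logits["</s>"] = 0.5 * len(ctx) - 4.0
        return logits

    def fake_itm(text):
        # Pretend the VQA model attended to a dog on grass.
        return float(sum(w in text for w in ("dog", "grass")))

    out = guided_decode(
        ["Describe", "the", "attended", "regions:"], fake_llm, fake_itm, top_k=len(vocab)
    )
    print(" ".join(out))
```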
Related papers
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering
via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z) - ZYN: Zero-Shot Reward Models with Yes-No Questions for RLAIF [0.0]
We address the problem of directing the text generation of a language model towards a desired behavior.
We propose using another instruction-tuned language model as a critic reward model in a zero-shot way.
arXiv Detail & Related papers (2023-08-11T20:59:31Z) - Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose LAMOC, a language-model-guided captioning approach for knowledge-based visual question answering (VQA).
Our approach uses the captions generated by a captioning model as the context for an answer prediction model, which is a pre-trained language model (PLM); see the sketch after this entry.
arXiv Detail & Related papers (2023-05-26T15:04:20Z) - Self-Chained Image-Language Model for Video Localization and Question
- Self-Chained Image-Language Model for Video Localization and Question Answering [66.86740990630433]
We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and question answering on videos.
SeViLA consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2.
arXiv Detail & Related papers (2023-05-11T17:23:00Z) - Overcoming Language Priors in Visual Question Answering via
Distinguishing Superficially Similar Instances [17.637150597493463]
We propose a novel training framework that explicitly encourages the VQA model to distinguish between the superficially similar instances.
We exploit the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space.
Experimental results show that our method achieves the state-of-the-art performance on VQA-CP v2.
arXiv Detail & Related papers (2022-09-18T10:30:44Z) - MUST-VQA: MUltilingual Scene-text VQA [7.687215328455748]
We consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages.
We show the effectiveness of adapting multilingual language models into STVQA tasks.
arXiv Detail & Related papers (2022-09-14T15:37:56Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Image Captioning for Effective Use of Language Models in Knowledge-Based
Visual Question Answering [17.51860125438028]
We propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models.
Our results on a visual question answering task which requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models of comparable number of parameters.
arXiv Detail & Related papers (2021-09-15T14:11:29Z) - Learning from Lexical Perturbations for Consistent Visual Question
Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)