Understanding Attention for Vision-and-Language Tasks
- URL: http://arxiv.org/abs/2208.08104v1
- Date: Wed, 17 Aug 2022 06:45:07 GMT
- Title: Understanding Attention for Vision-and-Language Tasks
- Authors: Feiqi Cao, Soyeon Caren Han, Siqu Long, Changwei Xu, Josiah Poon
- Abstract summary: We conduct a comprehensive analysis on understanding the role of attention alignment by looking into the attention score calculation methods.
We also analyse the conditions under which an attention score calculation mechanism is more (or less) interpretable.
Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks.
- Score: 4.752823994295959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention mechanism has been used as an important component across
Vision-and-Language (VL) tasks in order to bridge the semantic gap between
visual and textual features. While attention has been widely used in VL tasks,
the capability of different attention alignment calculations to bridge the
semantic gap between visual and textual clues has not been examined. In
this research, we conduct a comprehensive analysis of the role of
attention alignment by looking into the attention score calculation methods and
checking how they actually represent the significance of visual regions and
textual tokens for the global assessment. We also analyse the conditions under
which an attention score calculation mechanism would be more (or less)
interpretable, and which may impact the model performance on three different VL
tasks: visual question answering, text-to-image generation, and text-and-image
matching (both sentence and image retrieval). Our analysis is the first of its
kind and provides useful insights into the importance of each attention alignment
score calculation when applied at the training phase of VL tasks, which is commonly
ignored in attention-based cross-modal models and/or pretrained models.
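The abstract refers to "attention score calculation methods" without naming them. As a rough illustration only, the sketch below compares three alignment score functions that are common in the attention literature (dot product, scaled dot product, and a bilinear "general" score). The choice of these three, the toy vectors, and every function name are assumptions made for this example; they are not taken from the paper.

```python
import math

def dot_score(q, k):
    """Dot-product alignment score between one query and one key vector."""
    return sum(qi * ki for qi, ki in zip(q, k))

def scaled_dot_score(q, k):
    """Dot product divided by sqrt(d), where d is the vector dimension."""
    return dot_score(q, k) / math.sqrt(len(q))

def general_score(q, k, W):
    """Bilinear ("general") score q^T W k; W would be learned in practice."""
    Wk = [sum(w * kj for w, kj in zip(row, k)) for row in W]
    return dot_score(q, Wk)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, score_fn):
    """Normalise the alignment scores of one query against all keys."""
    return softmax([score_fn(query, k) for k in keys])

# Hypothetical toy data: one text-token query and three image-region keys.
query = [1.0, 0.0, 1.0, 0.0]
regions = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
# Identity matrix as a stand-in for a learned W (an assumption).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

score_fns = {
    "dot": dot_score,
    "scaled dot": scaled_dot_score,
    "general": lambda q, k: general_score(q, k, identity),
}
for name, fn in score_fns.items():
    weights = attention_weights(query, regions, fn)
    print(name, [round(w, 3) for w in weights])
```

With these toy inputs, all three functions rank the first region highest, but the scaled variant spreads the weights more evenly. This is the kind of difference that can change how a score function's attention map reads when it is inspected for interpretability.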
Related papers
- VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models [2.0718016474717196]
Integrated Vision and Language Models (VLMs) are frequently regarded as black boxes within the machine learning research community.
We present an image-text aligned human visual attention dataset that maps specific associations between image regions and corresponding text segments.
We then compare the internal heatmaps generated by VL models with this dataset, allowing us to analyze and better understand the model's decision-making process.
arXiv Detail & Related papers (2024-10-06T20:11:53Z)
- Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities [18.859309032300402]
We investigate how the integration of information from image and text modalities influences the performance and behavior of Visual Language Model (VLM) predictions.
We study the interplay between text and image modalities in different configurations where visual content is essential for solving the VQA task.
Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence.
arXiv Detail & Related papers (2024-10-02T16:02:02Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and controllably achieves top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Variational Structured Attention Networks for Deep Visual Representation Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework.
We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z)
- Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks [9.462808515258464]
We propose an attention mechanism - Linguistically-aware Attention (LAT) - that leverages object attributes obtained from generic object detectors.
LAT represents visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process.
We apply and demonstrate the effectiveness of LAT in three Vision-language (V-L) tasks: Counting-VQA, VQA, and Image captioning.
arXiv Detail & Related papers (2020-08-18T16:29:49Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
- Cross-Modality Relevance for Reasoning on Language and Vision [22.41781462637622]
This work deals with the challenge of learning and reasoning over language and vision data for related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR).
We design a novel cross-modality relevance module that is used in an end-to-end framework to learn the relevance representation between components of various input modalities under the supervision of a target task.
Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the state-of-the-art published results.
arXiv Detail & Related papers (2020-05-12T20:17:25Z)
- A Novel Attention-based Aggregation Function to Combine Vision and Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language.
Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention.
We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices.
arXiv Detail & Related papers (2020-04-27T18:09:46Z)
- Salience Estimation with Multi-Attention Learning for Abstractive Text Summarization [86.45110800123216]
In the task of text summarization, salience estimation for words, phrases or sentences is a critical component.
We propose a Multi-Attention Learning framework which contains two new attention learning components for salience estimation.
arXiv Detail & Related papers (2020-04-07T02:38:56Z)