Repulsive Attention: Rethinking Multi-head Attention as Bayesian
Inference
- URL: http://arxiv.org/abs/2009.09364v2
- Date: Mon, 2 Nov 2020 02:22:48 GMT
- Title: Repulsive Attention: Rethinking Multi-head Attention as Bayesian
Inference
- Authors: Bang An, Jie Lyu, Zhenyi Wang, Chunyuan Li, Changwei Hu, Fei Tan,
Ruiyi Zhang, Yifan Hu, Changyou Chen
- Abstract summary: We provide a novel understanding of multi-head attention from a Bayesian perspective.
We propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention.
Experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity.
- Score: 68.12511526813991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The neural attention mechanism plays an important role in many natural
language processing applications. In particular, the use of multi-head
attention extends single-head attention by allowing a model to jointly attend
to information from different perspectives. Without explicit constraints,
however, multi-head attention may suffer from attention collapse, an issue that
makes different heads extract similar attentive features, thus limiting the
model's representation power. In this paper, for the first time, we provide a
novel understanding of multi-head attention from a Bayesian perspective. Based
on the recently developed particle-optimization sampling techniques, we propose
a non-parametric approach that explicitly improves the repulsiveness in
multi-head attention and consequently strengthens the model's expressiveness.
Remarkably, our Bayesian interpretation provides theoretical insight into the
not-well-understood questions of why and how one uses multi-head attention.
Extensive experiments on various attention models and applications demonstrate
that the proposed repulsive attention can improve the learned feature
diversity, leading to more informative representations with consistent
performance improvement on various tasks.
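The particle-optimization idea behind the abstract can be sketched as a Stein variational gradient descent (SVGD) update, treating each attention head's parameters as one particle: a kernel-weighted gradient term pulls heads toward high posterior density, while the kernel's gradient acts as a repulsive force that keeps heads from collapsing onto each other. This is a minimal illustrative sketch, not the paper's implementation; the function names, the RBF kernel with fixed bandwidth `h`, and the toy log-posterior gradient are all assumptions.

```python
import numpy as np

def rbf_kernel(particles, h=1.0):
    """RBF kernel matrix and its gradient over a set of particles.

    particles: (n, d) array, one row per attention head's flattened parameters.
    Returns K (n, n) with K[i, j] = k(x_i, x_j), and dK (n, n, d) with
    dK[i, j] = grad_{x_i} k(x_i, x_j).
    """
    diff = particles[:, None, :] - particles[None, :, :]        # (n, n, d)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / h)                 # (n, n)
    dK = (-2.0 / h) * diff * K[:, :, None]                      # (n, n, d)
    return K, dK

def svgd_step(particles, grad_logp, step=0.1, h=1.0):
    """One SVGD update over head-parameter particles.

    phi(x_i) = (1/n) sum_j [ k(x_j, x_i) * grad_logp(x_j)   # driving term
                             + grad_{x_j} k(x_j, x_i) ]      # repulsive term
    The repulsive term is what discourages attention collapse.
    """
    n = particles.shape[0]
    K, dK = rbf_kernel(particles, h)
    drive = K.T @ grad_logp(particles) / n     # kernel-smoothed posterior gradient
    repulse = dK.sum(axis=0) / n               # sum over j of grad_{x_j} k(x_j, x_i)
    return particles + step * (drive + repulse)
```

Run on a toy standard-Gaussian posterior (`grad_logp = lambda x: -x`), the particles drift toward the mode while the repulsive term keeps them spread out, mimicking diverse heads.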
Related papers
- Bridging Generative and Discriminative Models for Unified Visual
Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z) - Improving Speech Emotion Recognition Through Focus and Calibration
Attention Mechanisms [0.5994412766684842]
We identify misalignments between the attention and the signal amplitude in the existing multi-head self-attention.
We propose to use a Focus-Attention (FA) mechanism and a Calibration-Attention (CA) mechanism in combination with the multi-head self-attention.
By employing the CA mechanism, the network can modulate the information flow by assigning different weights to each attention head and improve the utilization of surrounding contexts.
arXiv Detail & Related papers (2022-08-21T08:04:22Z) - Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z) - Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z) - Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art attention in accuracy, uncertainty estimation, generalization across domains, and adversarial attacks.
arXiv Detail & Related papers (2021-06-09T17:46:22Z) - Improve the Interpretability of Attention: A Fast, Accurate, and
Interpretable High-Resolution Attention Model [6.906621279967867]
We propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures the task-relevant human-interpretable information.
The proposed model can be easily adapted in a wide variety of modern deep models, where classification is involved.
It is also more accurate, faster, and with a smaller memory footprint than usual neural attention modules.
arXiv Detail & Related papers (2021-06-04T15:57:37Z) - How Far Does BERT Look At: Distance-based Clustering and Analysis of
BERT's Attention [20.191319097826266]
We cluster attention heatmaps into significantly different patterns through unsupervised clustering.
Our proposed features can be used to explain and calibrate different attention heads in Transformer models.
arXiv Detail & Related papers (2020-11-02T12:52:31Z) - Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem.
The proposed CAM detects the attention differences among multi-view, and adaptively integrates frame-level information to benefit each other.
Experiments on four action datasets illustrate the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.