Finding the Pillars of Strength for Multi-Head Attention
- URL: http://arxiv.org/abs/2305.14380v2
- Date: Sun, 15 Oct 2023 04:10:13 GMT
- Title: Finding the Pillars of Strength for Multi-Head Attention
- Authors: Jinjie Ni, Rui Mao, Zonglin Yang, Han Lei, Erik Cambria
- Abstract summary: Recent studies have revealed some issues of Multi-Head Attention (MHA), such as redundancy and over-parameterization.
We propose Grouped Head Attention, trained with a self-supervised group constraint that groups attention heads.
We additionally propose a Voting-to-Stay procedure to remove redundant heads, thus achieving a lighter-weight transformer.
- Score: 35.556186723898485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have revealed some issues of Multi-Head Attention (MHA), e.g.,
redundancy and over-parameterization. Specifically, the heads of MHA were
originally designed to attend to information from different representation
subspaces, whereas prior studies found that some attention heads likely learn
similar features and can be pruned without harming performance. Inspired by
minimum-redundancy feature selection, we assume that focusing on the most
representative and distinctive features with minimum resources can mitigate the
above issues and lead to more effective and efficient MHAs. In particular, we
propose Grouped Head Attention, trained with a self-supervised group constraint
that groups attention heads, where each group focuses on an essential but
distinctive feature subset. We additionally propose a Voting-to-Stay procedure
to remove redundant heads, thus achieving a lighter-weight transformer.
Moreover, our method achieves significant performance gains on three
well-established tasks while considerably compressing parameters.
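A minimal sketch of the idea in PyTorch-style code, assuming heads are assigned to groups by index, the group constraint is realized as a cosine-similarity loss (pull heads within a group together, push groups apart), and Voting-to-Stay keeps the highest-scoring head per group; all three choices are illustrative assumptions, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def group_constraint_loss(head_feats: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Illustrative group constraint: heads in the same group should be similar,
    heads in different groups dissimilar. head_feats: (num_heads, d) pooled
    per-head features for one batch; groups are assigned by head index."""
    num_heads = head_feats.size(0)
    feats = F.normalize(head_feats, dim=-1)
    sim = feats @ feats.T                                   # pairwise cosine similarity
    group_id = torch.arange(num_heads) // (num_heads // num_groups)
    same_group = group_id.unsqueeze(0) == group_id.unsqueeze(1)
    intra = sim[same_group].mean()                          # want high
    inter = sim[~same_group].mean()                         # want low
    return inter - intra                                    # add to the task loss

def voting_to_stay(head_scores: torch.Tensor, num_groups: int) -> list[int]:
    """Illustrative pruning rule: keep only the top-scoring 'pillar' head per group."""
    group_size = head_scores.numel() // num_groups
    return [g * group_size + int(head_scores[g * group_size:(g + 1) * group_size].argmax())
            for g in range(num_groups)]
```
During training the returned loss would be added to the task loss; after training, only the heads returned by voting_to_stay are retained, which is where the parameter savings would come from.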
Related papers
- MoH: Multi-Head Attention as Mixture-of-Head Attention [63.67734699877724]
We upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level.
We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts mechanism.
Among its advantages, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters.
arXiv Detail & Related papers (2024-10-15T17:59:44Z)
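A hedged sketch of the mixture-of-heads idea above, assuming a learned per-token router that activates only the top-k heads and mixes their outputs with the routing weights; the routing and combination details are assumptions rather than MoH's exact design.
```python
import torch
import torch.nn as nn

class TopKHeadRouter(nn.Module):
    """Illustrative token-level head routing: each token activates only k heads."""
    def __init__(self, d_model: int, num_heads: int, k: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_heads)
        self.k = k

    def forward(self, x: torch.Tensor, head_outputs: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); head_outputs: (batch, seq, num_heads, d_head)
        logits = self.gate(x)                                   # (B, S, H) routing logits
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        weights = torch.zeros_like(logits).scatter(-1, topk_idx,
                                                   topk_val.softmax(dim=-1))
        # weight each head's output; unselected heads get zero weight
        return (head_outputs * weights.unsqueeze(-1)).flatten(-2)  # (B, S, H * d_head)
```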
- Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention [42.92397219764559]
We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE).
We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms.
arXiv Detail & Related papers (2023-10-11T21:38:40Z)
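A rough sketch of the shared-projection idea above, assuming each head differs only by an additive head embedding applied before a single shared Q/K/V projection; the embedding placement, head dimensions, and head combination below are assumptions.
```python
import torch
import torch.nn as nn

class SharedProjectionMHA(nn.Module):
    """Illustrative MHE-style attention: one projection shared by all heads,
    with a per-head embedding making the heads differ."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)               # single shared projection
        self.head_emb = nn.Parameter(torch.randn(num_heads, d_model) * 0.02)
        self.num_heads = num_heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); add a distinct embedding per head, then share weights
        B, S, D = x.shape
        xh = x.unsqueeze(1) + self.head_emb.view(1, self.num_heads, 1, D)  # (B, H, S, D)
        q, k, v = self.qkv(xh).chunk(3, dim=-1)                            # shared weights
        attn = torch.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)
        out = attn @ v                                                     # (B, H, S, D)
        return out.mean(dim=1)                                             # combine heads
```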
- Rethinking Label Smoothing on Multi-hop Question Answering [87.68071401870283]
Multi-Hop Question Answering (MHQA) is a significant area in question answering.
In this work, we analyze the primary factors limiting the performance of multi-hop reasoning.
We propose a novel label smoothing technique, F1 Smoothing, which incorporates uncertainty into the learning process.
arXiv Detail & Related papers (2022-12-19T14:48:08Z)
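For reference, a minimal sketch of conventional label smoothing, which mixes the one-hot target with a uniform distribution; the summary does not give the F1 Smoothing formula, so the F1-based uncertainty weighting itself is not reproduced here.
```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits: torch.Tensor, target: torch.Tensor,
                           epsilon: float = 0.1) -> torch.Tensor:
    """Conventional label smoothing: keep some probability mass on non-target
    classes so the model retains uncertainty about the answer."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)   # standard CE term
    uniform = -log_probs.mean(dim=-1)                               # uncertainty term
    return ((1 - epsilon) * nll + epsilon * uniform).mean()
```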
- Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition [36.53453860656191]
We investigate approaches to increasing attention head diversity.
We show that, among the approaches investigated, introducing diversity-promoting auxiliary loss functions during training is the most effective.
Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.
arXiv Detail & Related papers (2022-09-13T15:50:03Z)
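A small sketch of what a diversity-promoting auxiliary loss can look like, assuming it penalizes pairwise cosine similarity between the heads' attention distributions; the loss actually used in the paper may differ.
```python
import torch
import torch.nn.functional as F

def head_diversity_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, num_heads, seq, seq) attention probabilities.
    Penalizes cosine similarity between every pair of heads' attention maps."""
    B, H, S, _ = attn.shape
    flat = F.normalize(attn.reshape(B, H, S * S), dim=-1)
    sim = flat @ flat.transpose(1, 2)                    # (B, H, H) pairwise similarity
    off_diag = sim - torch.eye(H, device=attn.device)    # ignore self-similarity
    return off_diag.mean()                               # add to the main training loss
```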
- Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding [35.958164594419515]
Pre-trained language models (PLM) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks.
As the core part of PLM, multi-head self-attention is appealing for its ability to jointly attend to information from different positions.
We propose two kinds of attention guiding methods, i.e., map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG).
arXiv Detail & Related papers (2022-04-06T16:22:02Z)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Using multiple parallel attention heads facilitates greater performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss function to prevent the model from assigning the same score to all heads, thereby identifying the more important heads and improving performance.
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
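A hedged sketch of dynamic head-importance computation, assuming the importance scores come from a small gate over the input and an entropy penalty discourages uniform scores; this is an illustrative reading of DHICM, not its exact formulation.
```python
import torch
import torch.nn as nn

class DynamicHeadImportance(nn.Module):
    """Illustrative DHICM-style gate: per-input head importance plus an
    anti-uniformity penalty to keep head scores from collapsing to equal values."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.score = nn.Linear(d_model, num_heads)

    def forward(self, x: torch.Tensor, head_outputs: torch.Tensor):
        # x: (batch, seq, d_model); head_outputs: (batch, seq, num_heads, d_head)
        probs = torch.softmax(self.score(x.mean(dim=1)), dim=-1)     # (batch, num_heads)
        weighted = head_outputs * probs.view(-1, 1, probs.size(-1), 1)
        # extra loss term: penalize high entropy so heads do not all get the same score
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
        return weighted.flatten(-2), entropy
```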
- Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference [68.12511526813991]
We provide a novel understanding of multi-head attention from a Bayesian perspective.
We propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention.
Experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity.
arXiv Detail & Related papers (2020-09-20T06:32:23Z)
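As a rough analogy for repulsiveness between heads, a sketch of an RBF-kernel repulsion penalty on the heads' flattened parameters; the paper derives its update from a Bayesian, particle-based view, so this penalty is only an illustration.
```python
import torch

def head_repulsion_penalty(head_params: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """head_params: (num_heads, p) flattened parameters of each head.
    RBF-kernel repulsion: large when heads sit close together in parameter space."""
    dists = torch.cdist(head_params, head_params) ** 2       # (H, H) squared distances
    kernel = torch.exp(-dists / (2 * bandwidth ** 2))
    off_diag = kernel - torch.diag(torch.diag(kernel))        # drop self-terms
    num_heads = head_params.size(0)
    return off_diag.sum() / (num_heads * (num_heads - 1))     # mean pairwise repulsion
```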
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
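To make the proposal concrete, a minimal sketch assuming the head's projection width is fixed to the sequence length n instead of d_model // num_heads, so the n x n score matrix is no longer rank-limited by the head size; the value projection and scaling below are assumptions.
```python
import torch
import torch.nn as nn

class FixedSizeHead(nn.Module):
    """One attention head whose Q/K projection width equals the sequence length,
    instead of d_model // num_heads, so its n x n score matrix is not rank-limited."""
    def __init__(self, d_model: int, seq_len: int):
        super().__init__()
        self.q = nn.Linear(d_model, seq_len, bias=False)
        self.k = nn.Linear(d_model, seq_len, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); scores are (batch, seq_len, seq_len)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.size(1) ** 0.5
        return torch.softmax(scores, dim=-1) @ self.v(x)
```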
This list is automatically generated from the titles and abstracts of the papers on this site.