Differentiable Subset Pruning of Transformer Heads
- URL: http://arxiv.org/abs/2108.04657v3
- Date: Thu, 27 Jul 2023 07:14:18 GMT
- Title: Differentiable Subset Pruning of Transformer Heads
- Authors: Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan
- Abstract summary: We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably to or better than previous works while offering precise control of the sparsity level.
- Score: 71.7904179689271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-head attention, a collection of several attention mechanisms that
independently attend to different parts of the input, is the key ingredient in
the Transformer. Recent work has shown, however, that a large proportion of the
heads in a Transformer's multi-head attention mechanism can be safely pruned
away without significantly harming the performance of the model; such pruning
leads to models that are noticeably smaller and faster in practice. Our work
introduces a new head pruning technique that we term differentiable subset
pruning. Intuitively, our method learns per-head importance variables and then
enforces a user-specified hard constraint on the number of unpruned heads. The
importance variables are learned via stochastic gradient descent. We conduct
experiments on natural language inference and machine translation; we show that
differentiable subset pruning performs comparably to or better than previous works
while offering precise control of the sparsity level.
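The abstract describes learning per-head importance variables by gradient descent while enforcing a hard constraint on the number of unpruned heads. As a minimal illustration of that idea (a sketch, not the authors' released implementation), the PyTorch snippet below gates each head through a relaxed, differentiable top-k selection over learned logits; the Gumbel-based estimator, the 144-head layout, and the choice of k = 36 are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relaxed_topk_gates(logits, k, tau=1.0, hard=False):
    # Add Gumbel noise, then run k rounds of softmax "sampling without
    # replacement"; the resulting soft gates sum to roughly k.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    scores = (logits + gumbel) / tau
    khot = torch.zeros_like(scores)
    onehot_approx = torch.zeros_like(scores)
    for _ in range(k):
        khot_mask = torch.clamp(1.0 - onehot_approx, min=1e-20)
        scores = scores + torch.log(khot_mask)
        onehot_approx = F.softmax(scores, dim=-1)
        khot = khot + onehot_approx
    if hard:
        # Straight-through: exactly k ones on the forward pass,
        # soft gradients on the backward pass.
        idx = torch.topk(khot, k).indices
        khot_hard = torch.zeros_like(khot).scatter_(-1, idx, 1.0)
        khot = khot_hard - khot.detach() + khot
    return khot

# Hypothetical setup: a 12-layer, 12-head model (144 heads in total),
# of which exactly k = 36 heads should survive pruning.
head_logits = nn.Parameter(torch.zeros(144))
gates = relaxed_topk_gates(head_logits, k=36, hard=True)
# In the forward pass each head's output is multiplied by its gate, and
# head_logits is updated by stochastic gradient descent.
```

Because exactly k of the 144 gates are selected, the user-specified number of unpruned heads is met by construction rather than tuned indirectly through a regularization weight.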
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression [19.64743851296488]
In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning.
We experimentally discover that the attention heads are utilized in different patterns across layers.
We demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms.
arXiv Detail & Related papers (2024-08-08T15:33:02Z)
- Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns across the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
- Multi-head or Single-head? An Empirical Comparison for Transformer Training [62.272657851060465]
Multi-head attention plays a crucial role in the recent success of Transformer models.
We show that jointly attending to multiple positions is not a unique feature of multi-head attention.
We show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer.
arXiv Detail & Related papers (2021-06-17T16:53:22Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross-attention achieves better accuracy with respect to source-target word alignment.
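A minimal sketch of the substitution described above, assuming standard scaled dot-product attention in PyTorch (any additional renormalization of the rectified weights used in the paper is omitted here):

```python
import torch
import torch.nn.functional as F

def rela_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.size(-1) ** 0.5
    weights = F.relu(scores)              # exact zeros instead of a softmax
    return torch.matmul(weights, v)
```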
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
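For illustration, one possible fixed, non-learnable attentive pattern is a head in which every position simply attends to the preceding token; the previous-token pattern sketched below is an assumption chosen for the example, not a reproduction of the paper's exact set of patterns.

```python
import torch

def previous_token_head(v):
    # v: (batch, seq, dim) value vectors; no query/key projections are needed.
    seq_len = v.size(1)
    attn = torch.zeros(seq_len, seq_len, device=v.device, dtype=v.dtype)
    attn[0, 0] = 1.0                       # first position attends to itself
    idx = torch.arange(1, seq_len, device=v.device)
    attn[idx, idx - 1] = 1.0               # position i attends to position i-1
    return torch.matmul(attn, v)
```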
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
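A hedged sketch of what decoupling the head size from the model width can look like in PyTorch; the class name and the particular head_dim value below are illustrative, with head_dim standing in for a maximum input sequence length rather than d_model // num_heads.

```python
import torch
import torch.nn as nn

class DecoupledHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, head_dim):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        # head_dim is a free hyperparameter, not d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * head_dim)
        self.k_proj = nn.Linear(d_model, num_heads * head_dim)
        self.v_proj = nn.Linear(d_model, num_heads * head_dim)
        self.out_proj = nn.Linear(num_heads * head_dim, d_model)

    def forward(self, x):
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out_proj(out)

# e.g. head_dim tied to a maximum sequence length of 128, independent of num_heads
mha = DecoupledHeadAttention(d_model=768, num_heads=12, head_dim=128)
```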
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.