Multiformer: A Head-Configurable Transformer-Based Model for Direct
Speech Translation
- URL: http://arxiv.org/abs/2205.07100v1
- Date: Sat, 14 May 2022 17:37:47 GMT
- Title: Multiformer: A Head-Configurable Transformer-Based Model for Direct
Speech Translation
- Authors: Gerard Sant, Gerard I. Gállego, Belen Alastruey, Marta R. Costa-Jussà
- Abstract summary: Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have been achieving state-of-the-art
results in several fields of Natural Language Processing. However, their
direct application to speech tasks is not trivial. The nature of these
sequences carries problems such as long sequence lengths and redundancy
between adjacent tokens. Therefore, we believe that the regular
self-attention mechanism might not be well suited for them.
Different approaches have been proposed to overcome these problems, such as
the use of efficient attention mechanisms. However, these methods usually
come at a cost: a performance reduction caused by information loss. In this
study, we present the Multiformer, a Transformer-based model that allows a
different attention mechanism to be used on each head. By doing this, the
model biases the self-attention towards the extraction of more diverse token
interactions, and the information loss is reduced. Finally, we perform an
analysis of the head contributions, and we observe that architectures in
which the relevance of all heads is uniformly distributed obtain better
results. Our results show that mixing attention patterns across heads and
layers outperforms our baseline by up to 0.7 BLEU.
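To make the idea concrete, below is a minimal PyTorch sketch (an illustration under stated assumptions, not the authors' implementation) of a self-attention layer whose heads are assigned different attention patterns. It mixes standard global attention with banded local attention; the `head_patterns` and `local_window` arguments are hypothetical names introduced here for illustration.

```python
import torch
import torch.nn as nn


class MixedHeadSelfAttention(nn.Module):
    """Self-attention in which each head can use a different attention pattern."""

    def __init__(self, d_model, n_heads, head_patterns, local_window=16):
        super().__init__()
        assert d_model % n_heads == 0 and len(head_patterns) == n_heads
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.head_patterns = head_patterns  # e.g. ["global", "local", ...] (assumption)
        self.local_window = local_window
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, h, n, n)

        # Per-head masks: "local" heads attend only within a fixed window,
        # while "global" heads keep standard full self-attention.
        idx = torch.arange(n, device=x.device)
        outside_band = (idx[None, :] - idx[:, None]).abs() > self.local_window
        mask = torch.zeros(self.n_heads, n, n, dtype=torch.bool, device=x.device)
        for h, pattern in enumerate(self.head_patterns):
            if pattern == "local":
                mask[h] = outside_band
        scores = scores.masked_fill(mask, float("-inf"))

        attn = scores.softmax(dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(ctx)


if __name__ == "__main__":
    layer = MixedHeadSelfAttention(d_model=256, n_heads=8,
                                   head_patterns=["global"] * 4 + ["local"] * 4)
    speech_features = torch.randn(2, 500, 256)  # long sequences, as in speech
    print(layer(speech_features).shape)         # torch.Size([2, 500, 256])
```

In this sketch the pattern assignment is fixed per layer; different mixes of head patterns across layers could be configured the same way.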
Related papers
- Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers [16.26331213222281]
We investigate how architectural design choices influence the space of solutions that a transformer can implement and learn.
We characterize two different counting strategies that small transformers can implement theoretically.
Our findings highlight that even in simple settings, slight variations in model design can cause significant changes to the solutions a transformer learns.
arXiv Detail & Related papers (2024-07-16T09:48:10Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
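A rough sketch of the probing idea described in the entry above (an assumption-laden illustration, not the paper's exact PAPA procedure): the input-dependent attention weights softmax(QK^T / sqrt(d)) are replaced by a constant matrix, here simple uniform averaging, while the rest of the head is left unchanged.

```python
import torch
import torch.nn as nn


class ConstantAttentionHead(nn.Module):
    """One attention head whose attention matrix is constant (input-independent)."""

    def __init__(self, d_model, d_head):
        super().__init__()
        self.v_proj = nn.Linear(d_model, d_head)  # value projection is kept
        self.q_proj = nn.Linear(d_model, d_head)  # kept only to mirror a standard head;
        self.k_proj = nn.Linear(d_model, d_head)  # unused once attention is constant

    def forward(self, x):  # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        # Constant attention: every position attends uniformly to all positions.
        attn = torch.full((n, n), 1.0 / n, device=x.device)
        return attn @ self.v_proj(x)  # (batch, seq_len, d_head)


x = torch.randn(2, 50, 128)
print(ConstantAttentionHead(128, 32)(x).shape)  # torch.Size([2, 50, 32])
```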
- ER: Equivariance Regularizer for Knowledge Graph Completion [107.51609402963072]
We propose a new regularizer, namely, the Equivariance Regularizer (ER).
ER can enhance the generalization ability of the model by employing the semantic equivariance between the head and tail entities.
The experimental results indicate a clear and substantial improvement over the state-of-the-art relation prediction methods.
arXiv Detail & Related papers (2022-06-24T08:18:05Z)
- Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition? [36.67937514793215]
Cross-modal attention is seen as an effective mechanism for multi-modal fusion.
We implement and compare a cross-attention and a self-attention model.
We compare the models using different modality combinations for a 7-class emotion classification task.
arXiv Detail & Related papers (2022-02-18T15:44:14Z)
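A minimal sketch of the two fusion strategies being compared in the entry above (assumed setup and dimensions, not the paper's models): cross-attention queries one modality with the other, while self-attention runs over the concatenation of both modalities.

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

speech = torch.randn(2, 100, d_model)  # e.g. acoustic frames
text = torch.randn(2, 20, d_model)     # e.g. token embeddings

# Cross-attention: text tokens attend to speech frames (queries vs. keys/values).
cross_out, _ = attn(query=text, key=speech, value=speech)

# Self-attention: one sequence formed by concatenating both modalities.
joint = torch.cat([speech, text], dim=1)
self_out, _ = attn(query=joint, key=joint, value=joint)

print(cross_out.shape, self_out.shape)  # (2, 20, 128) (2, 120, 128)
```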
- Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions [0.0]
We focus on two forms of attention mechanisms: attention modules and self-attention.
Attention modules are used to reweight the features of each layer's input tensor.
Self-Attention, originally proposed in the area of Natural Language Processing, makes it possible to relate all the items in an input sequence.
arXiv Detail & Related papers (2021-12-23T18:02:48Z)
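A rough sketch of a feature-reweighting attention module of the kind mentioned in the entry above (a generic squeeze-and-excitation-style gate, assumed here for illustration rather than taken from the paper): one learned weight per channel scales the input tensor's features.

```python
import torch
import torch.nn as nn


class ChannelReweighting(nn.Module):
    """Attention module that reweights the channels of a layer's input tensor."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze spatial dims
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # per-channel weights in (0, 1)
        )

    def forward(self, x):          # x: (batch, channels, H, W)
        return x * self.gate(x)    # reweighted features


lesion_features = torch.randn(2, 64, 56, 56)
print(ChannelReweighting(64)(lesion_features).shape)  # torch.Size([2, 64, 56, 56])
```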
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- Cascaded Head-colliding Attention [28.293881246428377]
Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks.
We present cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution.
arXiv Detail & Related papers (2021-05-31T10:06:42Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
arXiv Detail & Related papers (2020-04-23T14:58:22Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independently of the number of heads, resulting in multi-head attention layers with provably more expressive power.
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
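A small numerical illustration of the bottleneck described in the entry above (illustration only, not the paper's construction): the attention logit matrix (XW_Q)(XW_K)^T of a single head has rank at most the head dimension, so with head size smaller than the sequence length the head cannot realize arbitrary attention patterns.

```python
import torch

torch.manual_seed(0)
n, d_model, d_head = 64, 512, 8   # d_head = d_model / n_heads in a standard layer
x = torch.randn(n, d_model)
w_q = torch.randn(d_model, d_head)
w_k = torch.randn(d_model, d_head)

logits = (x @ w_q) @ (x @ w_k).T          # (n, n) attention logits of one head
print(torch.linalg.matrix_rank(logits))   # at most d_head = 8, despite n = 64
```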