Cascaded Head-colliding Attention
- URL: http://arxiv.org/abs/2105.14850v1
- Date: Mon, 31 May 2021 10:06:42 GMT
- Title: Cascaded Head-colliding Attention
- Authors: Lin Zheng, Zhiyong Wu, Lingpeng Kong
- Abstract summary: Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks.
We present cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution.
- Score: 28.293881246428377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. At the cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of the sequence. Despite its massive success, the current framework ignores interactions among different heads, leading to the problem that many of the heads are redundant in practice, which greatly wastes the capacity of the model. To improve parameter efficiency, we re-formulate the MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA), which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline by $0.6$ perplexity on \texttt{Wikitext-103} in language modeling and by $0.6$ BLEU on \texttt{WMT14 EN-DE} in machine translation, owing to its improved parameter efficiency.\footnote{Our implementation is publicly available at \url{https://github.com/LZhengisme/CODA}.}
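As background for the head-redundancy problem the abstract describes, the following is a minimal NumPy sketch of standard multi-head attention (illustrative only, not the paper's CODA model): each head computes its attention weights independently, and the heads interact only through the final output projection, which is exactly the lack of cross-head interaction CODA targets.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Standard MHA: each head attends independently; head outputs are
    concatenated and mixed only by the final projection Wo."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project and split into heads: (n_heads, seq_len, d_head)
    q = (x @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    attn = softmax(scores, axis=-1)  # computed per head, no cross-head term
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 8, 2, 4
x = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *Ws, n_heads)
print(y.shape)  # (4, 8)
```

Because `attn` is computed per head with no term coupling the heads, nothing in the forward pass discourages two heads from learning the same pattern; CODA's hierarchical variational distribution is proposed to model exactly those missing interactions.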
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of its parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
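The cross-layer interaction described above can be sketched as follows; this is a hypothetical minimal version (not the paper's actual implementation), assuming keys and values are simply stacked across the current and preceding layers' hidden states before a single attention pass.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def skip_layer_attention(q, h_cur, h_prev, Wk, Wv):
    """Queries attend over keys/values built from BOTH the current
    layer's hidden states and the preceding layer's hidden states."""
    h = np.concatenate([h_cur, h_prev], axis=0)  # (2*seq, d): stack both layers
    k, v = h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (seq, 2*seq)
    return softmax(scores, axis=-1) @ v          # (seq, d)

rng = np.random.default_rng(1)
seq, d = 4, 8
q = rng.standard_normal((seq, d))
h_cur = rng.standard_normal((seq, d))
h_prev = rng.standard_normal((seq, d))
Wk = rng.standard_normal((d, d)) * 0.1
Wv = rng.standard_normal((d, d)) * 0.1
out = skip_layer_attention(q, h_cur, h_prev, Wk, Wv)
print(out.shape)  # (4, 8)
```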
arXiv Detail & Related papers (2024-06-17T07:24:38Z) - Improving Transformers with Dynamically Composable Multi-Head Attention [0.4999814847776097]
Multi-Head Attention (MHA) is a key component of the Transformer.
We propose Dynamically Composable Multi-Head Attention (DCMHA) as a parameter- and computation-efficient attention architecture.
DCMHA can be used as a drop-in replacement for MHA in any transformer architecture to obtain the corresponding DCFormer.
arXiv Detail & Related papers (2024-05-14T12:41:11Z) - Probabilistic Topic Modelling with Transformer Representations [0.9999629695552195]
We propose the Transformer-Representation Neural Topic Model (TNTM)
This approach unifies the powerful and versatile notion of topics based on transformer embeddings with fully probabilistic modelling.
Experimental results show that our proposed model achieves results on par with various state-of-the-art approaches in terms of embedding coherence.
arXiv Detail & Related papers (2024-03-06T14:27:29Z) - Mixture of Attention Heads: Selecting Attention Heads Per Token [40.04159325505842]
Mixture of Attention Heads (MoA) is a new architecture that combines multi-head attention with the MoE mechanism.
MoA achieves stronger performance than the standard multi-head attention layer.
MoA also automatically differentiates heads' utilities, providing a new perspective to discuss the model's interpretability.
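The per-token head selection above resembles mixture-of-experts routing; here is a hedged, illustrative sketch (names and shapes are assumptions, not the paper's code) in which a gating network scores all heads for each token, keeps the top-k, and mixes only the selected heads' outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moa_route(x, head_outputs, Wg, k=2):
    """Per token, a gating network scores every head and keeps the top-k;
    the token's output is the gate-weighted sum of the selected heads."""
    seq_len = x.shape[0]
    gates = softmax(x @ Wg, axis=-1)           # (seq, n_heads) routing scores
    topk = np.argsort(gates, axis=-1)[:, -k:]  # indices of top-k heads per token
    out = np.zeros_like(x)
    for t in range(seq_len):
        sel = topk[t]
        w = gates[t, sel] / gates[t, sel].sum()  # renormalize selected gates
        out[t] = (w[:, None] * head_outputs[sel, t]).sum(axis=0)
    return out

rng = np.random.default_rng(2)
seq, d, n_heads = 4, 8, 3
x = rng.standard_normal((seq, d))
head_outputs = rng.standard_normal((n_heads, seq, d))  # precomputed head outputs
Wg = rng.standard_normal((d, n_heads)) * 0.1
routed = moa_route(x, head_outputs, Wg, k=2)
print(routed.shape)  # (4, 8)
```

Because different tokens select different heads, the gate values themselves indicate which heads a token found useful, which is one way such routing can expose head utility for interpretability.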
arXiv Detail & Related papers (2022-10-11T04:54:05Z) - Learning Multiscale Transformer Models for Sequence Generation [33.73729074207944]
We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge.
Notably, it yielded consistent performance gains over the strong baseline on several test sets without sacrificing efficiency.
arXiv Detail & Related papers (2022-06-19T07:28:54Z) - Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z) - VisBERT: Hidden-State Visualizations for Transformers [66.86452388524886]
We present VisBERT, a tool for visualizing the contextual token representations within BERT for the task of (multi-hop) Question Answering.
VisBERT enables users to get insights about the model's internal state and to explore its inference steps or potential shortcomings.
arXiv Detail & Related papers (2020-11-09T15:37:43Z) - ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
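To illustrate the dynamic-convolution idea in the summary above, here is a simplified sketch (an assumption-laden toy, not ConvBERT's actual span-based formulation): each token predicts its own convolution kernel, which then mixes a small local window of neighboring states, so the head models local dependencies without a full seq-by-seq attention matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_conv(x, Wk, kernel_size=3):
    """Dynamic convolution head: each token generates its own kernel
    (a softmax over kernel_size weights) and mixes a local window."""
    seq_len, d = x.shape
    kernels = softmax(x @ Wk, axis=-1)  # (seq, kernel_size), per-token kernels
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the sequence ends
    out = np.zeros_like(x)
    for t in range(seq_len):
        window = xp[t:t + kernel_size]  # (kernel_size, d) local span around t
        out[t] = kernels[t] @ window    # kernel-weighted mix of the window
    return out

rng = np.random.default_rng(3)
seq, d, kernel_size = 6, 8, 3
x = rng.standard_normal((seq, d))
Wk = rng.standard_normal((d, kernel_size)) * 0.1
y = dynamic_conv(x, Wk, kernel_size)
print(y.shape)  # (6, 8)
```

The cost here scales with `seq_len * kernel_size` rather than `seq_len**2`, which is the efficiency argument for replacing some global attention heads with local convolution heads.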
arXiv Detail & Related papers (2020-08-06T07:43:19Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.