Cascaded Head-colliding Attention
- URL: http://arxiv.org/abs/2105.14850v1
- Date: Mon, 31 May 2021 10:06:42 GMT
- Title: Cascaded Head-colliding Attention
- Authors: Lin Zheng, Zhiyong Wu, Lingpeng Kong
- Abstract summary: Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks.
We present cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution.
- Score: 28.293881246428377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have advanced the field of natural language processing (NLP) on
a variety of important tasks. At the cornerstone of the Transformer
architecture is the multi-head attention (MHA) mechanism which models pairwise
interactions between the elements of the sequence. Despite its massive success,
the current framework ignores interactions among different heads, leading to
the problem that many of the heads are redundant in practice, which greatly
wastes the capacity of the model. To improve parameter efficiency, we
re-formulate the MHA as a latent variable model from a probabilistic
perspective. We present cascaded head-colliding attention (CODA) which
explicitly models the interactions between attention heads through a
hierarchical variational distribution. We conduct extensive experiments and
demonstrate that CODA outperforms the transformer baseline, by $0.6$ perplexity
on \texttt{Wikitext-103} in language modeling, and by $0.6$ BLEU on
\texttt{WMT14 EN-DE} in machine translation, due to its improvements in
parameter efficiency.\footnote{Our implementation is publicly available at
\url{https://github.com/LZhengisme/CODA}.}
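To make the contrast in the abstract concrete: vanilla MHA computes every head independently, whereas CODA lets heads interact. The sketch below is a minimal, self-contained illustration of one simple way heads can interact, namely a learned head-mixing matrix applied to the per-head attention maps, initialised to the identity so it starts out as plain MHA. It is not the authors' hierarchical variational CODA model (see the linked repository for that); the class name and the mixing mechanism are illustrative assumptions only.

```python
# Minimal sketch, NOT the CODA implementation: vanilla multi-head attention
# plus one illustrative head-interaction step (a learned h x h mixing matrix
# over the attention maps). The variational treatment in the paper is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeadInteractingAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Hypothetical head interaction: mix the h attention maps with an
        # h x h matrix, initialised to identity (i.e., standard MHA at init).
        self.head_mix = nn.Parameter(torch.eye(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape

        def split(z):  # -> (batch, heads, seq_len, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (b, h, t, t)
        attn = F.softmax(scores, dim=-1)
        # Head interaction: each head's attention map becomes a learned
        # combination of all heads' maps.
        attn = torch.einsum("gh,bhts->bgts", self.head_mix, attn)
        out = attn @ v                                           # (b, h, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.out_proj(out)


if __name__ == "__main__":
    layer = HeadInteractingAttention(d_model=64, n_heads=4)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```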
Related papers
- Mixture of Attention Yields Accurate Results for Tabular Data [21.410818837489973]
We propose MAYA, an encoder-decoder transformer-based framework.
In the encoder, we design a Mixture of Attention (MOA) that constructs multiple parallel attention branches.
We employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations.
arXiv Detail & Related papers (2025-02-18T03:43:42Z) - Tensor Product Attention Is All You Need [54.40495407154611]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly.
TPA achieves improved model quality alongside memory efficiency.
We introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
arXiv Detail & Related papers (2025-01-11T03:37:10Z) - Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., images, audio, and video).
We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting this decoupled learning approach and fully exploiting the complementarity across features, our method achieves both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads [10.169639612525643]
We propose a new multi-head self-attention (MHSA) variant named Fibottention, which can replace MHSA in Transformer architectures.
Fibottention is data-efficient and computationally more suitable for processing large numbers of tokens than the standard MHSA.
It employs structured sparse attention based on dilated Fibonacci sequences which, uniquely, differ across attention heads, resulting in inception-like diverse features across heads; a rough sketch of this per-head sparsity idea appears after this list.
arXiv Detail & Related papers (2024-06-27T17:59:40Z) - Improving Transformers with Dynamically Composable Multi-Head Attention [0.4999814847776097]
Multi-Head Attention (MHA) is a key component of Transformer.
We propose Dynamically Composable Multi-Head Attention (DCMHA) as a parameter and computation efficient attention architecture.
DCMHA can be used as a drop-in replacement for MHA in any transformer architecture to obtain the corresponding DCFormer.
arXiv Detail & Related papers (2024-05-14T12:41:11Z) - Mixture of Attention Heads: Selecting Attention Heads Per Token [40.04159325505842]
Mixture of Attention Heads (MoA) is a new architecture that combines multi-head attention with the MoE mechanism.
MoA achieves stronger performance than the standard multi-head attention layer.
MoA also automatically differentiates heads' utilities, providing a new perspective to discuss the model's interpretability.
arXiv Detail & Related papers (2022-10-11T04:54:05Z) - Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z) - ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
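The Fibottention entry above describes structured sparse attention whose dilated Fibonacci patterns differ across heads. The snippet below is a rough, hypothetical sketch of that idea based only on the one-line summary: the per-head dilation rule, function names, and parameters are my own assumptions, not the paper's actual construction.

```python
# Hypothetical sketch of per-head sparse attention masks built from dilated
# Fibonacci offsets (an illustrative reading of the Fibottention summary,
# not the paper's exact construction).
import torch


def fibonacci(n: int) -> list[int]:
    """First n Fibonacci numbers starting from 1, 2."""
    fib = [1, 2]
    while len(fib) < n:
        fib.append(fib[-1] + fib[-2])
    return fib[:n]


def per_head_sparse_masks(seq_len: int, n_heads: int, n_fib: int = 12) -> torch.Tensor:
    """Boolean attention masks of shape (n_heads, seq_len, seq_len).

    Head h may attend from position i to position j whenever |i - j| equals a
    head-dilated Fibonacci number, plus the diagonal. The per-head dilation
    factor (h + 1) is an illustrative choice only.
    """
    masks = torch.zeros(n_heads, seq_len, seq_len, dtype=torch.bool)
    fib = fibonacci(n_fib)
    for h in range(n_heads):
        offsets = {0} | {f * (h + 1) for f in fib}  # head-specific offsets
        for i in range(seq_len):
            for j in range(seq_len):
                if abs(i - j) in offsets:
                    masks[h, i, j] = True
    return masks


if __name__ == "__main__":
    m = per_head_sparse_masks(seq_len=16, n_heads=4)
    print(m.shape, m.float().mean(dim=(1, 2)))  # sparsity differs per head
```

In practice, masks like these would be applied to the attention logits before the softmax (disallowed positions set to negative infinity), giving each head its own sparse pattern.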
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.