Reorganizing attention-space geometry with expressive attention
- URL: http://arxiv.org/abs/2407.18601v2
- Date: Wed, 08 Jan 2025 09:30:47 GMT
- Title: Reorganizing attention-space geometry with expressive attention
- Authors: Claudius Gros
- Abstract summary: We study expressive attention (EA), which is based on $(\mathbf{Q}^T\mathbf{K})^2$, the squared dot product.
EA can be introduced into any attention-based code without additional compute costs or memory requirements.
- Score: 2.7195102129095003
- License:
- Abstract: Attention regulates information transfer between tokens. For this, query and key vectors are compared, typically in terms of a scalar product, $\mathbf{Q}^T\mathbf{K}$, together with a subsequent softmax normalization. In geometric terms, the standard dot-product attention (DPA) leads to large/small attention weights for parallel/antiparallel queries and keys. Here we study expressive attention (EA), which is based on $(\mathbf{Q}^T\mathbf{K})^2$, the squared dot product. In this case, attention is enhanced when query and key are either parallel or antiparallel, and suppressed for orthogonal configurations. EA can be introduced into any attention-based code without additional compute costs or memory requirements. For a series of autoregressive prediction tasks, we find that expressive attention performs at least as well as vanilla DPA. As task complexity increases, EA is observed to outperform DPA by increasing margins, which also holds for multi-task settings. For a given model size, EA achieves 100% performance for a range of complexity levels not accessible to DPA. Our results show that it is possible to reorganize the geometry of the matching condition in the space of attention heads without loss of performance.
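To make the geometric contrast concrete, here is a minimal NumPy sketch of DPA versus EA weights for a single head. It simply substitutes the squared score into the usual scaled-softmax pipeline; the $1/d_k$ scaling and the retention of the softmax for EA are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def dpa_weights(Q, K, d_k):
    """Standard dot-product attention (DPA): softmax over Q.K^T / sqrt(d_k)."""
    scores = (Q @ K.T) / np.sqrt(d_k)             # (n_q, n_k) dot products
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def ea_weights(Q, K, d_k):
    """Expressive attention (EA): the score is the squared dot product, so
    parallel and antiparallel query/key pairs both receive large weight."""
    scores = (Q @ K.T) ** 2 / d_k                 # assumed scaling, for illustration
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Toy example: one query, three keys (parallel, antiparallel, orthogonal).
q = np.array([[1.0, 0.0]])
K = np.array([[ 1.0, 0.0],   # parallel to q
              [-1.0, 0.0],   # antiparallel to q
              [ 0.0, 1.0]])  # orthogonal to q
print(dpa_weights(q, K, d_k=2))  # DPA: weight concentrates on the parallel key
print(ea_weights(q, K, d_k=2))   # EA: parallel and antiparallel keys share the weight
```

In the toy example, DPA favors only the parallel key, while EA assigns equal weight to the parallel and antiparallel keys and suppresses the orthogonal one, matching the geometric picture described in the abstract.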
Related papers
- ZETA: Leveraging Z-order Curves for Efficient Top-k Attention [22.90397380324185]
We propose ZETA to enable parallel querying of past tokens for entire sequences.
ZETA matches the performance of standard attention on the synthetic Multi-Query Associative Recall task.
arXiv Detail & Related papers (2025-01-24T15:33:05Z)
- SEA: Sparse Linear Attention with Estimated Attention Mask [51.22399593954608]
Long sequences pose a problem due to the quadratic complexity of the attention operation.
Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix.
We propose SEA: Sparse linear attention with an Estimated Attention mask.
arXiv Detail & Related papers (2023-10-03T03:56:26Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- High-quality Task Division for Large-scale Entity Alignment [28.001266850114643]
DivEA achieves higher EA performance than alternative state-of-the-art solutions.
We devise a counterpart discovery method that exploits the locality principle of the EA task and the power of trained EA models.
arXiv Detail & Related papers (2022-08-22T14:46:38Z)
- ClusterEA: Scalable Entity Alignment with Stochastic Training and Normalized Mini-batch Similarities [26.724014626196322]
ClusterEA is capable of scaling up EA models and enhancing their results by leveraging normalization methods on mini-batches.
It first trains a large-scale GNN for EA in a stochastic fashion to produce entity embeddings.
Based on the embeddings, a novel ClusterSampler strategy is proposed for sampling highly overlapped mini-batches.
arXiv Detail & Related papers (2022-05-20T17:29:50Z)
- cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce the complexity by approximating the softmax operator.
Due to the approximation errors, their performance varies across tasks and corpora and can suffer significant drops.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
arXiv Detail & Related papers (2022-02-17T17:53:48Z)
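As a concrete illustration of the kernel-approximation idea mentioned in the cosFormer entry above, the sketch below shows generic kernelized linear attention: a non-negative feature map replaces the softmax so that keys and values can be aggregated once, giving $O(n)$ instead of $O(n^2)$ cost. The feature map used here ($\mathrm{elu}(x)+1$) is a common generic choice assumed for illustration; cosFormer's actual ReLU-plus-cosine reweighting scheme differs.

```python
import numpy as np

def feature_map(x):
    # Generic non-negative feature map, elu(x) + 1 (an assumption for
    # illustration; cosFormer itself uses ReLU with a cosine-based reweighting).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1), computable in O(n)."""
    Qf, Kf = feature_map(Q), feature_map(K)   # (n, d) feature-mapped queries/keys
    KV = Kf.T @ V                             # (d, d_v): single pass over the keys
    Z = Qf @ Kf.sum(axis=0)                   # (n,): per-query normalization
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (6, 4), without forming the n x n matrix
```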
- ABC: Attention with Bounded-memory Control [67.40631793251997]
We show that a broad family of efficient attention methods can be subsumed into one abstraction, attention with bounded-memory control (ABC).
ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem unrelated.
Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their memory-organizing functions with a learned, contextualized one.
arXiv Detail & Related papers (2021-10-06T03:53:25Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
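To give a sense of what "$O(n)$ connections per attention layer" can look like in the sparse-Transformer entry above, the sketch below builds a sliding-window attention mask with a fixed window $w$, so each token attends to at most $2w+1$ neighbours and the layer has $O(n)$ allowed query-key pairs. The window pattern is a generic example chosen for illustration; the sparsity patterns analyzed in that paper are its own specific constructions.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean (n, n) mask: token i may attend to token j iff |i - j| <= w.
    Each row has at most 2*w + 1 True entries, i.e. O(n) connections per layer."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(n=8, w=2)
print(mask.sum())  # 34 allowed connections for n=8, w=2, versus 64 for dense attention
# In an attention layer, scores at masked-out positions would be set to -inf
# before the softmax, so only the O(n) allowed pairs exchange information.
```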
This list is automatically generated from the titles and abstracts of the papers on this site.