The Quarks of Attention
- URL: http://arxiv.org/abs/2202.08371v1
- Date: Tue, 15 Feb 2022 18:47:19 GMT
- Title: The Quarks of Attention
- Authors: Pierre Baldi and Roman Vershynin
- Abstract summary: In deep learning, attention-based neural architectures are widely used to tackle problems in natural language processing and beyond.
We classify all possible fundamental building blocks of attention in terms of their source, target, and computational mechanism.
We identify and study the three most important mechanisms: additive activation attention, multiplicative output attention (output gating), and multiplicative synaptic attention (synaptic gating).
- Score: 11.315881995916428
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Attention plays a fundamental role in both natural and artificial
intelligence systems. In deep learning, attention-based neural architectures,
such as transformer architectures, are widely used to tackle problems in
natural language processing and beyond. Here we investigate the fundamental
building blocks of attention and their computational properties. Within the
standard model of deep learning, we classify all possible fundamental building
blocks of attention in terms of their source, target, and computational
mechanism. We identify and study the three most important mechanisms: additive
activation attention, multiplicative output attention (output gating), and
multiplicative synaptic attention (synaptic gating). The gating mechanisms
correspond to multiplicative extensions of the standard model and are used
across all current attention-based deep learning architectures. We study their
functional properties and estimate the capacity of several attentional building
blocks in the case of linear and polynomial threshold gates. Surprisingly,
additive activation attention plays a central role in the proofs of the lower
bounds. Attention mechanisms reduce the depth of certain basic circuits and
leverage the power of quadratic activations without incurring their full cost.
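The three building blocks can be sketched as follows. This is an illustrative NumPy sketch, not the paper's formal construction: the dimensions, random weights, and the tanh nonlinearity are stand-ins for the paper's linear and polynomial threshold-gate setting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input activations (made-up size)
W = rng.standard_normal((3, 4))   # synaptic weights of a target layer
a = rng.standard_normal(3)        # attention signal from a source unit

# 1) Additive activation attention: the attention signal is added to the
#    target unit's pre-activation before the nonlinearity.
additive = np.tanh(W @ x + a)

# 2) Multiplicative output attention (output gating): the attention signal
#    multiplies the target unit's output.
output_gated = a * np.tanh(W @ x)

# 3) Multiplicative synaptic attention (synaptic gating): the attention
#    signal multiplies individual synaptic weights before they are used.
g = rng.standard_normal((3, 4))   # one gate per synapse
synaptic_gated = np.tanh((g * W) @ x)
```

The two gating variants are the multiplicative extensions of the standard model that the abstract refers to; both appear, in various forms, in current attention-based architectures.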
Related papers
- Hadamard product in deep learning: Introduction, Advances and Challenges [68.26011575333268]
This survey examines a fundamental yet understudied primitive: the Hadamard product.
Despite its widespread implementation across various applications, the Hadamard product has not been systematically analyzed as a core architectural primitive.
We present the first comprehensive taxonomy of its applications in deep learning, identifying four principal domains: higher-order correlation, multimodal data fusion, dynamic representation modulation, and efficient pairwise operations.
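Two of the four domains can be illustrated in a few lines; the arrays and the sigmoid gate below are hypothetical examples, not taken from the survey.

```python
import numpy as np

# Element-wise (Hadamard) product of two equally shaped arrays.
h = np.array([1.0, 2.0, 3.0])
gate = np.array([0.5, 0.0, 1.0])   # e.g. the output of a learned sigmoid gate
modulated = h * gate               # dynamic representation modulation

# The same primitive fuses two modality embeddings multiplicatively,
# capturing pairwise (second-order) interactions in a single operation.
text_emb = np.array([0.2, -1.0, 0.4])
image_emb = np.array([1.0, 0.5, -0.5])
fused = text_emb * image_emb       # multimodal fusion via Hadamard product
```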
arXiv Detail & Related papers (2025-04-17T17:26:29Z) - Towards understanding how attention mechanism works in deep learning [8.79364699260219]
We study the process of computing similarity using classic metrics and vector space properties in manifold learning, clustering, and supervised learning.
We decompose the self-attention mechanism into a learnable pseudo-metric function and an information propagation process based on similarity computation.
We propose a modified attention mechanism called metric-attention by leveraging the concept of metric learning to facilitate the ability to learn desired metrics more effectively.
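The decomposition can be seen in standard scaled dot-product self-attention, sketched below with made-up shapes; the scaled dot product in stage 1 is what the paper would replace with a learnable pseudo-metric.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n, d = 5, 8                        # sequence length, head dim (illustrative)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Stage 1: similarity computation -- here a scaled dot product.
sim = Q @ K.T / np.sqrt(d)

# Stage 2: information propagation -- each output is a convex combination
# of the value rows, weighted by the normalized similarities.
out = softmax(sim) @ V
```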
arXiv Detail & Related papers (2024-12-24T08:52:06Z) - A Primal-Dual Framework for Transformers and Neural Networks [52.814467832108875]
Self-attention is key to the remarkable success of transformers in sequence modeling tasks.
We show that self-attention corresponds to the support vector expansion derived from a support vector regression problem.
We propose two new attention mechanisms: Batch Normalized Attention (Attention-BN) and Attention with Scaled Head (Attention-SH).
arXiv Detail & Related papers (2024-06-19T19:11:22Z) - Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis [2.1605931466490795]
We show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space.
We propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination.
arXiv Detail & Related papers (2024-06-19T18:22:32Z) - Continuum Attention for Neural Operators [6.425471760071227]
We study transformers in the function space setting.
We prove that the attention mechanism as implemented in practice is a Monte Carlo or finite difference approximation of this operator.
For this reason we also introduce a function space generalization of the patching strategy from computer vision, and introduce a class of associated neural operators.
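As a toy illustration of the approximation claim (not the paper's construction), the attention-style weighted sum over sampled points can be read as a Monte Carlo estimate of a kernel integral operator (Tu)(x) = ∫ k(x, y) u(y) dy; the Gaussian kernel and sine input below are arbitrary choices.

```python
import numpy as np

def kernel(x, y):
    return np.exp(-(x - y) ** 2)       # illustrative similarity kernel

def u(y):
    return np.sin(2 * np.pi * y)       # illustrative input function

rng = np.random.default_rng(2)

def monte_carlo_operator(x, n):
    # Average k(x, y_i) * u(y_i) over n points sampled uniformly on [0, 1];
    # as n grows, this discrete sum approaches the integral operator.
    ys = rng.uniform(0.0, 1.0, size=n)
    return kernel(x, ys) @ u(ys) / n

coarse = monte_carlo_operator(0.3, 100)
fine = monte_carlo_operator(0.3, 100_000)
```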
arXiv Detail & Related papers (2024-06-10T17:25:46Z) - Binding Dynamics in Rotating Features [72.80071820194273]
We propose an alternative "cosine binding" mechanism, which explicitly computes the alignment between features and adjusts weights accordingly.
This allows us to draw direct connections to self-attention and biological neural processes, and to shed light on the fundamental dynamics for object-centric representations to emerge in Rotating Features.
arXiv Detail & Related papers (2024-02-08T12:31:08Z) - Attention: Marginal Probability is All You Need? [0.0]
We propose an alternative Bayesian foundation for attentional mechanisms.
We show how this unifies different attentional architectures in machine learning.
We hope this work will guide more sophisticated intuitions into the key properties of attention architectures.
arXiv Detail & Related papers (2023-04-07T14:38:39Z) - Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z) - Visual Attention Methods in Deep Learning: An In-Depth Survey [37.18104595529633]
Inspired by the human cognitive system, attention is a mechanism that imitates human cognitive awareness of specific information.
Deep learning has employed attention to boost performance for many applications.
The literature lacks a comprehensive survey on attention techniques to guide researchers in employing attention in their deep models.
arXiv Detail & Related papers (2022-04-16T08:57:00Z) - Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions [0.0]
We focus on two forms of attention mechanisms: attention modules and self-attention.
Attention modules are used to reweight the features of each layer input tensor.
Self-Attention, originally proposed in the area of Natural Language Processing, makes it possible to relate all the items in an input sequence.
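A minimal sketch of the first form, channel-wise feature reweighting in the style of squeeze-and-excitation (random weights stand in for learned ones; this is not the paper's exact module):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
c, h, w = 4, 6, 6                        # channels, height, width (made up)
features = rng.standard_normal((c, h, w))

# Squeeze each channel to a scalar, pass the summary through a small
# transform, and use the result to reweight the channels of the input.
squeezed = features.mean(axis=(1, 2))    # (c,)
W = rng.standard_normal((c, c))          # stand-in for learned weights
weights = sigmoid(W @ squeezed)          # per-channel attention in (0, 1)
reweighted = weights[:, None, None] * features
```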
arXiv Detail & Related papers (2021-12-23T18:02:48Z) - Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z) - Deep Reinforced Attention Learning for Quality-Aware Visual Recognition [73.15276998621582]
We build upon the weakly-supervised generation mechanism of intermediate attention maps in any convolutional neural networks.
We introduce a meta critic network to evaluate the quality of attention maps in the main network.
arXiv Detail & Related papers (2020-07-13T02:44:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.