A Primal-Dual Framework for Transformers and Neural Networks
- URL: http://arxiv.org/abs/2406.13781v1
- Date: Wed, 19 Jun 2024 19:11:22 GMT
- Title: A Primal-Dual Framework for Transformers and Neural Networks
- Authors: Tan M. Nguyen, Tam Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher
- Abstract summary: Self-attention is key to the remarkable success of transformers in sequence modeling tasks.
We show that self-attention corresponds to the support vector expansion derived from a support vector regression problem.
We propose two new attention mechanisms: Batch Normalized Attention (Attention-BN) and Attention with Scaled Head (Attention-SH).
- Score: 52.814467832108875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.
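The full primal-dual construction requires the SVR derivation in the paper, but a minimal PyTorch sketch can illustrate how the two proposed variants might differ from standard softmax attention. Here Attention-BN is approximated by normalizing keys and values over the token dimension (in the spirit of batch normalization), and Attention-SH by letting each head attend over a random subset of key/value pairs (fitting the SVR model on less data). The normalization axis, the subsampling scheme, and the function names are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention on (B, H, N, D) tensors."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def attention_bn(q, k, v, eps=1e-5):
    """Attention-BN sketch (assumption): keys and values are centered and
    rescaled over the token axis before the usual softmax attention."""
    k = (k - k.mean(dim=-2, keepdim=True)) / (k.std(dim=-2, keepdim=True) + eps)
    v = (v - v.mean(dim=-2, keepdim=True)) / (v.std(dim=-2, keepdim=True) + eps)
    return softmax_attention(q, k, v)

def attention_sh(q, k, v, keep_ratio=0.5):
    """Attention-SH sketch (assumption): each head attends over a random
    subset of key/value pairs, i.e. the SVR model is fit on less data."""
    n = k.shape[-2]
    idx = torch.randperm(n)[: max(1, int(keep_ratio * n))]
    return softmax_attention(q, k[..., idx, :], v[..., idx, :])

# Toy shapes: batch 2, 4 heads, 16 tokens, head dim 8.
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))
print(softmax_attention(q, k, v).shape, attention_bn(q, k, v).shape, attention_sh(q, k, v).shape)
```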
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
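A minimal sketch of the DAPE V2 idea of treating the attention score matrix as a feature map: per-head, pre-softmax scores are stacked as channels and refined by a small depthwise 2D convolution before the softmax. The kernel size, the residual combination, and the module name are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvProcessedAttention(nn.Module):
    """Attention whose score matrix is refined by a depthwise 2D conv (sketch)."""
    def __init__(self, num_heads: int, kernel_size: int = 3):
        super().__init__()
        # One 2D filter per head, applied to the (N x N) score "image".
        self.score_conv = nn.Conv2d(num_heads, num_heads, kernel_size,
                                    padding=kernel_size // 2, groups=num_heads)

    def forward(self, q, k, v):          # q, k, v: (B, H, N, D)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (B, H, N, N)
        scores = scores + self.score_conv(scores)               # residual refinement
        return F.softmax(scores, dim=-1) @ v

attn = ConvProcessedAttention(num_heads=4)
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))
print(attn(q, k, v).shape)   # torch.Size([2, 4, 16, 8])
```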
- MENTOR: Human Perception-Guided Pretraining for Increased Generalization [5.596752018167751]
We introduce MENTOR (huMan pErceptioN-guided preTraining fOr increased geneRalization).
We train an autoencoder to learn human saliency maps given an input image, without class labels.
We remove the decoder part, add a classification layer on top of the encoder, and fine-tune this new model conventionally.
arXiv Detail & Related papers (2023-10-30T13:50:44Z)
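A compact sketch of the MENTOR-style two-stage pipeline described above: an autoencoder is first trained to predict human saliency maps from images without class labels, then the decoder is discarded and a classification head is fine-tuned on top of the encoder. The architecture sizes, losses, and layer names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stage 1 (sketch): an encoder-decoder trained to predict a human saliency map.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
)
images = torch.randn(8, 3, 64, 64)        # dummy image batch
saliency = torch.rand(8, 1, 64, 64)       # dummy human saliency maps (no class labels)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss = nn.functional.mse_loss(decoder(encoder(images)), saliency)
opt.zero_grad()
loss.backward()
opt.step()

# Stage 2 (sketch): drop the decoder, add a classification head, fine-tune conventionally.
classifier = nn.Sequential(encoder, nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))
print(classifier(images).shape)           # torch.Size([8, 10])
```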
- Centered Self-Attention Layers [89.21791761168032]
The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied.
We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers.
We present a correction term to the aggregating operator of these mechanisms.
arXiv Detail & Related papers (2023-06-02T15:19:08Z)
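One simple way to realize a correction term on the aggregating operator, sketched below under the assumption that centering means removing the uniform-averaging component of the attention matrix (not necessarily the paper's exact operator): subtracting (1/N) 11^T from the softmax attention makes each row sum to zero, so a value component shared by all tokens is no longer reinforced at deeper layers.

```python
import torch
import torch.nn.functional as F

def centered_attention(q, k, v, alpha: float = 1.0):
    """Self-attention with a centering correction on the aggregation (sketch).

    The aggregating operator A = softmax(QK^T / sqrt(d)) is replaced by
    A - alpha * (1/N) 11^T; with alpha = 1 each row sums to zero, so a value
    component shared by all tokens is removed instead of accumulated.
    """
    n = q.shape[-2]
    attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    correction = alpha * torch.full((n, n), 1.0 / n, dtype=q.dtype, device=q.device)
    return (attn - correction) @ v

q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))
print(centered_attention(q, k, v).shape)   # torch.Size([2, 4, 16, 8])
```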
- Self-Supervised Implicit Attention: Guided Attention by The Model Itself [1.3406858660972554]
We propose Self-Supervised Implicit Attention (SSIA), a new approach that adaptively guides deep neural network models to gain attention by exploiting the properties of the models themselves.
SSIA is a novel attention mechanism that does not require any extra parameters, computation, or memory access costs during inference.
Our implementation will be available on GitHub.
arXiv Detail & Related papers (2022-06-15T10:13:34Z)
- Visual Attention Emerges from Recurrent Sparse Reconstruction [82.78753751860603]
We present a new attention formulation, VARS, built on two prominent features of the human visual attention mechanism: recurrency and sparsity.
We show that self-attention is a special case of VARS with a single-step optimization and no sparsity constraint.
VARS can be readily used as a replacement for self-attention in popular vision transformers, consistently improving their robustness across various benchmarks.
arXiv Detail & Related papers (2022-04-23T00:35:02Z)
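A rough sketch of attention as recurrent sparse reconstruction in the spirit of VARS: token features are reconstructed from a dictionary by a few ISTA-style (iterative soft-thresholding) updates, and with a single step and no sparsity the update collapses to a plain attention-like matrix product. The dictionary choice, step size, and function name below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def recurrent_sparse_attention(x, dictionary, steps: int = 5, lam: float = 0.1, step_size: float = 0.1):
    """ISTA-style recurrent sparse reconstruction of tokens x (B, N, D)
    from a dictionary of atoms (K, D); returns the reconstruction (sketch)."""
    codes = torch.zeros(x.shape[0], x.shape[1], dictionary.shape[0])
    for _ in range(steps):
        residual = x - codes @ dictionary                      # (B, N, D)
        codes = codes + step_size * residual @ dictionary.T    # gradient step
        codes = F.softshrink(codes, lambd=lam)                 # sparsity via soft threshold
    return codes @ dictionary                                  # reconstructed tokens

x = torch.randn(2, 16, 8)
dictionary = torch.randn(32, 8)          # assumed learned dictionary atoms
print(recurrent_sparse_attention(x, dictionary).shape)  # torch.Size([2, 16, 8])

# With steps=1 and lam=0 the result is proportional to x @ D^T @ D: a single-step,
# non-sparse special case that resembles an (un-normalized) attention product.
```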
- Visual Attention Network [90.0753726786985]
We propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention.
We also introduce a novel neural network based on LKA, namely the Visual Attention Network (VAN).
VAN outperforms state-of-the-art vision transformers and convolutional neural networks by a large margin in extensive experiments.
arXiv Detail & Related papers (2022-02-20T06:35:18Z)
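A sketch of a large kernel attention (LKA) block in its commonly described decomposition: a depthwise convolution, a depthwise dilated convolution, and a pointwise (1x1) convolution, whose output gates the input feature map elementwise. The specific kernel sizes and dilation below follow typical LKA configurations but should be read as assumptions rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """LKA sketch: depthwise conv + depthwise dilated conv + 1x1 conv,
    used as an elementwise gate on the input feature map."""
    def __init__(self, dim: int):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                         # x: (B, C, H, W)
        attn = self.pointwise(self.dw_dilated(self.dw_conv(x)))
        return attn * x                           # attention acts as an elementwise gate

x = torch.randn(2, 64, 32, 32)
print(LargeKernelAttention(64)(x).shape)          # torch.Size([2, 64, 32, 32])
```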
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
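The near-equivalence claim above can be checked directly for the simplest pair: a per-patch fully-connected layer and a 1x1 convolution over the patch sequence compute the same function once their weights are shared. This is a small illustrative check, not code from the paper.

```python
import torch
import torch.nn as nn

# A patch sequence: batch of 2, 16 patches, embedding dim 32.
patches = torch.randn(2, 16, 32)

# A per-patch fully-connected layer and a 1x1 convolution with shared weights.
fc = nn.Linear(32, 32, bias=False)
conv = nn.Conv1d(32, 32, kernel_size=1, bias=False)
conv.weight.data = fc.weight.data.view(32, 32, 1).clone()

out_fc = fc(patches)                                  # (B, N, C)
out_conv = conv(patches.transpose(1, 2)).transpose(1, 2)
print(torch.allclose(out_fc, out_conv, atol=1e-6))    # True: identical computation
```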
- Evolving Attention with Residual Convolutions [29.305149185821882]
We propose a novel mechanism based on evolving attention to improve the performance of transformers.
The proposed attention mechanism achieves significant performance improvement over various state-of-the-art models for multiple tasks.
arXiv Detail & Related papers (2021-02-20T15:24:06Z)
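A sketch of the evolving-attention idea as summarized above: per-head attention score maps from the previous layer are treated as image channels, refined by a convolution, and combined residually with the current layer's raw scores before the softmax. The mixing weight, kernel size, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvolvingAttention(nn.Module):
    """Attention whose score maps evolve across layers via a residual conv (sketch)."""
    def __init__(self, num_heads: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha
        self.map_conv = nn.Conv2d(num_heads, num_heads, 3, padding=1)

    def forward(self, q, k, v, prev_scores=None):     # q, k, v: (B, H, N, D)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5    # (B, H, N, N)
        if prev_scores is not None:
            # Residually combine current scores with convolved previous-layer maps.
            scores = (1 - self.alpha) * scores + self.alpha * self.map_conv(prev_scores)
        out = F.softmax(scores, dim=-1) @ v
        return out, scores                            # pass the maps on to the next layer

layer1, layer2 = EvolvingAttention(4), EvolvingAttention(4)
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))
out1, maps1 = layer1(q, k, v)
out2, _ = layer2(q, k, v, prev_scores=maps1)
print(out2.shape)                                     # torch.Size([2, 4, 16, 8])
```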
- Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks [33.07113523598028]
We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask.
AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality.
arXiv Detail & Related papers (2020-11-20T13:58:21Z)
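A sketch of the Attention Pruning idea as summarized above: attention maps observed on a fixed dataset are averaged and thresholded into one global binary mask, which is then applied to every input at inference. The averaging and per-row thresholding rule below are illustrative assumptions; a real implementation would skip the masked positions entirely to save computation rather than just setting them to -inf.

```python
import torch
import torch.nn.functional as F

def attention_probs(q, k):
    return F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)

# Step 1 (sketch): observe attention maps on a fixed dataset and build one global mask.
observed = []
for _ in range(10):                                      # stand-in for the observation set
    q, k = torch.randn(2, 4, 16, 8), torch.randn(2, 4, 16, 8)
    observed.append(attention_probs(q, k).mean(dim=0))   # average over the batch
avg_attention = torch.stack(observed).mean(dim=0)        # (heads, N, N)
# Keep positions at or above each row's median observed attention (~50% sparsity).
keep = avg_attention >= avg_attention.quantile(0.5, dim=-1, keepdim=True)

# Step 2 (sketch): the same fixed global mask removes pruned positions for every input.
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))
scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
scores = scores.masked_fill(~keep, float("-inf"))
out = F.softmax(scores, dim=-1) @ v
print(out.shape)                                         # torch.Size([2, 4, 16, 8])
```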
- The Costs and Benefits of Goal-Directed Attention in Deep Convolutional Neural Networks [6.445605125467574]
People deploy top-down, goal-directed attention to accomplish tasks, such as finding lost keys.
Motivated by selective attention in categorisation models, we developed a goal-directed attention mechanism that can process naturalistic (photographic) stimuli.
Our attentional mechanism incorporates top-down influences from prefrontal cortex (PFC) to support goal-directed behaviour.
arXiv Detail & Related papers (2020-02-06T16:42:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.