Normalized Attention Without Probability Cage
- URL: http://arxiv.org/abs/2005.09561v1
- Date: Tue, 19 May 2020 16:26:34 GMT
- Title: Normalized Attention Without Probability Cage
- Authors: Oliver Richter and Roger Wattenhofer
- Abstract summary: We show limitations of constraining attention weights to the probability simplex.
We propose to replace the softmax in self-attention with normalization.
We support our insights with empirical results from more than 25,000 trained models.
- Score: 12.18340575383456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention architectures are widely used; they recently gained renewed
popularity with Transformers yielding a streak of state of the art results.
Yet, the geometrical implications of softmax-attention remain largely
unexplored. In this work we highlight the limitations of constraining attention
weights to the probability simplex and the resulting convex hull of value
vectors. We show that Transformers have a sequence-length-dependent bias towards
token isolation at initialization, and we contrast Transformers with simple max- and
sum-pooling, two strong baselines that are rarely reported. We propose to replace the
softmax in self-attention with normalization, yielding a hyperparameter and
data-bias robust, generally applicable architecture. We support our insights
with empirical results from more than 25,000 trained models. All results and
implementations are made available.
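The abstract names the mechanism (replace the softmax in self-attention with a normalization) and the two pooling baselines, but not the exact normalization. The sketch below is therefore only a plausible reading, not the authors' formulation: standard softmax attention, a softmax-free variant where the score-weighted aggregation is rescaled to zero mean and unit variance (an assumption), and the max- and sum-pooling baselines mentioned in the abstract.
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(Q, K, V):
    # Standard self-attention: weights live on the probability simplex,
    # so each output is a convex combination of the value vectors.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def normalized_attention(Q, K, V, eps=1e-6):
    # Illustrative softmax-free variant (an assumption, not the paper's exact
    # method): aggregate with raw scores, then rescale each output vector to
    # zero mean / unit variance so its magnitude stays controlled.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    out = scores @ V
    return (out - out.mean(axis=-1, keepdims=True)) / (out.std(axis=-1, keepdims=True) + eps)

def max_pool(V):
    # Element-wise max over tokens, one of the two baselines in the abstract.
    return V.max(axis=0)

def sum_pool(V):
    # Sum over tokens, the other baseline.
    return V.sum(axis=0)

rng = np.random.default_rng(0)
n, d = 8, 16                                   # sequence length, model width
X = rng.normal(size=(n, d))
Q, K, V = (X @ rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)        # (8, 16)
print(normalized_attention(Q, K, V).shape)     # (8, 16)
print(max_pool(V).shape, sum_pool(V).shape)    # (16,) (16,)
```
The point of the contrast is geometric: the softmax variant is confined to the convex hull of the value vectors, while the other three aggregations are not.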
Related papers
- On Vanishing Variance in Transformer Length Generalization [23.706900145711913]
We show that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules.
Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
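As a toy sanity check of the stated trend (not the paper's frontier-model setting), the sketch below measures the variance of a single-head attention output at random initialization for growing sequence lengths; because the weights are spread over more tokens, the output behaves like an average of more value vectors and its variance shrinks.
```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_output_variance(n, d=64, trials=200, seed=0):
    # Empirical variance of the entries of one attention output vector
    # at random initialization, for sequence length n.
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        Wq, Wk, Wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d))
        outs.append((A @ V)[0])          # output for the first token
    return np.var(np.stack(outs))

for n in (8, 32, 128, 512):
    print(n, attention_output_variance(n))   # variance decreases as n grows
```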
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
- SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention [14.672072173674039]
We show that transformers are incapable of converging to their true solution despite their high expressive power.
We propose a shallow lightweight transformer model that escapes bad local minima when optimized with sharpness-aware optimization.
In particular, SAMformer surpasses current state-of-the-art methods and is on par with the biggest foundation model MOIRAI while having significantly fewer parameters.
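Sharpness-aware minimization itself is a generic two-step update: perturb the weights along the gradient direction, then descend using the gradient taken at the perturbed point. The sketch below applies it to a toy least-squares problem; it illustrates only the optimizer, not SAMformer's architecture or its channel-wise attention.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=256)

def loss_and_grad(w):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

def sam_step(w, lr=0.1, rho=0.05):
    # 1) Move to the (approximate) worst-case weights within an L2 ball
    #    of radius rho around w.
    _, g = loss_and_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # 2) Descend using the gradient evaluated at the perturbed weights.
    _, g_adv = loss_and_grad(w + eps)
    return w - lr * g_adv

w = np.zeros(10)
for _ in range(200):
    w = sam_step(w)
print("final loss:", loss_and_grad(w)[0])
```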
arXiv Detail & Related papers (2024-02-15T18:55:05Z)
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- Attention over pre-trained Sentence Embeddings for Long Document Classification [4.38566347001872]
Transformers are often limited to short sequences due to their quadratic attention complexity in the number of tokens.
We suggest taking advantage of pre-trained sentence transformers to start from semantically meaningful embeddings of the individual sentences.
We report the results obtained by this simple architecture on three standard document classification datasets.
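The described architecture amounts to attention pooling over frozen sentence embeddings followed by a classifier. The sketch below uses random vectors as stand-ins for the pre-trained sentence-transformer embeddings and a single learned query vector for the pooling; the names and the single-query design are hypothetical, and only the forward pass is shown.
```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def classify_document(sentence_embs, query, W_out, b_out):
    # sentence_embs: (num_sentences, dim) from a pre-trained sentence
    # transformer (random placeholders here); query: learned (dim,) vector.
    scores = sentence_embs @ query / np.sqrt(sentence_embs.shape[1])
    alpha = softmax(scores)                  # one weight per sentence
    doc_emb = alpha @ sentence_embs          # attention-pooled document vector
    return doc_emb @ W_out + b_out           # class logits

rng = np.random.default_rng(0)
dim, num_sentences, num_classes = 384, 12, 3
sentence_embs = rng.normal(size=(num_sentences, dim))   # placeholder embeddings
query = rng.normal(size=dim)
W_out = rng.normal(scale=dim**-0.5, size=(dim, num_classes))
b_out = np.zeros(num_classes)
print(classify_document(sentence_embs, query, W_out, b_out))
```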
arXiv Detail & Related papers (2023-07-18T09:06:35Z)
- Sumformer: Universal Approximation for Efficient Transformers [2.4832703558223725]
We introduce Sumformer, a novel and simple architecture capable of universally approximating sequence-to-sequence functions.
We derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
arXiv Detail & Related papers (2023-07-05T13:59:35Z)
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce the class of efficient Transformers named Regularized Transformers (Reguformers).
The focus in our experiments is on oil & gas data, namely, well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
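The summary does not detail the specific position embedding, so the sketch below only illustrates the general idea behind relative position information for length extrapolation: an additive bias on the attention logits that depends on the clipped signed distance j - i rather than on absolute positions, so the same bias table applies to sequences longer than those seen in training. This is a generic construction, not the paper's method.
```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def relative_bias(n, max_dist=32, rng=None):
    # One (here random, in practice learned) scalar per clipped distance j - i.
    table = (rng or np.random.default_rng(0)).normal(size=2 * max_dist + 1)
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                  -max_dist, max_dist) + max_dist
    return table[idx]                          # (n, n) bias matrix

def attention_with_relative_bias(Q, K, V, max_dist=32):
    n, d = Q.shape
    logits = Q @ K.T / np.sqrt(d) + relative_bias(n, max_dist)
    return softmax(logits) @ V

rng = np.random.default_rng(0)
n, d = 16, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(attention_with_relative_bias(Q, K, V).shape)   # (16, 32)
```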
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
- Bird-Eye Transformers for Text Generation Models [49.47825106383972]
We propose a new architecture, called bird-eye transformer (BET), which goes one step further to improve the performance of transformers.
Our proposed model achieves better performance than the baseline transformer architectures on all datasets.
arXiv Detail & Related papers (2022-10-08T09:51:15Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
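SOFT's own kernel and decomposition are not spelled out in this summary; the sketch below shows only the generic reason removing the softmax enables linear complexity: with an elementwise non-negative feature map (softplus here, an arbitrary choice), the key-value summaries can be accumulated once and reused for every query, so no n-by-n attention matrix is ever formed. This is the standard linearized-attention trick, not SOFT's exact construction.
```python
import numpy as np

def phi(x):
    # Simple non-negative feature map (softplus); any elementwise
    # non-negative map works for this illustration.
    return np.log1p(np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # O(n * d^2) instead of O(n^2 * d): no n-by-n attention matrix is formed.
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                 # (d, d_v) summary of all keys and values
    z = Kf.sum(axis=0)            # (d,) normalizer
    return (Qf @ kv) / ((Qf @ z + eps)[:, None])

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)    # (1024, 64)
```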
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
- TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small and non-overlapping receptive fields (RF) for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z)
- Rethinking Attention with Performers [45.47365397101224]
We introduce Performers, Transformer architectures which can estimate full-rank-attention Transformers with provable accuracy.
Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods.
We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
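The core identity behind positive random features is that exp(q·k) equals the expectation over Gaussian w of exp(w·q - ||q||²/2) · exp(w·k - ||k||²/2), so a low-rank feature map gives an unbiased estimate of the softmax kernel. The sketch below is a simplified version of that idea using plain i.i.d. Gaussian features; it omits the orthogonality trick and other refinements of the full FAVOR+ method.
```python
import numpy as np

def positive_random_features(X, W):
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), with rows of W ~ N(0, I).
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def favor_attention(Q, K, V, m=2048, seed=0, eps=1e-6):
    # Estimate of softmax attention without forming the n-by-n matrix.
    d = Q.shape[-1]
    Qs, Ks = Q / d ** 0.25, K / d ** 0.25     # absorb the 1/sqrt(d) scaling
    W = np.random.default_rng(seed).normal(size=(m, d))
    Qf, Kf = positive_random_features(Qs, W), positive_random_features(Ks, W)
    num = Qf @ (Kf.T @ V)                     # (n, d_v)
    den = Qf @ Kf.sum(axis=0)                 # (n,)
    return num / ((den + eps)[:, None])

def exact_softmax_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(s - s.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(1)
n, d = 64, 16
V = rng.normal(size=(n, d))
Q, K = 0.5 * rng.normal(size=(n, d)), 0.5 * rng.normal(size=(n, d))  # moderate norms
approx, exact = favor_attention(Q, K, V), exact_softmax_attention(Q, K, V)
print("mean abs error:", np.abs(approx - exact).mean())  # shrinks as m grows
```
The accuracy of the estimate improves with the number of random features m, which is the lever the method trades against its linear time and memory cost.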
arXiv Detail & Related papers (2020-09-30T17:09:09Z)