Sinkformers: Transformers with Doubly Stochastic Attention
- URL: http://arxiv.org/abs/2110.11773v1
- Date: Fri, 22 Oct 2021 13:25:01 GMT
- Title: Sinkformers: Transformers with Doubly Stochastic Attention
- Authors: Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré
- Abstract summary: We use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer.
On the experimental side, we show Sinkformers enhance model accuracy in vision and natural language processing tasks.
Importantly, on 3D shapes classification, Sinkformers lead to a significant improvement.
- Score: 22.32840998053339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention based models such as Transformers involve pairwise interactions
between data points, modeled with a learnable attention matrix. Importantly,
this attention matrix is normalized with the SoftMax operator, which makes it
row-wise stochastic. In this paper, we propose instead to use Sinkhorn's
algorithm to make attention matrices doubly stochastic. We call the resulting
model a Sinkformer. We show that the row-wise stochastic attention matrices in
classical Transformers get close to doubly stochastic matrices as the number of
epochs increases, justifying the use of Sinkhorn normalization as an
informative prior. On the theoretical side, we show that, unlike the SoftMax
operation, this normalization makes it possible to understand the iterations of
self-attention modules as a discretized gradient-flow for the Wasserstein
metric. We also show in the infinite number of samples limit that, when
rescaling both attention matrices and depth, Sinkformers operate a heat
diffusion. On the experimental side, we show that Sinkformers enhance model
accuracy in vision and natural language processing tasks. In particular, on 3D
shapes classification, Sinkformers lead to a significant improvement.
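As a concrete illustration of the mechanism, below is a minimal NumPy sketch (our own, not the authors' code; names like sinkhorn_attention are illustrative) contrasting row-wise SoftMax normalization with Sinkhorn's alternating row/column normalization of an attention matrix:
```python
import numpy as np

def softmax_attention(scores):
    # Row-wise SoftMax: every row of the output sums to 1 (row-stochastic).
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sinkhorn_attention(scores, n_iters=5):
    # Sinkhorn's algorithm: start from the positive kernel exp(scores) and
    # alternately normalize rows and columns; the iterates converge to a
    # doubly stochastic matrix (all rows AND all columns sum to 1).
    K = np.exp(scores - scores.max())
    for _ in range(n_iters):
        K = K / K.sum(axis=1, keepdims=True)   # rows sum to 1
        K = K / K.sum(axis=0, keepdims=True)   # columns sum to 1
    return K

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 5))                    # pairwise attention scores
A = sinkhorn_attention(S, n_iters=50)
print(A.sum(axis=1))                           # rows ~ 1
print(A.sum(axis=0))                           # columns = 1 after the last step
```
Note that a single row normalization inside sinkhorn_attention coincides with softmax_attention, matching the paper's observation that SoftMax is the first iteration of Sinkhorn's algorithm.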
Related papers
- EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and, in principle, has stronger theoretical properties.
arXiv Detail & Related papers (2024-03-26T14:18:43Z)
- Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear (sketched below).
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate input sequence length up to 14K tokens in inference with consistent performance.
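As a minimal sketch of that trick (the generic circulant-embedding-plus-FFT construction, not the paper's code), a Toeplitz matrix-vector product can be computed in O(n log n):
```python
import numpy as np

def toeplitz_matvec(c, r, x):
    # Multiply the Toeplitz matrix T (first column c, first row r, c[0] == r[0])
    # by x in O(n log n): embed T in a 2n x 2n circulant matrix, whose action
    # is a circular convolution and is therefore diagonalized by the FFT.
    n = len(x)
    col = np.concatenate([c, [0.0], r[:0:-1]])       # first column of the circulant
    fx = np.fft.fft(np.concatenate([x, np.zeros(n)]))
    y = np.fft.ifft(np.fft.fft(col) * fx)
    return y[:n].real

# Sanity check against the dense O(n^2) product.
rng = np.random.default_rng(0)
n = 8
c = rng.normal(size=n)                                # first column
r = np.concatenate([[c[0]], rng.normal(size=n - 1)])  # first row
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
x = rng.normal(size=n)
assert np.allclose(T @ x, toeplitz_matvec(c, r, x))
```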
arXiv Detail & Related papers (2023-05-08T14:49:01Z)
- STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z)
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves these problems by endowing each attention head with a mixed-membership stochastic block model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing attempts to linearize attention are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce the complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can drop significantly.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
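To see why kernelizing helps, here is a generic linear-attention sketch in NumPy. The feature map phi below is a simple positive stand-in chosen for illustration, not cosFormer's cos-based re-weighting; the point is reassociating the product as phi(Q)(phi(K)^T V), which costs O(n d^2) rather than O(n^2 d):
```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized attention: the (n, n) weight matrix phi(Q) phi(K)^T is never
    # materialized; instead phi(K)^T V is aggregated once, making the cost
    # linear in the sequence length n.
    Qp, Kp = phi(Q), phi(K)               # (n, d) feature-mapped queries/keys
    KV = Kp.T @ V                         # (d, d_v) summary of keys and values
    Z = Qp @ Kp.sum(axis=0)               # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]         # (n, d_v)

rng = np.random.default_rng(0)
n, d = 256, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)           # same shape as softmax attention output
```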
arXiv Detail & Related papers (2022-02-17T17:53:48Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
We analyze how Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively.
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
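For intuition, here is a NumPy sketch of the low-dimensional-projection idea in the spirit of Linformer (an illustrative reconstruction, not Skeinformer or Linformer source code; E and F stand in for learned projections):
```python
import numpy as np

def projected_attention(Q, K, V, E, F):
    # Project the n keys/values down to k << n rows before the softmax,
    # so the score matrix is (n, k) instead of (n, n).
    Kp, Vp = E @ K, F @ V                        # (k, d) and (k, d_v)
    S = Q @ Kp.T / np.sqrt(Q.shape[-1])          # (n, k) scores
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)           # row-stochastic over k columns
    return P @ Vp                                # (n, d_v)

rng = np.random.default_rng(0)
n, d, k = 512, 64, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E = F = rng.normal(size=(k, n)) / np.sqrt(n)     # hypothetical fixed projections
out = projected_attention(Q, K, V, E, F)
```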
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work offers a new angle on model efficiency through an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)