Sparse Sinkhorn Attention
- URL: http://arxiv.org/abs/2002.11296v1
- Date: Wed, 26 Feb 2020 04:18:01 GMT
- Title: Sparse Sinkhorn Attention
- Authors: Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan
- Abstract summary: We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend.
We introduce a meta sorting network that learns to generate latent permutations over sequences.
Given sorted sequences, we are then able to compute quasi-global attention with only local windows.
- Score: 93.88158993722716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Sparse Sinkhorn Attention, a new efficient and sparse method for
learning to attend. Our method is based on differentiable sorting of internal
representations. Concretely, we introduce a meta sorting network that learns to
generate latent permutations over sequences. Given sorted sequences, we are
then able to compute quasi-global attention with only local windows, improving
the memory efficiency of the attention module. To this end, we propose new
algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a
dynamic sequence truncation method for tailoring Sinkhorn Attention for
encoding and/or decoding purposes. Via extensive experiments on algorithmic
seq2seq sorting, language modeling, pixel-wise image generation, document
classification and natural language inference, we demonstrate that our memory
efficient Sinkhorn Attention method is competitive with vanilla attention and
consistently outperforms recently proposed efficient Transformer models such as
Sparse Transformers.
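As a concrete illustration, here is a minimal PyTorch sketch (not the authors' reference code) of the core idea: scores from a stand-in meta sorting network are pushed through Sinkhorn normalization to obtain a relaxed, approximately doubly stochastic block permutation, after which attention would be restricted to local windows over the re-sorted blocks. The `meta_net` layer, block count, and iteration count are illustrative assumptions.
```python
import torch

def sinkhorn(log_scores: torch.Tensor, n_iters: int = 8) -> torch.Tensor:
    """Iteratively normalize rows and columns in log space."""
    z = log_scores
    for _ in range(n_iters):
        z = z - torch.logsumexp(z, dim=-1, keepdim=True)  # rows sum to 1
        z = z - torch.logsumexp(z, dim=-2, keepdim=True)  # columns sum to 1
    return z.exp()  # approximately doubly stochastic ("relaxed permutation")

blocks = torch.randn(2, 4, 16)                 # 2 sequences, 4 blocks of dim 16
meta_net = torch.nn.Linear(16, 4)              # stand-in for the meta sorting network
perm = sinkhorn(meta_net(blocks))              # (2, 4, 4) relaxed block permutation
sorted_blocks = torch.einsum('bij,bjd->bid', perm, blocks)
# Local windowed attention would then be computed over the re-sorted blocks.
```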
Related papers
- Compressed online Sinkhorn [3.2534959204741085]
We revisit the recently introduced online Sinkhorn algorithm of [Mensch and Peyré, 2020].
We improve the convergence analysis for the online Sinkhorn algorithm; the new rate we obtain is faster than the previous rate under certain parameter choices.
Secondly, we propose the compressed online Sinkhorn algorithm which combines measure compression techniques with the online Sinkhorn algorithm.
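For orientation, the sketch below shows the standard batch Sinkhorn iteration for entropic optimal transport between two discrete measures; it is the basic update that the online and compressed variants build upon, not the cited paper's algorithm itself.
```python
import numpy as np

def sinkhorn_ot(a, b, cost, eps=0.1, n_iters=200):
    """Approximate entropic OT plan between histograms a (n,) and b (m,) with cost (n, m)."""
    K = np.exp(-cost / eps)                    # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                      # scale columns to match b
        u = a / (K @ v)                        # scale rows to match a
    return u[:, None] * K * v[None, :]         # plan = diag(u) K diag(v)

# Toy usage: uniform histograms over 5 and 7 points on the unit interval.
x, y = np.linspace(0, 1, 5), np.linspace(0, 1, 7)
plan = sinkhorn_ot(np.full(5, 1 / 5), np.full(7, 1 / 7), (x[:, None] - y[None, :]) ** 2)
```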
arXiv Detail & Related papers (2023-10-08T05:33:32Z) - Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear.
A lightweight sub-network called the relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens at inference time with consistent performance.
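The log-linear complexity comes from the classical circulant-embedding trick: a Toeplitz matrix-vector product can be computed with FFTs. The sketch below is a generic illustration of that trick (assumed, not taken from the paper's code).
```python
import numpy as np

def toeplitz_matvec(c, r, x):
    """Multiply the Toeplitz matrix with first column c and first row r by x in O(n log n)."""
    n = len(x)
    gen = np.concatenate([c, [0.0], r[1:][::-1]])          # circulant embedding, length 2n
    y = np.fft.ifft(np.fft.fft(gen) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

# Check against the dense product for a small random Toeplitz matrix.
n = 6
c, r = np.random.randn(n), np.random.randn(n)
r[0] = c[0]                                                # shared diagonal entry
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
x = np.random.randn(n)
assert np.allclose(T @ x, toeplitz_matvec(c, r, x))
```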
arXiv Detail & Related papers (2023-05-08T14:49:01Z) - A Unified Framework for Implicit Sinkhorn Differentiation [58.56866763433335]
We propose an algorithm that obtains analytical gradients of a Sinkhorn layer via implicit differentiation.
We show that it is computationally more efficient, particularly when resources like GPU memory are scarce.
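As a toy illustration of the general principle (not the paper's Sinkhorn-specific derivation), the sketch below differentiates through a simple fixed point z = f(z, theta) with the implicit function theorem and checks the result against fully unrolled autograd.
```python
import torch

def f(z, theta):
    return 0.5 * torch.tanh(z) + theta                     # a simple contraction in z

theta = torch.tensor(0.3, requires_grad=True)

# Forward pass: iterate to the fixed point without tracking the solver in the graph.
with torch.no_grad():
    z = torch.zeros(())
    for _ in range(100):
        z = f(z, theta)

# Implicit function theorem: dz*/dtheta = (df/dtheta) / (1 - df/dz), evaluated at z*.
z_star = z.detach().requires_grad_(True)
fz = f(z_star, theta)
df_dz, = torch.autograd.grad(fz, z_star, retain_graph=True)
df_dtheta, = torch.autograd.grad(fz, theta)
grad_implicit = df_dtheta / (1.0 - df_dz)

# Reference: fully unrolled autograd (the memory-hungry alternative implicit diff avoids).
z_unrolled = torch.zeros(())
for _ in range(100):
    z_unrolled = f(z_unrolled, theta)
grad_unrolled, = torch.autograd.grad(z_unrolled, theta)
print(grad_implicit.item(), grad_unrolled.item())          # the two should closely agree
```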
arXiv Detail & Related papers (2022-05-13T14:45:31Z) - Discovering Non-monotonic Autoregressive Orderings with Variational Inference [67.27561153666211]
We develop an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data.
We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass.
Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
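The sketch below is a hypothetical illustration of the "permutations in one forward pass" idea: a score matrix over (position, rank) pairs is produced by a stand-in encoder head and decoded into a hard generation order with the Hungarian algorithm. The `rank_head` layer and shapes are assumptions, not the paper's model.
```python
import torch
from scipy.optimize import linear_sum_assignment

n, d = 6, 32
encoder_states = torch.randn(n, d)             # stand-in for Transformer encoder outputs
rank_head = torch.nn.Linear(d, n)              # hypothetical head scoring each rank
scores = rank_head(encoder_states)             # (n, n): position-vs-rank scores
rows, order = linear_sum_assignment(-scores.detach().numpy())  # maximize total score
print(order)                                   # order[i] = rank assigned to position i
```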
arXiv Detail & Related papers (2021-10-27T16:08:09Z) - ABC: Attention with Bounded-memory Control [67.40631793251997]
We show that a family of efficient attention methods can be subsumed into one abstraction: attention with bounded-memory control (ABC).
ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem distinct.
Finally, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their memory-organizing functions with a learned, contextualized one.
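Below is a minimal sketch of the bounded-memory abstraction, under the assumption that a control matrix `phi` summarizes length-n keys and values into m fixed memory slots before standard attention is applied; this is a generic construction, not the paper's instance.
```python
import torch
import torch.nn.functional as F

def abc_attention(q, k, v, phi):
    """q, k, v: (n, d); phi: (n, m) control weights mapping positions to m memory slots."""
    mem_k = phi.transpose(0, 1) @ k                           # (m, d) bounded key memory
    mem_v = phi.transpose(0, 1) @ v                           # (m, d) bounded value memory
    scores = q @ mem_k.transpose(0, 1) / k.shape[-1] ** 0.5   # (n, m) attention over slots
    return F.softmax(scores, dim=-1) @ mem_v                  # (n, d)

n, m, d = 128, 8, 64
x = torch.randn(n, d)
phi = F.softmax(torch.randn(n, m), dim=0)                     # stand-in for a learned control
out = abc_attention(x, x, x, phi)                             # memory cost bounded by m = 8
```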
arXiv Detail & Related papers (2021-10-06T03:53:25Z) - Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
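A minimal sketch (assumed, not the paper's implementation) of hard retrieval attention: each decoder query gathers the single value with the highest score instead of forming a softmax-weighted sum, which is what makes decoding cheaper.
```python
import torch

def hard_retrieval_attention(q, k, v):
    """q: (n_q, d); k, v: (n_k, d). Each query retrieves exactly one value."""
    scores = q @ k.transpose(0, 1) / k.shape[-1] ** 0.5   # (n_q, n_k) attention scores
    idx = scores.argmax(dim=-1)                           # hard retrieval: pick the best key
    return v[idx]                                         # (n_q, d), no weighted sum needed

q, k, v = torch.randn(4, 32), torch.randn(10, 32), torch.randn(10, 32)
out = hard_retrieval_attention(q, k, v)
```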
arXiv Detail & Related papers (2020-09-30T13:18:57Z) - Hard Non-Monotonic Attention for Character-Level Transduction [65.17388794270694]
We introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings.
We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the approximation and outperforms soft attention.
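Under the simplifying assumption that each output position aligns independently to one input position, exact marginalization over all (exponentially many) hard alignments reduces to a log-sum-exp per output position, as the toy sketch below shows; it is not the paper's full model.
```python
import torch

def alignment_marginal_log_likelihood(align_logits, emit_log_probs):
    """align_logits, emit_log_probs: (T_out, T_in);
    emit_log_probs[t, j] = log p(y_t | x_j). Returns log p(y | x)."""
    log_align = torch.log_softmax(align_logits, dim=-1)   # log p(a_t = j | x)
    # log p(y | x) = sum_t log sum_j p(a_t = j | x) p(y_t | x_j)
    return torch.logsumexp(log_align + emit_log_probs, dim=-1).sum()

ll = alignment_marginal_log_likelihood(torch.randn(5, 7), -torch.rand(5, 7))
```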
arXiv Detail & Related papers (2018-08-29T20:00:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.