Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
- URL: http://arxiv.org/abs/2310.14206v1
- Date: Sun, 22 Oct 2023 06:58:28 GMT
- Title: Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
- Authors: Ayan Sengupta, Md Shad Akhtar and Tanmoy Chakraborty
- Abstract summary: Multi-head self-attention-based Transformers have shown promise in different learning tasks.
We propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.
- Score: 39.14128923434994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-head self-attention-based Transformers have shown promise in different
learning tasks. Although these models exhibit significant improvement in
understanding short-term and long-term contexts from sequences, encoders of
Transformers and their variants fail to preserve layer-wise contextual
information. Transformers usually project tokens onto sparse manifolds and fail
to preserve mathematical equivalence among the token representations. In this
work, we propose TransJect, an encoder model that guarantees a theoretical
bound for layer-wise distance preservation between a pair of tokens. We propose
a simple alternative to dot-product attention to ensure Lipschitz continuity.
This allows TransJect to learn injective mappings to transform token
representations to different manifolds with similar topology and preserve
Euclidean distance between every pair of tokens in subsequent layers.
Evaluations across multiple benchmark short- and long-sequence classification
tasks show maximum improvements of 6.8% and 5.9%, respectively, over the
variants of Transformers. Additionally, TransJect displays 79% better
performance than Transformer on the language modeling task. We further
highlight the shortcomings of multi-head self-attention from the statistical
physics viewpoint. Although multi-head self-attention was conceived to learn
different levels of abstraction within the network, our empirical analyses
suggest that different attention heads learn in a random and disordered manner. In
contrast, TransJect adopts a mixture of experts for regularization; these
experts are more orderly and balanced and learn different sparse
representations from the input sequences. TransJect exhibits very low entropy
and can be efficiently scaled to larger depths.
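To make the distance-preservation property concrete, below is a minimal illustrative sketch rather than TransJect's actual construction (which is defined in the paper): it treats each encoder layer as an orthogonal token-mixing map, an injective 1-Lipschitz transformation, and checks numerically that pairwise Euclidean distances between tokens are unchanged after several layers. The function names, shapes, and the choice of orthogonal maps are assumptions made purely for illustration.

```python
# Illustrative sketch only: not TransJect's parameterization.
# Assumption: each "layer" is an orthogonal (hence injective, 1-Lipschitz)
# linear map, so pairwise Euclidean distances between tokens are preserved.
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(d: int) -> np.ndarray:
    """Sample a random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Euclidean distances between every pair of token representations (n, d)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

n_tokens, d_model, n_layers = 16, 64, 6
tokens = rng.standard_normal((n_tokens, d_model))

d_in = pairwise_distances(tokens)
x = tokens
for _ in range(n_layers):
    # Orthogonal mixing is an isometry: all pairwise distances are kept exactly.
    x = x @ random_orthogonal(d_model)
d_out = pairwise_distances(x)

print(np.max(np.abs(d_out - d_in)))  # ~1e-13, i.e. numerical error only
```

Replacing the orthogonal map with an arbitrary learned projection generally breaks this property; that gap is what the paper's Lipschitz-continuous alternative to dot-product attention is designed to close.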
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z)
- Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals [31.328766460487355]
We show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity.
We propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens (a minimal sketch of this kind of penalty appears after this list).
We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations.
arXiv Detail & Related papers (2023-12-01T17:52:47Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
We propose a Multi-Tailed Vision Transformer (MT-ViT) in the paper.
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
- Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer [51.79399904527525]
We propose Smart Bird, which is an efficient and effective Transformer with learnable sparse attention.
In Smart Bird, we first compute a sketched attention matrix with a single-head low-dimensional Transformer.
We then sample token pairs based on their probability scores derived from the sketched attention matrix to generate different sparse attention index matrices for different attention heads.
arXiv Detail & Related papers (2021-08-20T14:22:00Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
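As referenced in the over-smoothing entry above, the fidelity penalty it summarizes can be sketched in a few lines. This is a hedged illustration based only on that one-sentence summary; the exact loss, weighting, and placement used by NeuTRENO may differ, and the tensor shapes and names below are assumptions.

```python
# Sketch of a fidelity-style penalty in the spirit of the over-smoothing entry
# above; not the paper's exact formulation.
import torch

def fidelity_penalty(attn_out: torch.Tensor, attn_in: torch.Tensor,
                     weight: float = 0.1) -> torch.Tensor:
    """Penalize the norm of the difference between self-attention outputs and
    their input tokens, discouraging overly uniform (over-smoothed) tokens."""
    return weight * torch.linalg.vector_norm(attn_out - attn_in, ord=2, dim=-1).mean()

# Illustrative usage with hypothetical shapes (batch, sequence, hidden).
x = torch.randn(2, 128, 64)          # tokens entering a self-attention block
attn_out = torch.randn(2, 128, 64)   # stand-in for the block's output
task_loss = torch.tensor(0.0)        # placeholder for the real objective
total_loss = task_loss + fidelity_penalty(attn_out, x)
```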