Transformers are Universal In-context Learners
- URL: http://arxiv.org/abs/2408.01367v2
- Date: Thu, 3 Oct 2024 02:43:59 GMT
- Title: Transformers are Universal In-context Learners
- Authors: Takashi Furuya, Maarten V. de Hoop, Gabriel Peyré,
- Abstract summary: We show that deep transformers can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains.
A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens.
- Score: 21.513210412394965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are deep architectures that define "in-context mappings" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In this work, we study in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically, uniformly address their expressivity, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens which becomes discrete for a finite number of these. The relevant notion of smoothness then corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLPs between multi-head attention layers is also explicitly controlled. We consider both unmasked attentions (as used for the vision transformer) and masked causal attentions (as used for NLP and time series applications). We tackle the causal setting leveraging a space-time lifting to analyze causal attention as a mapping over probability distributions of tokens.
Related papers
- Next-token prediction capacity: general upper bounds and a lower bound for transformers [24.31928133575083]
We show that a decoder-only transformer can interpolate next-token distributions for context sequences.
We show that the minimal number of parameters for memorization is sufficient for being able to train the model to the entropy lower bound.
arXiv Detail & Related papers (2024-05-22T15:09:41Z) - Manifold-Preserving Transformers are Effective for Short-Long Range
Encoding [39.14128923434994]
Multi-head self-attention-based Transformers have shown promise in different learning tasks.
We propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.
arXiv Detail & Related papers (2023-10-22T06:58:28Z) - Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z) - Addressing Token Uniformity in Transformers via Singular Value
Transformation [24.039280291845706]
Token uniformity is commonly observed in transformer-based models.
We show that a less skewed singular value distribution can alleviate the token uniformity' problem.
arXiv Detail & Related papers (2022-08-24T22:44:09Z) - Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot
Segmentation [58.4650849317274]
Volumetric Aggregation with Transformers (VAT) is a cost aggregation network for few-shot segmentation.
VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
arXiv Detail & Related papers (2022-07-22T04:10:30Z) - The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight on the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z) - PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose a PSViT: a ViT with token Pooling and attention Sharing to reduce the redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z) - KVT: k-NN Attention for Boosting Vision Transformers [44.189475770152185]
We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers.
The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations.
We verify, both theoretically and empirically, that $k$-NN attention is powerful in distilling noise from input tokens and in speeding up training.
arXiv Detail & Related papers (2021-05-28T06:49:10Z) - Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z) - Non-Autoregressive Machine Translation with Disentangled Context
Transformer [70.95181466892795]
State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens.
We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts.
Our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
arXiv Detail & Related papers (2020-01-15T05:32:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.