Towards Understanding Inductive Bias in Transformers: A View From Infinity
- URL: http://arxiv.org/abs/2402.05173v2
- Date: Tue, 28 May 2024 13:41:26 GMT
- Title: Towards Understanding Inductive Bias in Transformers: A View From Infinity
- Authors: Itay Lavie, Guy Gur-Ari, Zohar Ringel
- Abstract summary: We argue that transformers tend to be biased towards more permutation-symmetric functions in sequence space.
We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions.
We argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
- Score: 9.00214539845063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study inductive bias in Transformers in the infinitely over-parameterized Gaussian process limit and argue that transformers tend to be biased towards more permutation-symmetric functions in sequence space. We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions when the dataset is symmetric to permutations between tokens. We present a simplified transformer block and solve the model in this limit, including accurate predictions for the learning curves and network outputs. We show that in common setups, one can derive tight bounds in the form of a scaling law for the learnability as a function of the context length. Finally, we argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
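As a rough numerical illustration of the symmetry decomposition the abstract refers to (not the paper's analytical procedure): the sketch below projects a toy function over token sequences onto its permutation-symmetric component, i.e. its average over the symmetric group acting on token positions, and reports the fraction of variance that component captures. The target function `f`, the vocabulary size, and the context length are all made up for illustration; the paper itself works analytically in the Gaussian process limit.

```python
# Minimal sketch: estimate how much of a toy sequence function lies in the
# permutation-symmetric (trivial-irrep) component, by averaging over all
# permutations of token positions. Not the authors' procedure; illustrative only.
import itertools
import numpy as np

rng = np.random.default_rng(0)

L = 4            # context length (tiny, so all L! permutations can be enumerated)
vocab = 8        # hypothetical vocabulary size
n_seqs = 2000    # number of sampled sequences

# Hypothetical scalar target over sequences: a position-dependent part
# (breaks permutation symmetry) plus a permutation-symmetric part.
W = rng.normal(size=(L, vocab))
def f(seq):
    return sum(W[i, t] for i, t in enumerate(seq)) + 0.5 * len(set(seq))

seqs = rng.integers(0, vocab, size=(n_seqs, L))
vals = np.array([f(s) for s in seqs])

# Projection onto the symmetric component: average f over all position permutations.
perms = list(itertools.permutations(range(L)))
sym_vals = np.array([np.mean([f(s[list(p)]) for p in perms]) for s in seqs])

# Fraction of the (centered) variance of f captured by its symmetric component.
v, sv = vals - vals.mean(), sym_vals - sym_vals.mean()
sym_fraction = np.dot(sv, sv) / np.dot(v, v)
print(f"fraction of variance in the permutation-symmetric component: {sym_fraction:.3f}")
```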
Related papers
- Graph Transformers Dream of Electric Flow [72.06286909236827]
We show that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems.
We present explicit weight configurations for implementing each such graph algorithm, and we bound the errors of the constructed Transformers by the errors of the underlying algorithms.
arXiv Detail & Related papers (2024-10-22T05:11:45Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the non-convexity of the overall landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Clustering in pure-attention hardmax transformers and its role in sentiment analysis [0.0]
We rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity.
We show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders.
We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model.
arXiv Detail & Related papers (2024-06-26T16:13:35Z) - Permutation-invariant quantum circuits [4.900041609957432]
We show how to integrate permutation symmetry, the most restrictive discrete symmetry, into quantum circuits.
The scaling of the number of parameters is found to be $\mathcal{O}(n^3)$, significantly lower than the general case.
arXiv Detail & Related papers (2023-12-22T18:42:48Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z) - Generative Adversarial Symmetry Discovery [19.098785309131458]
LieGAN represents symmetry as an interpretable Lie algebra basis and can discover various symmetries.
The learned symmetry can also be readily used in several existing equivariant neural networks to improve accuracy and generalization in prediction.
arXiv Detail & Related papers (2023-02-01T04:28:36Z) - The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight on the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z) - A Probabilistic Interpretation of Transformers [91.3755431537592]
We propose a probabilistic interpretation of the exponential dot-product attention of transformers and of contrastive learning based on exponential families.
We state theoretical limitations of our theory and the Hopfield theory and suggest directions for resolution.
arXiv Detail & Related papers (2022-04-28T23:05:02Z) - Learning Symmetric Embeddings for Equivariant World Models [9.781637768189158]
We propose learning symmetric embedding networks (SENs) that encode an input space (e.g. images) into a feature space that transforms in a known way under a symmetry group.
This network can be trained end-to-end with an equivariant task network to learn an explicitly symmetric representation.
Our experiments demonstrate that SENs facilitate the application of equivariant networks to data with complex symmetry representations.
arXiv Detail & Related papers (2022-04-24T22:31:52Z)