Towards Understanding Inductive Bias in Transformers: A View From Infinity
- URL: http://arxiv.org/abs/2402.05173v2
- Date: Tue, 28 May 2024 13:41:26 GMT
- Title: Towards Understanding Inductive Bias in Transformers: A View From Infinity
- Authors: Itay Lavie, Guy Gur-Ari, Zohar Ringel
- Abstract summary: We argue that transformers tend to be biased towards more permutation-symmetric functions in sequence space.
We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions.
We argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
- Score: 9.00214539845063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study inductive bias in Transformers in the infinitely over-parameterized Gaussian process limit and argue that transformers tend to be biased towards more permutation-symmetric functions in sequence space. We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions when the dataset is symmetric to permutations between tokens. We present a simplified transformer block and solve the model in this limit, including accurate predictions for the learning curves and network outputs. We show that in common setups, one can derive tight bounds in the form of a scaling law for the learnability as a function of the context length. Finally, we argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
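The abstract's central quantity is how much of a target function over token sequences lies in the permutation-symmetric component, since that component is claimed to be learned more easily. The snippet below is a minimal illustrative sketch, not the authors' code: the vocabulary size, context length, and toy target function are all assumptions made up for the example. It projects a function on sequences onto its S_n-symmetric part by averaging over all permutations of the tokens and reports the fraction of variance carried by that part.

```python
# Minimal sketch (not from the paper): estimate what fraction of a toy
# sequence function is permutation-symmetric by averaging over S_n.
import itertools
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5    # assumed toy vocabulary size
context_len = 4   # sequence length n; |S_n| = 24 permutations to average over


def target_fn(seq: np.ndarray) -> float:
    """Toy target: a symmetric part (sum of tokens) plus a
    position-dependent part that breaks permutation symmetry."""
    positions = np.arange(len(seq))
    return float(seq.sum() + 0.3 * (seq * positions).sum())


def symmetric_part(fn, seq: np.ndarray) -> float:
    """Project fn onto permutation-symmetric functions by averaging its
    values over all permutations of the input tokens."""
    perms = itertools.permutations(range(len(seq)))
    return float(np.mean([fn(seq[list(p)]) for p in perms]))


# Sample random sequences and compare the variance of the full function
# with the variance of its symmetric projection.
seqs = rng.integers(0, vocab_size, size=(2000, context_len))
f_vals = np.array([target_fn(s) for s in seqs])
f_sym = np.array([symmetric_part(target_fn, s) for s in seqs])

frac_symmetric = np.var(f_sym) / np.var(f_vals)
print(f"fraction of variance in the symmetric component: {frac_symmetric:.3f}")
```

The same variance-ratio idea could, in principle, be applied to empirical next-token statistics to probe the degree of permutation symmetry in a text corpus such as WikiText, though the paper's actual analysis relies on the representation theory of the symmetric group rather than this brute-force averaging.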
Related papers
- Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size [6.9060054915724]
Transformers are a type of neural network that has demonstrated remarkable performance across various domains.
In this study, we investigate the ability of transformers to approximate column-symmetric polynomials.
arXiv Detail & Related papers (2025-02-17T05:56:11Z) - Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion [43.299430093251736]
We introduce rotation symmetry, a novel form of parameter space symmetry for transformers.
Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers.
We propose a theoretically optimal matching algorithm as a plug-and-play module to enhance model fusion.
arXiv Detail & Related papers (2025-02-01T01:44:55Z) - Graph Transformers Dream of Electric Flow [72.06286909236827]
We show that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems.
We present explicit weight configurations for implementing each such graph algorithm, and we bound the errors of the constructed Transformers by the errors of the underlying algorithms.
arXiv Detail & Related papers (2024-10-22T05:11:45Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the non-convexity of the optimization landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z) - Generative Adversarial Symmetry Discovery [19.098785309131458]
LieGAN represents symmetry as interpretable Lie algebra basis and can discover various symmetries.
The learned symmetry can also be readily used in several existing equivariant neural networks to improve accuracy and generalization in prediction.
arXiv Detail & Related papers (2023-02-01T04:28:36Z) - The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight into the power of transformers via known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z) - A Probabilistic Interpretation of Transformers [91.3755431537592]
We propose a probabilistic interpretation of the exponential dot-product attention of transformers, and of contrastive learning, based on exponential families.
We state theoretical limitations of our theory and the Hopfield theory and suggest directions for resolution.
arXiv Detail & Related papers (2022-04-28T23:05:02Z) - Learning Symmetric Embeddings for Equivariant World Models [9.781637768189158]
We propose learning symmetric embedding networks (SENs) that encode an input space (e.g., images) into a feature space that transforms in a known manner under symmetry operations.
This network can be trained end-to-end with an equivariant task network to learn an explicitly symmetric representation.
Our experiments demonstrate that SENs facilitate the application of equivariant networks to data with complex symmetry representations.
arXiv Detail & Related papers (2022-04-24T22:31:52Z)