Breaking Symmetry When Training Transformers
- URL: http://arxiv.org/abs/2402.05969v2
- Date: Sun, 16 Jun 2024 22:18:36 GMT
- Title: Breaking Symmetry When Training Transformers
- Authors: Chunsheng Zuo, Michael Guerzhoy
- Abstract summary: We show that the prediction for output token $n+1$ of Transformer architectures without either positional encodings or causal attention is invariant to permutations of input tokens $1, 2, \ldots, n-1$.
We elaborate on the argument that the causal attention mechanism must be responsible for the fact that Transformers are able to model input sequences in which order matters.
- Score: 3.434553688053531
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As we show in this paper, the prediction for output token $n+1$ of Transformer architectures without either positional encodings or causal attention is invariant to permutations of input tokens $1, 2, \ldots, n-1$. Usually, both mechanisms are employed, and the symmetry with respect to the input tokens is broken. Recently, it has been shown that Transformers can be trained without positional encodings; this must be enabled by the causal attention mechanism. In this paper, we elaborate on the argument that causal attention must be responsible for the fact that Transformers are able to model input sequences in which order matters. Vertical "slices" of a Transformer are all encouraged to represent the same location $k$ in the input sequence. We hypothesize that residual connections contribute to this phenomenon, and present evidence for it.
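The invariance claim is easy to check empirically. Below is a minimal sketch (assuming a recent PyTorch; the model sizes, seed, and specific permutation are illustrative, not from the paper) that feeds token embeddings without positional encodings through a two-layer Transformer encoder: with bidirectional attention, the output at the last position is unchanged under a permutation of the earlier tokens, while adding a causal mask breaks the symmetry. Note that at least two layers are needed for the causal mask to break symmetry at the last position, since a single causal layer still aggregates an unordered set of value vectors there.

```python
# Minimal sketch of the symmetry claim; dimensions and seed are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n = 16, 6
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()  # eval() disables dropout

x = torch.randn(1, n, d_model)           # token embeddings, no positional encodings
perm = torch.tensor([3, 0, 2, 1, 4, 5])  # permute tokens 1..n-1, keep the last token fixed
x_perm = x[:, perm, :]

with torch.no_grad():
    # No causal mask: the prediction at the last position is permutation-invariant.
    same = torch.allclose(encoder(x)[:, -1], encoder(x_perm)[:, -1], atol=1e-5)
    print(same)    # True

    # With a causal mask (and >= 2 layers), the same permutation changes the output.
    causal = nn.Transformer.generate_square_subsequent_mask(n)
    same_c = torch.allclose(encoder(x, mask=causal)[:, -1],
                            encoder(x_perm, mask=causal)[:, -1], atol=1e-5)
    print(same_c)  # False
```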
Related papers
- On-Chip Learning via Transformer In-Context Learning [0.9353041869660692]
The self-attention mechanism requires transferring prior token projections from main memory at each time step.
We present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention.
arXiv Detail & Related papers (2024-10-11T10:54:09Z)
- Optimal Memorization Capacity of Transformers [32.01426831450348]
We show that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$.
We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary for Transformers with hardmax.
arXiv Detail & Related papers (2024-09-26T09:36:47Z)
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- Toward a Theory of Tokenization in LLMs [26.516041872337887]
We study tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes.
We show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k$-th order Markov sources near-optimally.
arXiv Detail & Related papers (2024-04-12T09:01:14Z)
- How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z)
- Causal Interpretation of Self-Attention in Pre-Trained Transformers [4.419843514606336]
We propose a causal interpretation of self-attention in the Transformer neural network architecture.
We use self-attention as a mechanism that estimates a structural equation model for a given input sequence of symbols.
We demonstrate this method by providing causal explanations for the outcomes of Transformers in two tasks: sentiment classification (NLP) and recommendation.
arXiv Detail & Related papers (2023-10-31T09:27:12Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
The regularization objective TokenProp is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
- SepTr: Separable Transformer for Audio Spectrogram Processing [74.41172054754928]
We propose a new vision transformer architecture called Separable Transformer (SepTr).
SepTr employs two transformer blocks in a sequential manner: the first attends to tokens within the same frequency bin, and the second attends to tokens within the same time interval (a minimal sketch of this axis-wise scheme follows this entry).
We conduct experiments on three benchmark datasets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
arXiv Detail & Related papers (2022-03-17T19:48:43Z)
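As a companion to the SepTr entry above, here is a hedged sketch (assuming PyTorch; this is not the authors' implementation, and all sizes are illustrative) of the separable scheme: one transformer block attends along the time axis within each frequency bin, then a second attends along the frequency axis within each time interval, by folding the other axis into the batch dimension.

```python
# Sketch of axis-wise ("separable") attention over a grid of spectrogram tokens.
import torch
import torch.nn as nn

class SepTrSketch(nn.Module):
    def __init__(self, d_model=32, nhead=4):
        super().__init__()
        self.time_block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.freq_block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, x):                       # x: (B, T, F, D) spectrogram tokens
        B, T, F, D = x.shape
        # Attend over time within each frequency bin: fold F into the batch.
        x = self.time_block(x.permute(0, 2, 1, 3).reshape(B * F, T, D))
        x = x.reshape(B, F, T, D).permute(0, 2, 1, 3)
        # Attend over frequency within each time interval: fold T into the batch.
        x = self.freq_block(x.reshape(B * T, F, D))
        return x.reshape(B, T, F, D)

tokens = torch.randn(2, 8, 8, 32)               # batch of 2, an 8x8 token grid
print(SepTrSketch()(tokens).shape)              # torch.Size([2, 8, 8, 32])
```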
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.