Transformers represent belief state geometry in their residual stream
- URL: http://arxiv.org/abs/2405.15943v3
- Date: Tue, 04 Feb 2025 03:38:57 GMT
- Title: Transformers represent belief state geometry in their residual stream
- Authors: Adam S. Shai, Sarah E. Marzen, Lucas Teixeira, Alexander Gietelink Oldenziel, Paul M. Riechers
- Abstract summary: We present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. Our work provides a general framework connecting the structure of training data to the geometric structure of activations inside transformers.
- Score: 40.803656512527645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: What computational structure are we building into large language models when we train them on next-token prediction? Here, we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. Leveraging the theory of optimal prediction, we anticipate and then find that belief states are linearly represented in the residual stream of transformers, even in cases where the predicted belief state geometry has highly nontrivial fractal structure. We investigate cases where the belief state geometry is represented in the final residual stream or distributed across the residual streams of multiple layers, providing a framework to explain these observations. Furthermore we demonstrate that the inferred belief states contain information about the entire future, beyond the local next-token prediction that the transformers are explicitly trained on. Our work provides a general framework connecting the structure of training data to the geometric structure of activations inside transformers.
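To make the abstract's claim concrete, below is a minimal sketch (not the authors' code) of the two ingredients it describes: optimal Bayesian belief updating over the hidden states of a toy hidden Markov model, and a least-squares linear probe from residual-stream activations onto the belief simplex. The HMM parameters, the model width, and the fabricated activations are illustrative assumptions; in the paper the activations come from a trained transformer.

```python
# Sketch, under the assumptions stated above: Bayesian belief updating over the
# hidden states of a toy 3-state HMM, plus a linear probe from stand-in
# residual-stream activations to belief states.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating HMM: 3 hidden states, 2 output tokens.
# T[s, s'] = P(next state s' | state s); E[s, o] = P(token o | state s).
T = np.array([[0.80, 0.10, 0.10],
              [0.10, 0.80, 0.10],
              [0.10, 0.10, 0.80]])
E = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

def update_belief(belief, token):
    """One step of optimal belief updating: condition on the emitted token,
    then propagate the posterior through the transition dynamics."""
    posterior = belief * E[:, token]      # unnormalized P(state | history, token)
    posterior /= posterior.sum()
    return posterior @ T                  # predictive belief over the next hidden state

# Sample a token sequence from the HMM and record the belief-state trajectory.
state = rng.integers(3)
belief = np.full(3, 1.0 / 3.0)            # uniform prior over hidden states
beliefs = []
for _ in range(5000):
    token = rng.choice(2, p=E[state])
    state = rng.choice(3, p=T[state])
    belief = update_belief(belief, token)
    beliefs.append(belief)
beliefs = np.array(beliefs)               # points on the probability simplex

# Stand-in for transformer residual-stream activations at each position.
# Here we fabricate activations that contain the beliefs linearly, purely to
# illustrate the probing step; in the paper they are read from a trained model.
d_model = 64
W_true = rng.normal(size=(3, d_model))
acts = beliefs @ W_true + 0.01 * rng.normal(size=(len(beliefs), d_model))

# Least-squares linear regression from activations to belief states: if the
# beliefs are linearly represented, the probe recovers them accurately.
W_probe, *_ = np.linalg.lstsq(acts, beliefs, rcond=None)
pred = acts @ W_probe
print("mean abs probe error:", np.abs(pred - beliefs).mean())
```

If the belief states are linearly embedded in the activations, the probe error is small; running the same regression on activations from a trained transformer and plotting the predicted points on the simplex is the kind of test the abstract refers to, including the cases where the predicted geometry is fractal.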
Related papers
- On the Robustness of Transformers against Context Hijacking for Linear Classification [26.1838836907147]
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities.
They can be disrupted by factually correct context, a phenomenon known as context hijacking.
We show that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations.
arXiv Detail & Related papers (2025-02-21T17:31:00Z) - Constrained belief updates explain geometric structures in transformer representations [0.0]
We integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models.
We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure.
arXiv Detail & Related papers (2025-02-04T03:03:54Z) - Transformers trained on proteins can learn to attend to Euclidean distance [0.0]
We show that Transformers can function independently as structure models when passed linear embeddings of coordinates.
We also show that pre-training protein Transformer encoders with structure improves performance on a downstream task.
arXiv Detail & Related papers (2025-02-03T17:12:44Z) - Dynamics of Transient Structure in In-Context Linear Regression Transformers [0.5242869847419834]
We show that when transformers are trained on in-context linear regression tasks with intermediate task diversity, they behave like ridge regression before specializing to the tasks in their training distribution.
This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis.
We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
arXiv Detail & Related papers (2025-01-29T16:32:14Z) - Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data [4.481230230086981]
In deep neural networks, a model's generalization error is often observed to follow a power law that depends on both model size and data size.
We show that our theory predicts a power law between the generalization error and both the training data size and the network size for transformers.
By leveraging low-dimensional data structures under a manifold hypothesis, we are able to explain transformer scaling laws in a way which respects the data geometry.
arXiv Detail & Related papers (2024-11-11T01:05:28Z) - Grokking of Hierarchical Structure in Vanilla Transformers [72.45375959893218]
We show that transformer language models can learn to generalize hierarchically after training for extremely long periods.
Intermediate-depth models generalize better than both very deep and very shallow transformers.
arXiv Detail & Related papers (2023-05-30T04:34:13Z) - DejaVu: Conditional Regenerative Learning to Enhance Dense Prediction [45.89461725594674]
We use conditional image regeneration as additional supervision during training to improve deep networks for dense prediction tasks.
DejaVu can be extended to incorporate an attention-based regeneration module within the dense prediction network.
arXiv Detail & Related papers (2023-03-02T20:56:36Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z) - Unsupervised Learning of Equivariant Structure from Sequences [30.974508897223124]
We present an unsupervised framework to learn symmetry from time sequences of length at least three.
We demonstrate that, with our framework, the hidden disentangled structure of the dataset naturally emerges as a by-product.
arXiv Detail & Related papers (2022-10-12T07:29:18Z) - Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
The first method, TP-Transformer, augments the traditional Transformer architecture with an additional component to represent structure.
The second method imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z) - Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)