Transformers Use Causal World Models in Maze-Solving Tasks
- URL: http://arxiv.org/abs/2412.11867v2
- Date: Wed, 05 Mar 2025 23:16:16 GMT
- Title: Transformers Use Causal World Models in Maze-Solving Tasks
- Authors: Alex F. Spies, William Edwards, Michael I. Ivanitskiy, Adrians Skapars, Tilman Räuker, Katsumi Inoue, Alessandra Russo, Murray Shanahan
- Abstract summary: We identify World Models in transformers trained on maze-solving tasks. We find that it is easier to activate features than to suppress them. Positional encoding schemes appear to influence how World Models are structured within the model's residual stream.
- Score: 49.67445252528868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop highly structured representations. When such representations comprehensively reflect the task domain's structure, they are commonly referred to as "World Models" (WMs). In this work, we identify WMs in transformers trained on maze-solving tasks. By using Sparse Autoencoders (SAEs) and analyzing attention patterns, we examine the construction of WMs and demonstrate consistency between SAE feature-based and circuit-based analyses. By subsequently intervening on isolated features to confirm their causal role, we find that it is easier to activate features than to suppress them. Furthermore, we find that models can reason about mazes involving more simultaneously active features than they encountered during training; however, when these same mazes (with greater numbers of connections) are provided to models via input tokens instead, the models fail. Finally, we demonstrate that positional encoding schemes appear to influence how World Models are structured within the model's residual stream.
Related papers
- Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment [53.90425382758605]
We show how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks.
Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks.
arXiv Detail & Related papers (2025-01-06T13:37:13Z) - Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures [49.24097977047392]
We investigate two mainstream architectures for language modeling, namely Transformers and Mambas, to explore the extent of their mechanistic similarity.
We propose to use Sparse Autoencoders (SAEs) to isolate interpretable features from these models and show that most features are similar in these two models.
arXiv Detail & Related papers (2024-10-09T08:28:53Z) - How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining [66.08606211686339]
We provide the first end-to-end theoretical guarantee of learning one-layer transformers in masked reconstruction self-supervised pretraining.
On the conceptual side, we posit a mechanism of how transformers trained with masked vision pretraining objectives produce empirically observed local and diverse attention patterns.
On the technical side, our end-to-end characterization of training dynamics in softmax-attention models simultaneously accounts for input and position embeddings.
arXiv Detail & Related papers (2024-03-04T17:24:03Z) - A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task [14.921790126851008]
We present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task.
We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence.
arXiv Detail & Related papers (2024-02-19T08:04:25Z) - Structured World Representations in Maze-Solving Transformers [3.75591091941815]
This work focuses on the abstractions formed by small transformer models.
We find evidence for the consistent emergence of structured internal representations of maze topology and valid paths.
We also take steps towards deciphering the circuitry of path-following by identifying attention heads.
arXiv Detail & Related papers (2023-12-05T08:24:26Z) - Understanding Addition in Transformers [2.07180164747172]
This paper provides a comprehensive analysis of a one-layer Transformer model trained to perform n-digit integer addition.
Our findings suggest that the model dissects the task into parallel streams dedicated to individual digits, employing varied algorithms tailored to different positions within the digits.
arXiv Detail & Related papers (2023-10-19T19:34:42Z) - Emergence of Segmentation with Minimalistic White-Box Transformers [22.688777622988795]
Previous works have shown that segmentation properties emerge in vision transformers (ViTs) trained using self-supervised methods such as DINO, but not in those trained on supervised classification tasks.
In this study, we probe whether segmentation emerges in transformer-based models solely as a result of intricate self-supervised learning mechanisms.
Our results suggest a path to design white-box foundation models that are simultaneously highly performant and mathematically fully interpretable.
arXiv Detail & Related papers (2023-08-30T19:02:17Z) - A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks [60.38369406877899]
The transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data.
Transformer models excel in handling long dependencies between input sequence elements and enable parallel processing.
Our survey encompasses the identification of the top five application domains for transformer-based models.
arXiv Detail & Related papers (2023-06-11T23:13:51Z) - Investigating the role of model-based learning in exploration and transfer [11.652741003589027]
In this paper, we investigate transfer learning in the context of model-based agents.
We find that a model-based approach outperforms controlled model-free baselines for transfer learning.
Our results show that intrinsic exploration combined with environment models presents a viable direction towards agents that are self-supervised and able to generalize to novel reward functions.
arXiv Detail & Related papers (2023-02-08T11:49:58Z) - Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z) - VisBERT: Hidden-State Visualizations for Transformers [66.86452388524886]
We present VisBERT, a tool for visualizing the contextual token representations within BERT for the task of (multi-hop) Question Answering.
VisBERT enables users to get insights about the model's internal state and to explore its inference steps or potential shortcomings.
arXiv Detail & Related papers (2020-11-09T15:37:43Z) - S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards dynamic models that are capable of simultaneously exploiting both modular and spatiotemporal structures.
We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z) - Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop a deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z)