Related papers: How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

URL: http://arxiv.org/abs/2601.19208v1
Date: Tue, 27 Jan 2026 05:22:34 GMT
Title: How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
Authors: Shawn Im, Changdae Oh, Zhen Fang, Sharon Li
Abstract summary: We analyze how associations emerge from natural language data in attention-based language models. We reveal that each set of weights of a transformer has closed-form expressions as simple compositions of three basis functions.
Score: 17.091330039972274
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Semantic associations such as the link between "bird" and "flew" are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem shines light on interpreting the learned associations in transformers.

Related papers

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models [3.281168543761194]
We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models. Results suggest that TLMs capture form-oriented phenomena well, but show more variable and weaker performance on phenomena at the syntax-semantics interface.
arXiv Detail & Related papers (2026-01-09T16:34:19Z)
Large Language Models as Model Organisms for Human Associative Learning [9.196745903193609]
We adapt a cognitive neuroscience associative learning paradigm and investigate how representations evolve across six models. Our initial findings reveal a non-monotonic pattern consistent with the Non-Monotonic Plasticity Hypothesis. We find that higher vocabulary interference amplifies differentiation, suggesting that representational change is influenced by both item similarity and global competition.
arXiv Detail & Related papers (2025-10-24T12:52:11Z)
Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics [56.145578792496714]
Large language models (LLMs) struggle with cross-lingual knowledge transfer. We study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets.
arXiv Detail & Related papers (2025-08-14T18:44:13Z)
A Markov Categorical Framework for Language Modeling [9.910562011343009]
Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes their representations, and enables complex behaviors, remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.
arXiv Detail & Related papers (2025-07-25T13:14:03Z)
TRACE for Tracking the Emergence of Semantic Representations in Transformers [10.777646083061395]
We introduce TRACE, a diagnostic framework combining geometric, informational, and linguistic signals to detect phase transitions in Transformer-based LMs. Experiments reveal that phase transitions align with clear intersections between curvature collapse and dimension stabilisation; these geometric shifts coincide with emerging syntactic and semantic accuracy. This work advances our understanding of how linguistic abstractions emerge in LMs, offering insights into model interpretability, training efficiency, and compositional generalisation.
arXiv Detail & Related papers (2025-05-23T15:03:51Z)
Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations [34.88156871518115]
Next-token prediction (NTP) optimization leads language models to extract and organize semantic structure from text. We demonstrate that concepts corresponding to larger singular values are learned earlier during training, yielding a natural semantic hierarchy. This insight motivates orthant-based clustering, a method that combines concept signs to identify interpretable semantic categories.
arXiv Detail & Related papers (2025-05-13T08:46:04Z)
Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training. Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking. Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers [49.80959223722325]
We study the distinction between feed-forward and attention layers in large language models. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning.
arXiv Detail & Related papers (2024-06-05T08:51:08Z)
Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers. We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models. Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
Linearity of Relation Decoding in Transformer Language Models [82.47019600662874]
Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation.
arXiv Detail & Related papers (2023-08-17T17:59:19Z)
Oracle Linguistic Graphs Complement a Pretrained Transformer Language Model: A Cross-formalism Comparison [13.31232311913236]
We examine the extent to which, in principle, linguistic graph representations can complement and improve neural language modeling. We find that, overall, semantic constituency structures are most useful to language modeling performance.
arXiv Detail & Related papers (2021-12-15T04:29:02Z)
Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge [59.22170796793179]
Transformers Language Models (TLMs) were tested on a benchmark for the dynamic estimation of thematic fit. Our results show that TLMs can reach performances that are comparable to those achieved by SDM. However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge.
arXiv Detail & Related papers (2021-07-22T20:52:26Z)
Temporal Embeddings and Transformer Models for Narrative Text Understanding [72.88083067388155]
We present two approaches to narrative text understanding for character relationship modelling. The temporal evolution of these relations is described by dynamic word embeddings that are designed to learn semantic changes over time. A supervised learning approach based on the state-of-the-art transformer model BERT is used instead to detect static relations between characters.
arXiv Detail & Related papers (2020-03-19T14:23:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.