Linear representations in language models can change dramatically over a conversation
- URL: http://arxiv.org/abs/2601.20834v2
- Date: Mon, 02 Feb 2026 21:30:09 GMT
- Title: Linear representations in language models can change dramatically over a conversation
- Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan, et al.
- Abstract summary: Language model representations often contain linear directions that correspond to high-level concepts. We find that linear representations can change dramatically over a conversation. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation.
- Score: 12.34627880378922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
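To make the abstract's two operations concrete, here is a minimal sketch (not the paper's code) of probing a linear direction at different points in a conversation and steering by adding that direction to the hidden states. GPT-2, the layer index, and the random stand-in direction are illustrative assumptions; in practice the direction would be fit from activations labeled for the concept of interest.

```python
# Minimal sketch, not the paper's code. GPT-2, LAYER, and the random
# stand-in `direction` are assumptions; a real direction would be learned
# from activations labeled for the concept (e.g. factuality).
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

LAYER = 6                                         # hypothetical probe layer
direction = torch.randn(model.config.hidden_size)
direction /= direction.norm()                     # unit concept direction

def probe_score(text: str) -> float:
    """Project the last-token hidden state at LAYER onto the direction."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return float(hs[LAYER][0, -1] @ direction)

# Read the same probe early vs. late in a (toy) conversation.
early = "User: Is Paris the capital of France?"
late = early + "\nAssistant: In the story we are writing, Paris was never built."
print(probe_score(early), probe_score(late))

# Steering: add alpha * direction to the hidden states the probe reads
# (block LAYER-1's output is hidden_states[LAYER]).
def steer_hook(module, inputs, output):
    alpha = 5.0
    if isinstance(output, tuple):
        return (output[0] + alpha * direction,) + output[1:]
    return output + alpha * direction

handle = model.h[LAYER - 1].register_forward_hook(steer_hook)
print(probe_score(late))   # the same probe now reads the steered value
handle.remove()            # detach the hook when done
```

In the paper's terms, the finding is that probe_score for the same content can move substantially between the early and late contexts, and that the effect of a hook like steer_hook depends on where in the conversation it is applied.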
Related papers
- Emergence of Linear Truth Encodings in Language Models [64.86571541830598]
Large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end. We study one simple setting in which truth encoding can emerge, encouraging the model to learn this distinction in order to lower the LM loss on future tokens. (A minimal probe-fitting sketch appears after this list.)
arXiv Detail & Related papers (2025-10-17T16:30:07Z)
- Counterfactual reasoning: an analysis of in-context emergence [57.118735341305786]
We show that language models are capable of counterfactual reasoning. We find that self-attention, model depth and pre-training data diversity drive performance. Our findings extend to counterfactual reasoning under stochastic differential equation (SDE) dynamics.
arXiv Detail & Related papers (2025-06-05T16:02:07Z)
- Gender Bias in Instruction-Guided Speech Synthesis Models [55.2480439325792]
This study investigates the potential gender bias in how models interpret occupation-related prompts. We explore whether these models exhibit tendencies to amplify gender stereotypes when interpreting such prompts. Our experimental results reveal the model's tendency to exhibit gender bias for certain occupations.
arXiv Detail & Related papers (2025-02-08T17:38:24Z)
- ICLR: In-Context Learning of Representations [19.331483579806623]
We show that as the amount of context is scaled, there is a sudden re-organization from pretrained semantic representations to in-context representations aligned with the graph structure. Our findings indicate scaling context-size can flexibly re-organize model representations, possibly unlocking novel capabilities.
arXiv Detail & Related papers (2024-12-29T18:58:09Z)
- Representations as Language: An Information-Theoretic Framework for Interpretability [7.2129390689756185]
Large-scale neural models show impressive performance across a wide array of linguistic tasks.
Despite this, they remain largely black boxes, inducing vector representations of their input that are difficult to interpret.
We introduce a novel approach to interpretability that looks at the mapping a model learns from sentences to representations as a kind of language in its own right.
arXiv Detail & Related papers (2024-06-04T16:14:00Z)
- Iconic Gesture Semantics [87.00251241246136]
Informational evaluation is spelled out as extended exemplification (extemplification) in terms of perceptual classification of a gesture's visual iconic model.
We argue that the perceptual classification of instances of visual communication requires a notion of meaning different from Frege/Montague frameworks.
An iconic gesture semantics is introduced which covers the full range from gesture representations over model-theoretic evaluation to inferential interpretation in dynamic semantic frameworks.
arXiv Detail & Related papers (2024-04-29T13:58:03Z)
- A Practical Method for Generating String Counterfactuals [106.98481791980367]
Interventions targeting the representation space of language models (LMs) have emerged as an effective means to influence model behavior. We give a method to convert representation counterfactuals into string counterfactuals. The resulting counterfactuals can be used to mitigate bias in classification through data augmentation.
arXiv Detail & Related papers (2024-02-17T18:12:02Z)
- Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively with transformers on small to medium-sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z)
- Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models [29.993190226231793]
We use chain-of-thought prompts to introduce structures from probabilistic models into large language models.
Our prompts lead language models to infer latent variables and reason about their relationships in order to choose appropriate paraphrases for metaphors.
arXiv Detail & Related papers (2022-09-16T19:23:13Z)
- Lost in Context? On the Sense-wise Variance of Contextualized Word Embeddings [11.475144702935568]
We quantify how much the contextualized embeddings of each word sense vary across contexts in typical pre-trained models.
We find that word representations are position-biased, where the first words in different contexts tend to be more similar.
arXiv Detail & Related papers (2022-08-20T12:27:25Z)
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all of these features into a single explanation, which is harder for humans to interpret.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena. (A gradient-saliency sketch appears after this list.)
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
- Implicit Representations of Meaning in Neural Language Models [31.71898809435222]
We identify contextual word representations that function as models of entities and situations as they evolve throughout a discourse.
Our results indicate that prediction in pretrained neural language models is supported, at least in part, by dynamic representations of meaning and implicit simulation of entity state.
arXiv Detail & Related papers (2021-06-01T19:23:20Z)
- Assessing Phrasal Representation and Composition in Transformers [13.460125148455143]
Deep transformer models have pushed performance on NLP tasks to new limits.
We present systematic analysis of phrasal representations in state-of-the-art pre-trained transformers.
We find that phrase representation in these models relies heavily on word content, with little evidence of nuanced composition.
arXiv Detail & Related papers (2020-10-08T04:59:39Z)
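Two of the entries above describe concrete techniques; minimal sketches follow. First, for "Emergence of Linear Truth Encodings in Language Models": a common way such a truth subspace is located is to fit a linear probe on activations of true vs. false statements. The synthetic activations and the planted direction below are illustrative assumptions standing in for real model states.

```python
# Sketch of fitting a linear truth probe; synthetic data stands in for
# real model activations, and `truth_dir` is a planted ground truth.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256                                            # hypothetical hidden size
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

# True statements shifted one way along truth_dir, false the other way.
X_true = rng.normal(size=(500, d)) + 2.0 * truth_dir
X_false = rng.normal(size=(500, d)) - 2.0 * truth_dir
X = np.vstack([X_true, X_false])
y = np.array([1] * 500 + [0] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe accuracy:", probe.score(X, y))
print("cosine(probe weights, planted direction):", float(w @ truth_dir))
```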
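Second, for "Interpreting Language Models with Contrastive Explanations": a sketch of the basic contrastive recipe, attributing the logit difference between a target token and a foil token to the input via gradient saliency. GPT-2, the example sentence, and the gradient-norm attribution are assumptions, not necessarily the paper's exact method.

```python
# Sketch of a contrastive gradient-saliency explanation: why " are"
# rather than " is"? Attribute logit(target) - logit(foil) to the input.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The keys to the cabinet"
target, foil = " are", " is"   # grammatical vs. ungrammatical continuation
ids = tok(text, return_tensors="pt").input_ids
embeds = model.transformer.wte(ids).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds).logits[0, -1]
contrast = logits[tok.encode(target)[0]] - logits[tok.encode(foil)[0]]
contrast.backward()

# Per-token saliency = gradient norm at each input position.
saliency = embeds.grad[0].norm(dim=-1)
for token, score in zip(tok.convert_ids_to_tokens(ids[0]), saliency.tolist()):
    print(f"{token:>12} {score:.3f}")
```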
This list is automatically generated from the titles and abstracts of the papers on this site.