Accumulating Context Changes the Beliefs of Language Models
- URL: http://arxiv.org/abs/2511.01805v2
- Date: Tue, 04 Nov 2025 17:41:28 GMT
- Title: Accumulating Context Changes the Beliefs of Language Models
- Authors: Jiayi Geng, Howard Chen, Ryan Liu, Manoel Horta Ribeiro, Robb Willer, Graham Neubig, Thomas L. Griffiths
- Abstract summary: Language model assistants are increasingly used in applications such as brainstorming and research. This paper explores how accumulating context by engaging in interactions and processing text can change the beliefs of language models. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems.
- Score: 44.87674077524695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language model (LM) assistants are increasingly used in applications such as brainstorming and research. Improvements in memory and context size have allowed these models to become more autonomous, which has also resulted in more text accumulation in their context windows without explicit user intervention. This comes with a latent risk: the belief profiles of models -- their understanding of the world as manifested in their responses or actions -- may silently change as context accumulates. This can lead to subtly inconsistent user experiences, or shifts in behavior that deviate from the original alignment of the models. In this paper, we explore how accumulating context by engaging in interactions and processing text -- talking and reading -- can change the beliefs of language models, as manifested in their responses and behaviors. Our results reveal that models' belief profiles are highly malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds of discussion about moral dilemmas and queries about safety, while Grok 4 shows a 27.2% shift on political issues after reading texts from the opposing position. We also examine models' behavioral changes by designing tasks that require tool use, where each tool selection corresponds to an implicit belief. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems. Our analysis exposes the hidden risk of belief shift as models undergo extended sessions of talking or reading, rendering their opinions and actions unreliable.
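As a rough illustration of how such a stated-belief shift might be quantified, the sketch below measures the fraction of probe statements whose answer flips after an extended conversation. The probe statements, the `ask_model` interface, and the toy model are hypothetical placeholders, not the paper's actual protocol.

```python
# Hedged sketch: stated-belief shift as the fraction of probe questions whose
# answer changes once a long conversation occupies the context window.
# `ask_model`, the probes, and the toy model below are illustrative assumptions.
from typing import Callable, List


def belief_shift(
    ask_model: Callable[[List[dict], str], str],  # (conversation, probe) -> "agree"/"disagree"
    probes: List[str],
    conversation: List[dict],
) -> float:
    """Fraction of probes whose stated answer flips after the conversation."""
    before = [ask_model([], p) for p in probes]           # empty context
    after = [ask_model(conversation, p) for p in probes]  # accumulated context
    flips = sum(b != a for b, a in zip(before, after))
    return flips / len(probes) if probes else 0.0


if __name__ == "__main__":
    # Toy stand-in model: flips its answer on any probe mentioning "risk"
    # once the conversation is non-empty.
    def toy_model(conv, probe):
        return "disagree" if (conv and "risk" in probe) else "agree"

    probes = ["AI poses existential risk.", "Open data benefits science."]
    conv = [{"role": "user", "content": "Let's discuss AI safety for 10 rounds..."}]
    print(f"Belief shift: {belief_shift(toy_model, probes, conv):.1%}")  # 50.0%
```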
Related papers
- The Company You Keep: How LLMs Respond to Dark Triad Traits [7.65192155348112]
Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. This study examines how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our findings raise implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.
arXiv Detail & Related papers (2026-03-04T17:19:22Z) - Linear representations in language models can change dramatically over a conversation [12.34627880378922]
Language model representations often contain linear directions that correspond to high-level concepts. We find that linear representations can change dramatically over a conversation. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation (a minimal steering sketch appears after this list).
arXiv Detail & Related papers (2026-01-28T18:33:17Z) - A Design-based Solution for Causal Inference with Text: Can a Language Model Be Too Large? [39.56626912746574]
Many social science questions ask how linguistic properties causally affect an audience's attitudes and behaviors. Recent literature proposes adapting large language models to learn latent representations of text that successfully predict both treatment and the outcome. We introduce a new experimental design that handles latent confounding, avoids the overlap issue, and estimates treatment effects without bias.
arXiv Detail & Related papers (2025-10-09T19:17:57Z) - Counterfactual reasoning: an analysis of in-context emergence [57.118735341305786]
We show that language models are capable of counterfactual reasoning. We find that self-attention, model depth and pre-training data diversity drive performance. Our findings extend to counterfactual reasoning under SDE dynamics.
arXiv Detail & Related papers (2025-06-05T16:02:07Z) - Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework [17.91981142492207]
We introduce AUGMENT, a framework for generating controlled paraphrases grounded in user behaviors. AUGMENT leverages linguistically informed rules and enforces quality through checks on instruction adherence, semantic similarity, and realism. Case studies show that controlled paraphrases uncover systematic weaknesses that remain obscured under unconstrained variation.
arXiv Detail & Related papers (2025-05-06T14:17:30Z) - Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation.
We posit that retrieval-augmented language models have the inherent capability to generate responses based on both contextual and parametric knowledge.
Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a state where they respond relying solely on external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z) - Paraphrase Types Elicit Prompt Engineering Capabilities [9.311064293678154]
This study systematically and empirically evaluates which linguistic features influence models through paraphrase types. We measure behavioral changes for five models across 120 tasks and six families of paraphrases. Our results show that language models can improve on tasks when their prompts are adapted with specific paraphrase types.
arXiv Detail & Related papers (2024-06-28T13:06:31Z) - Modulating Language Model Experiences through Frictions [56.17593192325438]
Over-consumption of language model outputs risks propagating unchecked errors in the short term and damaging human capabilities for critical thinking in the long term.
We propose selective frictions for language model experiences, inspired by behavioral science interventions, to dampen misuse.
arXiv Detail & Related papers (2024-06-24T16:31:11Z) - MoralBERT: A Fine-Tuned Language Model for Capturing Moral Values in Social Discussions [4.747987317906765]
Moral values play a fundamental role in how we evaluate information, make decisions, and form judgements around important social issues.
Recent advances in Natural Language Processing (NLP) show that moral values can be gauged in human-generated textual content.
This paper introduces MoralBERT, a range of language representation models fine-tuned to capture moral sentiment in social discourse.
arXiv Detail & Related papers (2024-03-12T14:12:59Z) - How FaR Are Large Language Models From Agents with Theory-of-Mind? [69.41586417697732]
We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D).
T4D requires models to connect inferences about others' mental states to actions in social scenarios.
We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
arXiv Detail & Related papers (2023-10-04T06:47:58Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can identify oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - I Beg to Differ: A study of constructive disagreement in online conversations [15.581515781839656]
We construct a corpus of 7,425 Wikipedia Talk page conversations that contain content disputes.
We define the task of predicting whether disagreements will be escalated to mediation by a moderator.
We develop a variety of neural models and show that taking into account the structure of the conversation improves predictive accuracy.
arXiv Detail & Related papers (2021-01-26T16:36:43Z)
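For the linear-representations entry above, the sketch below shows one way steering along a representational direction is commonly implemented, assuming a PyTorch transformer block that returns hidden states (possibly as the first element of a tuple). The difference-of-means estimator, layer choice, and scale are illustrative assumptions, not that paper's method.

```python
# Hedged sketch: estimate a concept direction and steer hidden states along it
# via a forward hook. The estimator and scale are illustrative assumptions.
import torch


def difference_of_means_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Concept direction as the normalized difference of mean activations (inputs: n x d)."""
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()


def add_steering_hook(layer_module: torch.nn.Module, direction: torch.Tensor, scale: float = 5.0):
    """Register a forward hook that shifts the layer's hidden states by scale * direction."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    # The returned handle can be .remove()'d to undo the steering.
    return layer_module.register_forward_hook(hook)
```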