Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
- URL: http://arxiv.org/abs/2406.17513v3
- Date: Mon, 19 May 2025 16:43:13 GMT
- Title: Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
- Authors: Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
- Abstract summary: Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. We present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, which are structured - not mere by-products of spurious correlations - yet brittle to prompt variations.
- Score: 9.318796743761224
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. Understanding these internal mechanisms is critical - not only to move beyond surface-level performance, but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts - using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, which are structured - not mere by-products of spurious correlations - yet brittle to prompt variations. Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.
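As a concrete illustration of what probing belief representations and editing activations can look like in practice, here is a minimal sketch. It is not the authors' implementation: it uses gpt2 as a stand-in model, a handful of toy false-belief stories with made-up labels, an arbitrary layer index, and a simple mean-difference steering vector applied through a forward hook as the activation edit.

```python
# Minimal illustrative sketch (not the paper's code): linearly probe one
# layer's activations for a belief label, then nudge activations along the
# class-mean difference with a forward hook. Model, layer, stories, and
# edit strength are hypothetical stand-ins.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"    # small stand-in; the paper studies a range of LMs
layer, alpha = 6, 4.0  # probed block index and edit strength (arbitrary)

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Toy false-belief stories; label 1 = the protagonist now holds a false
# belief about the object's location, 0 = their belief is still true.
texts = [
    "Anna puts the ball in the box and leaves. Bob moves it to the drawer.",
    "Anna puts the ball in the box and stays. Bob does not touch it.",
    "Tom hides the key in the jar and leaves. Mia moves it to the bag.",
    "Tom hides the key in the jar and watches. Mia does not touch it.",
]
belief_labels = np.array([1, 0, 1, 0])

def last_token_state(text):
    """Hidden state of the final token after transformer block `layer`."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[layer + 1][0, -1].numpy()

# 1) Probing: can a linear readout recover the belief label?
X = np.stack([last_token_state(t) for t in texts])
probe = LogisticRegression(max_iter=1000).fit(X, belief_labels)
print("probe training accuracy:", probe.score(X, belief_labels))

# 2) Editing: shift the last token's activation along the difference of
# class means, one simple way to strengthen a belief representation.
steer = torch.tensor(
    X[belief_labels == 1].mean(0) - X[belief_labels == 0].mean(0),
    dtype=model.dtype,
)

def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, -1, :] += alpha * steer  # in-place edit of the residual stream
    return output

handle = model.transformer.h[layer].register_forward_hook(add_steering)
edited = model.generate(**tok(texts[0], return_tensors="pt"), max_new_tokens=10)
print(tok.decode(edited[0]))
handle.remove()  # detach the hook to restore the unedited model
```

In the paper's setting the probe would be trained and evaluated on proper ToM datasets with control tasks to rule out confounds; the sketch only shows the two moving parts, a linear probe over cached activations and an activation-level intervention.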
Related papers
- Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o, and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z) - Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z) - ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs [15.798087244817134]
We conduct a comprehensive analysis of the impact of various thinking types on model performance.
We introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs.
We have two key findings: (1) smaller models (<30B parameters) can benefit from most structured thinking patterns, while larger models (32B) can instead see performance degrade with structured thinking such as decomposition.
arXiv Detail & Related papers (2025-03-17T08:29:04Z) - Beyond Pattern Recognition: Probing Mental Representations of LMs [9.461066161954077]
Language Models (LMs) have demonstrated impressive capabilities in solving complex reasoning tasks.
We propose to delve deeper into the mental model of various LMs.
arXiv Detail & Related papers (2025-02-23T21:20:28Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
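A rough, generic sketch of how Shapley values can attribute a performance change to individual shifted features appears below; it is not the paper's Explanatory Performance Estimation (which also draws on Optimal Transport), and the data, model, and shift are synthetic assumptions chosen only to make the attribution concrete.

```python
# Generic sketch: Shapley-style attribution of a performance drop to
# individual shifted features. Synthetic data; NOT the paper's method.
import math
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n, d = 2000, 3

# Source data and a classifier trained on it.
X_src = rng.normal(size=(n, d))
y = (X_src[:, 0] + X_src[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X_src, y)

# Target data: only feature 0 has shifted.
X_tgt = X_src.copy()
X_tgt[:, 0] += 2.0

def perf(shifted):
    """Accuracy when features in `shifted` take their target values and
    all other features keep their source values."""
    X = X_src.copy()
    X[:, list(shifted)] = X_tgt[:, list(shifted)]
    return accuracy_score(y, clf.predict(X))

# Exact Shapley values over the (small) feature set: phi[j] is the share of
# the total performance change attributable to the shift in feature j.
phi = np.zeros(d)
for j in range(d):
    rest = [f for f in range(d) if f != j]
    for k in range(d):
        for S in combinations(rest, k):
            w = math.factorial(k) * math.factorial(d - k - 1) / math.factorial(d)
            phi[j] += w * (perf(set(S) | {j}) - perf(set(S)))

print("attribution per feature:", phi.round(3))
print("total change under full shift:", round(perf(range(d)) - perf(set()), 3))
```

Because Shapley values satisfy efficiency, the per-feature attributions sum to the total performance change between the unshifted and fully shifted data.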
arXiv Detail & Related papers (2024-08-24T18:28:19Z) - Estimating Knowledge in Large Language Models Without Generating a Single Token [12.913172023910203]
Current methods to evaluate knowledge in large language models (LLMs) query the model and then evaluate its generated responses.
In this work, we ask whether evaluation can be done before the model has generated any text.
Experiments with a variety of LLMs show that KEEN, a simple probe trained over internal subject representations, succeeds at both tasks.
arXiv Detail & Related papers (2024-06-18T14:45:50Z) - Calibrating Reasoning in Language Models with Internal Consistency [18.24350001344488]
Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks. Yet LLMs often generate text with obvious mistakes and contradictions. In this work, we investigate reasoning in LLMs through the lens of internal representations.
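One generic way to look at reasoning through internal representations is a logit-lens-style consistency check: decode each layer's last-token hidden state through the model's final LayerNorm and unembedding, and see how often intermediate layers already agree with the final prediction. The sketch below is a stand-in illustration of that idea, not necessarily the paper's internal-consistency measure; the model and prompt are arbitrary.

```python
# Generic logit-lens sketch (not necessarily the paper's exact method):
# measure how many layers already predict the model's final next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

prompt = "The capital of France is"
with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"))
final_token = out.logits[0, -1].argmax().item()

# Decode every block's last-token state with the final LayerNorm + LM head.
agreements = []
for h in out.hidden_states[1:]:  # skip the embedding layer
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    agreements.append(int(logits[0].argmax().item() == final_token))

print("per-layer agreement with the final prediction:", agreements)
print("internal consistency:", round(sum(agreements) / len(agreements), 2))
```

Higher agreement across layers suggests the final answer is supported consistently by intermediate representations rather than emerging only at the last layer.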
arXiv Detail & Related papers (2024-05-29T02:44:12Z) - Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge.
We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality.
We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
arXiv Detail & Related papers (2024-02-14T15:52:42Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - N-Critics: Self-Refinement of Large Language Models with Ensemble of
Critics [5.516095889257118]
We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination.
This method involves refining model outputs through an ensemble of critics and the model's own feedback.
arXiv Detail & Related papers (2023-10-28T11:22:22Z) - Understanding the Inner Workings of Language Models Through
Representation Dissimilarity [5.987278280211877]
Representation dissimilarity measures are functions that quantify the extent to which two models' internal representations differ.
Our results suggest that dissimilarity measures are a promising set of tools for shedding light on the inner workings of language models.
arXiv Detail & Related papers (2023-10-23T14:46:20Z) - A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that making reasonable use of phonetic and graphic information is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects their shortcomings.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z) - Turning large language models into cognitive models [0.0]
We show that large language models can be turned into cognitive models.
These models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains.
Taken together, these results suggest that large, pre-trained models can be adapted to become generalist cognitive models.
arXiv Detail & Related papers (2023-06-06T18:00:01Z) - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive
Critiquing [139.77117915309023]
CRITIC allows large language models to validate and amend their own outputs in a manner similar to human interaction with tools.
Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs.
arXiv Detail & Related papers (2023-05-19T15:19:44Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Rethinking Generalization of Neural Models: A Named Entity Recognition
Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.