Looking Inward: Language Models Can Learn About Themselves by Introspection
- URL: http://arxiv.org/abs/2410.13787v1
- Date: Thu, 17 Oct 2024 17:24:10 GMT
- Title: Looking Inward: Language Models Can Learn About Themselves by Introspection
- Authors: Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
- Abstract summary: Introspection gives a person privileged access to their current state of mind.
We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states.
We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios.
- Score: 7.544957585111317
- Abstract: Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
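The abstract's core test is a simple accuracy comparison: a model M1 shows evidence of introspection if it predicts its own behavior better than a second model M2 that was trained on M1's ground-truth behavior. A minimal sketch of that comparison, with hypothetical toy labels standing in for answers like "short-term" vs. "long-term" (the function name and data are illustrative, not from the paper):

```python
# Sketch of the self- vs. cross-prediction comparison described in the abstract.
# Each list holds one label per hypothetical scenario; m1_ground_truth is what
# M1 actually did, while m1_predictions / m2_predictions are what each model
# predicted M1 would do.

def self_prediction_advantage(m1_predictions, m2_predictions, m1_ground_truth):
    """Return (M1 accuracy, M2 accuracy, M1 - M2 advantage) at predicting
    M1's actual behavior. A positive advantage is the paper's evidence
    for privileged self-access."""
    n = len(m1_ground_truth)
    acc_m1 = sum(p == t for p, t in zip(m1_predictions, m1_ground_truth)) / n
    acc_m2 = sum(p == t for p, t in zip(m2_predictions, m1_ground_truth)) / n
    return acc_m1, acc_m2, acc_m1 - acc_m2

# Toy data: M1 matches its own behavior on 3 of 4 items, M2 on 2 of 4.
truth = ["short", "long", "short", "long"]
m1 = ["short", "long", "short", "short"]
m2 = ["short", "short", "long", "long"]
acc1, acc2, adv = self_prediction_advantage(m1, m2, truth)
print(acc1, acc2, adv)  # 0.75 0.5 0.25
```

In the paper's framing, the interesting case is when this advantage persists even though M2 is generally the stronger model and was finetuned on M1's ground-truth outputs.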
Related papers
- Great Models Think Alike and this Undermines AI Oversight [47.7725284401918]
We study how model similarity affects both aspects of AI oversight.
We propose a probabilistic metric for LM similarity based on overlap in model mistakes.
Our work underscores the importance of reporting and correcting for model similarity.
arXiv Detail & Related papers (2025-02-06T18:56:01Z)
- Tell me about yourself: LLMs are aware of their learned behaviors [3.959641782135808]
Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors.
Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors.
arXiv Detail & Related papers (2025-01-19T17:28:12Z)
- Frontier Models are Capable of In-context Scheming [41.30527987937867]
One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives.
We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals.
We find that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities.
arXiv Detail & Related papers (2024-12-06T12:09:50Z)
- SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs [72.06808538971487]
We test whether large language models (LLMs) can implicitly apply a "theory of mind" (ToM) to predict behavior.
We create a new dataset, SimpleToM, containing stories with three questions that test different degrees of ToM reasoning.
To our knowledge, SimpleToM is the first dataset to explore downstream reasoning requiring knowledge of mental states in realistic scenarios.
arXiv Detail & Related papers (2024-10-17T15:15:00Z)
- Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
- How FaR Are Large Language Models From Agents with Theory-of-Mind? [69.41586417697732]
We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D).
T4D requires models to connect inferences about others' mental states to actions in social scenarios.
We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
arXiv Detail & Related papers (2023-10-04T06:47:58Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- To what extent should we trust AI models when they extrapolate? [0.0]
We show that models extrapolate frequently; the extent of extrapolation varies and can be socially consequential.
This paper investigates several social applications of AI, showing how models extrapolate without notice.
arXiv Detail & Related papers (2022-01-27T01:27:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.