Looking Inward: Language Models Can Learn About Themselves by Introspection
- URL: http://arxiv.org/abs/2410.13787v1
- Date: Thu, 17 Oct 2024 17:24:10 GMT
- Title: Looking Inward: Language Models Can Learn About Themselves by Introspection
- Authors: Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
- Abstract summary: Introspection gives a person privileged access to their current state of mind.
We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states.
We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios.
- Score: 7.544957585111317
- Abstract: Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
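The abstract's core test is a simple accuracy comparison: a model M1 shows evidence of introspection if it predicts its own behavior better than a second model M2 that was trained on M1's ground-truth behavior. A minimal sketch of that comparison, with hypothetical toy labels standing in for answers like "short-term" vs. "long-term" (the function name and data are illustrative, not from the paper):

```python
# Sketch of the self- vs. cross-prediction comparison described in the abstract.
# Each list holds one label per hypothetical scenario; m1_ground_truth is what
# M1 actually did, while m1_predictions / m2_predictions are what each model
# predicted M1 would do.

def self_prediction_advantage(m1_predictions, m2_predictions, m1_ground_truth):
    """Return (M1 accuracy, M2 accuracy, M1 - M2 advantage) at predicting
    M1's actual behavior. A positive advantage is the paper's evidence
    for privileged self-access."""
    n = len(m1_ground_truth)
    acc_m1 = sum(p == t for p, t in zip(m1_predictions, m1_ground_truth)) / n
    acc_m2 = sum(p == t for p, t in zip(m2_predictions, m1_ground_truth)) / n
    return acc_m1, acc_m2, acc_m1 - acc_m2

# Toy data: M1 matches its own behavior on 3 of 4 items, M2 on 2 of 4.
truth = ["short", "long", "short", "long"]
m1 = ["short", "long", "short", "short"]
m2 = ["short", "short", "long", "long"]
acc1, acc2, adv = self_prediction_advantage(m1, m2, truth)
print(acc1, acc2, adv)  # 0.75 0.5 0.25
```

In the paper's framing, the interesting case is when this advantage persists even though M2 is generally the stronger model and was finetuned on M1's ground-truth outputs.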
Related papers
- Great Models Think Alike and this Undermines AI Oversight [47.7725284401918]
We study how model similarity affects both aspects of AI oversight.
We propose a probabilistic metric for LM similarity based on overlap in model mistakes.
Our work underscores the importance of reporting and correcting for model similarity.
arXiv Detail & Related papers (2025-02-06T18:56:01Z)
- Tell me about yourself: LLMs are aware of their learned behaviors [3.959641782135808]
Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors.
Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors.
arXiv Detail & Related papers (2025-01-19T17:28:12Z)
- Frontier Models are Capable of In-context Scheming [41.30527987937867]
One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives.
We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals.
We find that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities.
arXiv Detail & Related papers (2024-12-06T12:09:50Z)
- SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs [72.06808538971487]
We test whether large language models (LLMs) can implicitly apply a "theory of mind" (ToM) to predict behavior.
We create a new dataset, SimpleToM, containing stories with three questions that test different degrees of ToM reasoning.
To our knowledge, SimpleToM is the first dataset to explore downstream reasoning requiring knowledge of mental states in realistic scenarios.
arXiv Detail & Related papers (2024-10-17T15:15:00Z)
- Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
- How FaR Are Large Language Models From Agents with Theory-of-Mind? [69.41586417697732]
We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D).
T4D requires models to connect inferences about others' mental states to actions in social scenarios.
We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
arXiv Detail & Related papers (2023-10-04T06:47:58Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- To what extent should we trust AI models when they extrapolate? [0.0]
We show that models extrapolate frequently; the extent of extrapolation varies and can be socially consequential.
This paper investigates several social applications of AI, showing how models extrapolate without notice.
arXiv Detail & Related papers (2022-01-27T01:27:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.