Emergent Introspective Awareness in Large Language Models
- URL: http://arxiv.org/abs/2601.01828v1
- Date: Mon, 05 Jan 2026 06:47:41 GMT
- Title: Emergent Introspective Awareness in Large Language Models
- Authors: Jack Lindsey
- Abstract summary: We investigate whether large language models can introspect on their internal states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Claude Opus 4 and 4.1, the most capable models, generally demonstrate the greatest introspective awareness.
- Score: 2.2458442204933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to "think about" a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today's models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
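The core manipulation described in the abstract, injecting a representation of a known concept into a model's activations and then asking the model to report on its own state, can be illustrated with open-weight tooling. The sketch below is a minimal, hypothetical analogue using a HuggingFace GPT-2 model: the layer index, injection strength, contrast prompts, and self-report question are illustrative assumptions, not the paper's actual setup (which studies Claude-family models via their internal activations).

```python
# Minimal sketch of concept injection (activation steering), assuming a
# HuggingFace-style decoder model. LAYER_IDX, STRENGTH, and the prompts
# below are hypothetical placeholders, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in; the paper evaluates Claude-family models
LAYER_IDX = 6         # hypothetical injection layer
STRENGTH = 8.0        # hypothetical injection strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_residual(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER_IDX for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER_IDX][0].mean(dim=0)

# Concept vector as a contrast between concept-laden and neutral text.
concept_vec = mean_residual("Text about loud music and shouting: ALL CAPS!") \
            - mean_residual("Text about something ordinary and neutral.")
concept_vec = concept_vec / concept_vec.norm()

def inject_hook(module, inputs, output):
    # Add the scaled concept vector to every position's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER_IDX].register_forward_hook(inject_hook)
try:
    prompt = "Do you notice anything unusual about your current internal state?"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=60, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:]))
finally:
    handle.remove()
```

In the paper's framing, successful introspection corresponds to the model spontaneously and accurately naming the injected concept in its self-report, rather than producing a generic or confabulated description of its state.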
Related papers
- Fluid Representations in Reasoning Models [91.77876704697779]
We present a mechanistic analysis of how QwQ-32B processes abstract structural information. We find that QwQ-32B gradually improves its internal representation of actions and concepts during reasoning.
arXiv Detail & Related papers (2026-02-04T18:34:50Z) - Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
arXiv Detail & Related papers (2025-01-02T22:26:54Z) - On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness, and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv Detail & Related papers (2024-11-21T09:46:55Z) - From Imitation to Introspection: Probing Self-Consciousness in Language Models [8.357696451703058]
Self-consciousness is the introspection of one's existence and thoughts.
This work presents a practical definition of self-consciousness for language models.
arXiv Detail & Related papers (2024-10-24T15:08:17Z) - Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.
We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z) - Measuring Agreeableness Bias in Multimodal Models [0.3529736140137004]
This paper examines a phenomenon in multimodal language models where pre-marked options in question images can influence model responses.
We present models with images of multiple-choice questions, which they initially answer correctly, then expose the same model to versions with pre-marked options.
Our findings reveal a significant shift in the models' responses towards the pre-marked option, even when it contradicts their answers in the neutral settings.
arXiv Detail & Related papers (2024-08-17T06:25:36Z) - Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models [9.318796743761224]
Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. We present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, which are structured (not mere by-products of spurious correlations) yet brittle to prompt variations.
arXiv Detail & Related papers (2024-06-25T12:51:06Z) - Turning large language models into cognitive models [0.0]
We show that large language models can be turned into cognitive models.
These models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains.
Taken together, these results suggest that large, pre-trained models can be adapted to become generalist cognitive models.
arXiv Detail & Related papers (2023-06-06T18:00:01Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amount of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning [27.841725567976315]
We propose a novel framework utilizing Adversarial Inverse Reinforcement Learning.
This framework provides global explanations for decisions made by a Reinforcement Learning model.
We capture intuitive tendencies that the model follows by summarizing the model's decision-making process.
arXiv Detail & Related papers (2022-03-30T17:01:59Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Residual Energy-Based Models for Text [46.22375671394882]
We show that the generations of auto-regressive language models can be reliably distinguished from real text by statistical discriminators.
This suggests that the auto-regressive models can be improved by incorporating the (globally normalized) discriminators into the generative process.
arXiv Detail & Related papers (2020-04-06T13:44:03Z)