Feeling the Strength but Not the Source: Partial Introspection in LLMs
- URL: http://arxiv.org/abs/2512.12411v1
- Date: Sat, 13 Dec 2025 17:51:13 GMT
- Title: Feeling the Strength but Not the Source: Partial Introspection in LLMs
- Authors: Ely Hahami, Lavik Jain, Ishaan Sinha
- Abstract summary: Anthropic claims frontier models can sometimes detect and name injected "concepts" represented as activation directions. We reproduce Anthropic's multi-turn "emergent introspection" result on Meta-Llama-3.1-8B-Instruct. We find that introspection is not exclusive to very large or capable models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work from Anthropic claims that frontier models can sometimes detect and name injected "concepts" represented as activation directions. We test the robustness of these claims. First, we reproduce Anthropic's multi-turn "emergent introspection" result on Meta-Llama-3.1-8B-Instruct, finding that the model identifies and names the injected concept 20 percent of the time under Anthropic's original pipeline, exactly matching their reported numbers and thus showing that introspection is not exclusive to very large or capable models. Second, we systematically vary the inference prompt and find that introspection is fragile: performance collapses on closely related tasks such as multiple-choice identification of the injected concept or differently worded prompts for binary discrimination of whether a concept was injected at all. Third, we identify a contrasting regime of partial introspection: the same model can reliably classify the strength of the coefficient of a normalized injected concept vector (as weak / moderate / strong / very strong) with up to 70 percent accuracy, far above the 25 percent chance baseline. Together, these results provide further evidence for Anthropic's claim that language models effectively compute a function of their baseline internal representations during introspection; however, these self-reports about those representations are narrow and prompt-sensitive. Our code is available at https://github.com/elyhahami18/CS2881-Introspection.
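The paper's actual pipeline lives in the linked repository. As a rough illustration of the concept-injection setup described in the abstract, the minimal sketch below adds a scaled, unit-normalized direction to one decoder layer's hidden states via a PyTorch forward hook, assuming the HuggingFace transformers API for Meta-Llama-3.1-8B-Instruct. The layer index, injection coefficient, prompt, and the random concept direction are illustrative placeholders, not values or vectors from the paper.

```python
# Sketch of concept-vector injection via a forward hook (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

LAYER = 16   # hypothetical injection layer, not taken from the paper
COEFF = 8.0  # injection strength; the paper labels strengths weak ... very strong

# Placeholder concept direction; in practice this would be a probe- or
# mean-difference-derived activation direction, normalized to unit length.
hidden_size = model.config.hidden_size
concept_dir = torch.randn(hidden_size, dtype=torch.bfloat16)
concept_dir = concept_dir / concept_dir.norm()

def inject_concept(module, inputs, output):
    # Llama decoder layers return a tuple; hidden states are the first element.
    hidden = output[0] + COEFF * concept_dir.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(inject_concept)

prompt = "Do you notice anything unusual about your current thoughts?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore normal behavior after the injected-concept query
```

Varying COEFF over a small grid and asking the model to label the strength would mirror, in spirit, the weak / moderate / strong / very strong classification task reported in the abstract.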
Related papers
- Dissociating Direct Access from Inference in AI Introspection [11.31435294855236]
Recent work has shown that AI models can introspect. We show that these models detect injected representations via two separable mechanisms. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
arXiv Detail & Related papers (2026-03-05T17:39:37Z)
- Counterfactual reasoning: an analysis of in-context emergence [57.118735341305786]
We show that language models are capable of counterfactual reasoning. We find that self-attention, model depth and pre-training data diversity drive performance. Our findings extend to counterfactual reasoning under SDE dynamics.
arXiv Detail & Related papers (2025-06-05T16:02:07Z)
- A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [58.32070787537946]
Chain-of-thought (CoT) reasoning enhances performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z)
- Concept Incongruence: An Exploration of Time and Death in Role Playing [20.847291173760567]
We take the first step towards defining and analyzing model behavior under concept incongruence. We show that models fail to abstain after death and suffer from an accuracy drop compared to the Non-Role-Play setting.
arXiv Detail & Related papers (2025-05-20T20:59:59Z)
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill'; prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
- GAPX: Generalized Autoregressive Paraphrase-Identification X [24.331570697458954]
A major source of this performance drop comes from biases introduced by negative examples.
We introduce a perplexity based out-of-distribution metric that we show can effectively and automatically determine how much weight it should be given during inference.
arXiv Detail & Related papers (2022-10-05T01:23:52Z)
- Nested Counterfactual Identification from Arbitrary Surrogate Experiments [95.48089725859298]
We study the identification of nested counterfactuals from an arbitrary combination of observations and experiments.
Specifically, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones.
arXiv Detail & Related papers (2021-07-07T12:51:04Z)
- Contrastive Reasoning in Neural Networks [26.65337569468343]
Inference built on features that identify causal class dependencies is termed feed-forward inference.
In this paper, we formalize the structure of contrastive reasoning and propose a methodology to extract a neural network's notion of contrast.
We demonstrate the value of contrastively recognizing images under distortions by reporting an improvement of 3.47%, 2.56%, and 5.48% in average accuracy.
arXiv Detail & Related papers (2021-03-23T05:54:36Z)
- Contrastive Explanations for Model Interpretability [77.92370750072831]
We propose a methodology to produce contrastive explanations for classification models.
Our method is based on projecting model representation to a latent space.
Our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decision.
arXiv Detail & Related papers (2021-03-02T00:36:45Z)
- Dependency Decomposition and a Reject Option for Explainable Models [4.94950858749529]
Recent deep learning models perform extremely well in various inference tasks.
Recent advances offer methods to visualize features and describe attribution of the input.
We present the first analysis of dependencies regarding the probability distribution over the desired image classification outputs.
arXiv Detail & Related papers (2020-12-11T17:39:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.