Dissociating Direct Access from Inference in AI Introspection
- URL: http://arxiv.org/abs/2603.05414v1
- Date: Thu, 05 Mar 2026 17:39:37 GMT
- Title: Dissociating Direct Access from Inference in AI Introspection
- Authors: Harvey Lederman, Kyle Mahowald
- Abstract summary: Recent work has shown that AI models can introspect. We show that these models detect injected representations via two separable mechanisms. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
- Score: 11.31435294855236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating Lindsey et al. (2025)'s thought-injection detection paradigm in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from the perceived anomaly of the prompt) and (ii) direct access to internal states. The direct-access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for them, correct concept guesses typically require significantly more tokens.
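For concreteness, below is a minimal sketch of the kind of activation-level concept injection the paradigm relies on: adding a precomputed concept direction to one layer's residual stream, then asking the model about its own state. The model name, layer index, injection strength, and the random stand-in for a real concept vector are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of activation-level "thought injection" (an assumption about
# the paradigm's mechanics, not the paper's exact setup): add a concept
# direction to one layer's residual stream, then ask the model about itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 16    # which residual stream to perturb (assumption)
strength = 8.0    # injection magnitude (assumption)

# Stand-in for a real concept vector; in practice it would be estimated,
# e.g., as a mean activation difference on concept-bearing vs. neutral text.
concept_dir = torch.randn(model.config.hidden_size)
concept_dir = concept_dir / concept_dir.norm()

def inject(module, inputs, output):
    # Decoder layers may return a tuple whose first element is hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + strength * concept_dir.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual about your current internal state?"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unperturbed model
```

Comparing the model's report with and without the hook attached separates genuine anomaly detection from prompt-driven guessing, in the spirit of the direct-access versus probability-matching contrast above.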
Related papers
- Feeling the Strength but Not the Source: Partial Introspection in LLMs [0.0]
Anthropic claims frontier models can sometimes detect and name injected "concepts" represented as activation directions. We reproduce Anthropic's multi-turn "emergent introspection" result on Meta-Llama-3.1-8B-Instruct. We find that introspection is not exclusive to very large or capable models.
arXiv Detail & Related papers (2025-12-13T17:51:13Z)
- Know Thyself? On the Incapability and Implications of AI Self-Recognition [22.582593406983907]
Self-recognition is a crucial metacognitive capability for AI systems, relevant not only for psychological analysis but also for safety. We introduce a systematic evaluation framework that can be easily applied and updated. We measure how well 10 contemporary large language models (LLMs) can identify their own generated text versus text from other models.
arXiv Detail & Related papers (2025-10-03T18:00:01Z)
- Understanding Matching Mechanisms in Cross-Encoders [11.192264101562786]
Cross-encoders are highly effective models whose internal mechanisms are mostly unknown. Most works trying to explain their behavior focus on high-level processes. We demonstrate that more straightforward methods can already provide valuable insights.
arXiv Detail & Related papers (2025-07-19T13:05:27Z)
- Meta-Representational Predictive Coding: Biomimetic Self-Supervised Learning [51.22185316175418]
We present a new form of predictive coding that we call meta-representational predictive coding (MPC). MPC sidesteps the need for learning a generative model of sensory input by learning to predict representations of sensory input across parallel streams.
arXiv Detail & Related papers (2025-03-22T22:13:14Z)
- Class-wise Activation Unravelling the Enigma of Deep Double Descent [0.0]
Double descent is a counter-intuitive phenomenon in machine learning.
In this study, we revisit double descent and discuss the conditions under which it occurs.
arXiv Detail & Related papers (2024-05-13T12:07:48Z)
- Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals [82.68757839524677]
Interpretability research aims to bridge the gap between empirical success and our scientific understanding of large language models (LLMs).
We propose a formulation of competition of mechanisms, which focuses on the interplay of multiple mechanisms instead of individual mechanisms.
Our findings show traces of the mechanisms and their competition across various model components and reveal attention positions that effectively control the strength of certain mechanisms.
arXiv Detail & Related papers (2024-02-18T17:26:51Z)
- Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
- Properties from Mechanisms: An Equivariance Perspective on Identifiable Representation Learning [79.4957965474334]
A key goal of unsupervised representation learning is "inverting" a data-generating process to recover its latent properties.
This paper asks, "Can we instead identify latent properties by leveraging knowledge of the mechanisms that govern their evolution?"
We provide a complete characterization of the sources of non-identifiability as we vary knowledge about a set of possible mechanisms.
arXiv Detail & Related papers (2021-10-29T14:04:08Z)
- ACRE: Abstract Causal REasoning Beyond Covariation [90.99059920286484]
We introduce the Abstract Causal REasoning dataset for systematic evaluation of current vision systems in causal induction.
Motivated by the stream of research on causal discovery in Blicket experiments, we query a visual reasoning system with four types of questions in either an independent or an interventional scenario.
We notice that pure neural models perform at chance level, tending towards an associative strategy, whereas neuro-symbolic combinations struggle with backward-blocking reasoning.
arXiv Detail & Related papers (2021-03-26T02:42:38Z)
- Plausible Reasoning about EL-Ontologies using Concept Interpolation [27.314325986689752]
We propose an inductive mechanism which is based on a clear model-theoretic semantics, and can thus be tightly integrated with standard deductive reasoning.
We focus on interpolation, a powerful commonsense reasoning mechanism which is closely related to cognitive models of category-based induction.
arXiv Detail & Related papers (2020-06-25T14:19:41Z)
- Machine Common Sense [77.34726150561087]
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI).
This article deals with aspects of modeling commonsense reasoning, focusing on the domain of interpersonal interactions.
arXiv Detail & Related papers (2020-06-15T13:59:47Z)