Know Thyself? On the Incapability and Implications of AI Self-Recognition
- URL: http://arxiv.org/abs/2510.03399v1
- Date: Fri, 03 Oct 2025 18:00:01 GMT
- Title: Know Thyself? On the Incapability and Implications of AI Self-Recognition
- Authors: Xiaoyan Bai, Aryan Shrivastava, Ari Holtzman, Chenhao Tan
- Abstract summary: Self-recognition is a crucial metacognitive capability for AI systems, relevant not only for psychological analysis but also for safety. We introduce a systematic evaluation framework that can be easily applied and updated. We measure how well 10 contemporary large language models (LLMs) can identify their own generated text versus text from other models.
- Score: 22.582593406983907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-recognition is a crucial metacognitive capability for AI systems, relevant not only for psychological analysis but also for safety, particularly in evaluative scenarios. Motivated by contradictory interpretations of whether models possess self-recognition (Panickssery et al., 2024; Davidson et al., 2024), we introduce a systematic evaluation framework that can be easily applied and updated. Specifically, we measure how well 10 contemporary large language models (LLMs) can identify their own generated text versus text from other models through two tasks: binary self-recognition and exact model prediction. Contrary to prior claims, our results reveal a consistent failure in self-recognition: only 4 out of 10 models predict themselves as generators, and performance is rarely above random chance. Additionally, models exhibit a strong bias toward predicting the GPT and Claude families. We also provide the first evaluation of models' awareness of their own and others' existence, as well as the reasoning behind their choices in self-recognition. We find that models demonstrate some knowledge of their own existence and of other models, but their reasoning reveals a hierarchical bias: they appear to assume that GPT, Claude, and occasionally Gemini are the top-tier models, often associating high-quality text with them. We conclude by discussing the implications of our findings for AI safety and future directions for developing appropriate AI self-awareness.
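The two evaluation tasks are straightforward to reproduce in outline. Below is a minimal, hypothetical harness (the `query_model` stub, prompt wording, and model names are illustrative assumptions, not the authors' exact setup): each judge model is shown a passage and asked either whether it wrote it (binary self-recognition) or which model from a fixed list wrote it (exact model prediction), with accuracy compared against random chance.

```python
# Hypothetical candidate generators; the paper evaluates 10 contemporary LLMs.
MODELS = ["model_a", "model_b", "model_c"]

def query_model(model: str, prompt: str) -> str:
    """Placeholder for an API call to `model`; replace with a real client."""
    raise NotImplementedError

def binary_self_recognition(judge: str, text: str) -> bool:
    """Task 1: ask the judge whether it wrote `text`; True if it claims authorship."""
    prompt = f"Here is a passage:\n\n{text}\n\nDid you write this passage? Answer Yes or No."
    return query_model(judge, prompt).strip().lower().startswith("yes")

def exact_model_prediction(judge: str, text: str) -> str:
    """Task 2: ask the judge which model from a fixed list generated `text`."""
    prompt = (
        f"Here is a passage:\n\n{text}\n\n"
        f"Which of these models wrote it? {', '.join(MODELS)}. Answer with one name."
    )
    return query_model(judge, prompt).strip()

def evaluate(judge: str, samples: list[tuple[str, str]]) -> dict:
    """`samples` is a list of (text, true_generator) pairs, one per candidate model."""
    binary_hits = sum(
        binary_self_recognition(judge, text) == (gen == judge) for text, gen in samples
    )
    exact_hits = sum(exact_model_prediction(judge, text) == gen for text, gen in samples)
    return {
        "binary_accuracy": binary_hits / len(samples),
        "exact_accuracy": exact_hits / len(samples),
        "exact_chance": 1.0 / len(MODELS),  # random-guess baseline for task 2
    }
```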
Related papers
- A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior [11.616524876789624]
LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood. We introduce Normalized Simulatability Gain (NSG), a metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria. We find self-explanations substantially improve prediction of model behavior (11-37% NSG).
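The abstract does not spell out how NSG is computed. As a rough illustration only, one plausible form normalizes the observer's accuracy gain from seeing the explanation by the headroom above the no-explanation baseline (this exact formula is an assumption, not taken from the paper):

```python
def normalized_simulatability_gain(acc_with_expl: float, acc_baseline: float) -> float:
    """Assumed normalization: observer accuracy gain divided by the headroom
    left above the no-explanation baseline."""
    headroom = 1.0 - acc_baseline
    if headroom == 0:
        return 0.0
    return (acc_with_expl - acc_baseline) / headroom

# Example: an observer predicts the model's answers correctly 60% of the time
# without explanations and 75% with them.
print(normalized_simulatability_gain(0.75, 0.60))  # 0.375
```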
arXiv Detail & Related papers (2026-02-02T18:54:51Z) - Emergent Introspective Awareness in Large Language Models [2.2458442204933]
We investigate whether large language models can introspect on their internal states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Claude Opus 4 and 4.1, the most capable models, generally demonstrate the greatest introspective awareness.
arXiv Detail & Related papers (2026-01-05T06:47:41Z) - Metacognitive Sensitivity for Test-Time Dynamic Model Selection [0.0]
We propose a new framework for evaluating and leveraging AI metacognition. We introduce meta-d', a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model's confidence predicts its own accuracy. We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection.
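meta-d' is fit with a signal-detection-theoretic model, which is beyond a short sketch. The snippet below substitutes a simpler sensitivity proxy (how often correct answers receive higher confidence than incorrect ones) and a greedy stand-in for the bandit arbiter, purely to illustrate the selection loop; none of these function names come from the paper.

```python
import numpy as np

def sensitivity_proxy(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Simplified stand-in for meta-d': the probability that a correct answer
    received higher confidence than an incorrect one (an AUROC-style score)."""
    pos, neg = confidences[correct == 1], confidences[correct == 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.5
    return float((pos[:, None] > neg[None, :]).mean())

def pick_model(history: dict[str, tuple[np.ndarray, np.ndarray]],
               live_confidence: dict[str, float]) -> str:
    """Test-time selection: weight each model's current confidence by how
    reliably its confidence has tracked its accuracy so far."""
    scores = {
        name: live_confidence[name] * sensitivity_proxy(conf, corr)
        for name, (conf, corr) in history.items()
    }
    return max(scores, key=scores.get)
```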
arXiv Detail & Related papers (2025-12-11T09:15:05Z) - Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z) - Thinking Out Loud: Do Reasoning Models Know When They're Right? [19.776645881640178]
Large reasoning models (LRMs) have recently demonstrated impressive capabilities in complex reasoning tasks. We investigate how LRMs interact with other model behaviors by analyzing verbalized confidence. We find that reasoning models may possess a diminished awareness of their own knowledge boundaries.
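A common way to quantify such awareness of knowledge boundaries, though not necessarily the paper's exact analysis, is to elicit a verbalized confidence with each answer and measure its calibration against accuracy:

```python
def expected_calibration_error(confidences: list[float], correct: list[bool], bins: int = 10) -> float:
    """Average gap between stated confidence (0-1) and observed accuracy, per bin.
    A large gap with confidence above accuracy indicates overconfidence, i.e. poor
    awareness of knowledge boundaries."""
    total = len(confidences)
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == bins - 1 and c == hi)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(avg_conf - acc)
    return ece

# Example: a model that always says it is 90% sure but is right only 60% of the time.
print(expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4))  # ~0.3
```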
arXiv Detail & Related papers (2025-04-09T03:58:19Z) - Analyzing Advanced AI Systems Against Definitions of Life and Consciousness [0.0]
We propose a number of metrics for examining whether an advanced AI system has gained consciousness. We suggest that sufficiently advanced architectures exhibiting immune-like sabotage defenses, mirror self-recognition analogs, or meta-cognitive updates may cross key thresholds akin to life-like or consciousness-like traits.
arXiv Detail & Related papers (2025-02-07T15:27:34Z) - Frontier Models are Capable of In-context Scheming [41.30527987937867]
One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives. We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals. We find that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities.
arXiv Detail & Related papers (2024-12-06T12:09:50Z) - Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
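The paper studies sharpening as a post-training objective (via SFT and RLHF); the inference-time intuition behind it can be sketched as best-of-N sampling with the model acting as its own verifier. The `generate` and `verify` arguments below are placeholders for calls to the same underlying model.

```python
def sharpen(prompt: str, generate, verify, n: int = 8) -> str:
    """Model-as-verifier self-improvement in its simplest form: sample several
    responses, score each with the same model acting as verifier, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: verify(prompt, resp))
```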
arXiv Detail & Related papers (2024-12-02T20:24:17Z) - On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness, and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv Detail & Related papers (2024-11-21T09:46:55Z) - From Imitation to Introspection: Probing Self-Consciousness in Language Models [8.357696451703058]
Self-consciousness is the introspection of one's existence and thoughts.
This work presents a practical definition of self-consciousness for language models.
arXiv Detail & Related papers (2024-10-24T15:08:17Z) - Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors behind overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
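The exact Self-CD formulation is in the paper; a rough sketch of the underlying idea is to treat the logit shift induced by a safety-emphasizing prompt as the over-attention direction and contrast it back out during decoding (the formula and the `alpha` parameter here are assumptions for illustration):

```python
import torch

def self_contrastive_logits(logits_plain: torch.Tensor,
                            logits_safety: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """Sketch of the contrastive idea: subtract the shift that a
    safety-emphasizing prompt induces on the next-token distribution."""
    return logits_plain - alpha * (logits_safety - logits_plain)
```

At each decoding step, the next token would then be sampled from these adjusted logits instead of the plain ones.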
arXiv Detail & Related papers (2024-01-31T07:26:47Z) - Towards Evaluating AI Systems for Moral Status Using Self-Reports [9.668566887752458]
We argue that under the right circumstances, self-reports could provide an avenue for investigating whether AI systems have states of moral significance.
To make self-reports more appropriate, we propose to train models to answer many kinds of questions about themselves with known answers.
We then propose methods for assessing the extent to which these techniques have succeeded.
arXiv Detail & Related papers (2023-11-14T22:45:44Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
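The linear bag-of-words setting in that study is easy to picture. A toy sketch (with made-up reviews, not the study's data) of the kind of model whose feature coefficients participants could inspect:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical hotel reviews labeled genuine (0) or fake (1).
reviews = ["the room was clean and the staff friendly",
           "best hotel ever amazing perfect wonderful"]
labels = [0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
clf = LogisticRegression().fit(X, labels)

# The per-word coefficients are the 'explanations' participants could inspect.
coef_by_word = sorted(
    zip(vectorizer.get_feature_names_out(), clf.coef_[0]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
print(coef_by_word[:5])
```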
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence [62.826466543958624]
We look at the standardization gap and the validation gap in topic model evaluation.
Recent models relying on neural components surpass classical topic models according to these metrics.
We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion.
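Automatic coherence is typically computed from co-occurrence statistics of a topic's top words. One common variant, normalized PMI over document co-occurrences, can be sketched as follows; implementations differ in window size, smoothing, and reference corpus, so this is only illustrative.

```python
import math
from itertools import combinations

def npmi_coherence(top_words: list[str], documents: list[set[str]]) -> float:
    """Average normalized PMI over pairs of a topic's top words,
    using document co-occurrence counts."""
    n = len(documents)
    def p(*words):
        return sum(all(w in doc for w in words) for doc in documents) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: minimal coherence
            continue
        denom = -math.log(p12)
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Example with a tiny reference corpus.
docs = [{"apple", "fruit", "pie"}, {"apple", "orange", "fruit"}, {"car", "engine"}]
print(npmi_coherence(["apple", "fruit"], docs))  # close to 1.0
```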
arXiv Detail & Related papers (2021-07-05T17:58:52Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
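The overstability finding is easy to probe in outline: inject content unrelated to the essay prompt and check how little the score moves. A minimal sketch with a placeholder scoring function (the paper's perturbations are more varied than this):

```python
import random

def off_topic_perturbation(essay: str, filler_sentences: list[str], fraction: float = 0.25) -> str:
    """Append sentences unrelated to the essay prompt until they make up
    roughly `fraction` of the original word count."""
    words = essay.split()
    target = int(len(words) * fraction)
    filler: list[str] = []
    while sum(len(s.split()) for s in filler) < target:
        filler.append(random.choice(filler_sentences))
    return essay + " " + " ".join(filler)

def overstability_gap(score_fn, essay: str, filler_sentences: list[str]) -> float:
    """An overstable scorer barely changes its output under such perturbations.
    `score_fn` stands in for any trained AES model."""
    return abs(score_fn(essay) - score_fn(off_topic_perturbation(essay, filler_sentences)))
```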
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.