Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations
- URL: http://arxiv.org/abs/2505.13763v2
- Date: Fri, 24 Oct 2025 02:36:51 GMT
- Title: Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations
- Authors: Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, Marcus K. Benna
- Abstract summary: Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize the strategies that govern their behavior.
This suggests a limited degree of metacognition - the capacity to monitor one's own cognitive processes for subsequent reporting and self-control.
We introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify the metacognitive abilities of LLMs to report and control their activation patterns.
- Score: 2.759846687681801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize the strategies that govern their behavior. This suggests a limited degree of metacognition - the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs' capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., safety detectors). Given society's increasing reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify the metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that these abilities depend on several factors: the number of in-context examples provided, the semantic interpretability of the neural activation direction (to be reported/controlled), and the variance explained by that direction. These directions span a "metacognitive space" with dimensionality much lower than that of the model's neural space, suggesting LLMs can monitor only a small subset of their neural activations. Our paradigm provides empirical evidence to quantify metacognition in LLMs, with significant implications for AI safety (e.g., adversarial attacks and defenses).
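To make the paradigm concrete, below is a minimal sketch of a single "reporting" trial: project each sentence's residual-stream activation onto a target direction, label in-context examples as high/low along that axis, and ask the model to label a held-out sentence. The model choice (GPT-2 as a small stand-in), the layer, the random placeholder direction, and the prompt format are all illustrative assumptions, not the paper's protocol.
```python
# Sketch of a neurofeedback "reporting" trial; the model, layer, direction,
# and prompt format are our assumptions, not the paper's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # small stand-in; the paper studies larger LLMs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

def activation_projection(text: str, direction: torch.Tensor, layer: int = 6) -> float:
    """Project the last-token residual-stream activation onto a unit direction."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer][0, -1]  # (d_model,)
    return torch.dot(hidden, direction / direction.norm()).item()

def build_reporting_prompt(examples, query):
    """In-context examples of sentence -> high/low activation, then a query."""
    lines = [f'Sentence: "{s}" -> Activation: {label}' for s, label in examples]
    lines.append(f'Sentence: "{query}" -> Activation:')
    return "\n".join(lines)

sentences = ["The cat sat quietly.", "The market crashed violently.",
             "A gentle rain fell.", "The crowd erupted in anger."]
direction = torch.randn(model.config.hidden_size)  # placeholder; the paper uses interpretable/PCA directions
projs = [activation_projection(s, direction) for s in sentences]
median = sorted(projs)[len(projs) // 2]
labeled = [(s, "high" if p > median else "low") for s, p in zip(sentences, projs)]
print(build_reporting_prompt(labeled[:-1], sentences[-1]))
```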
Related papers
- A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities [23.297279975389188]
Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks.
We introduce the NeuroCognition benchmark, grounded in three adapted neuropsychological tests.
Our evaluation reveals that while models perform strongly on text, their performance degrades for images and with increased complexity.
arXiv Detail & Related papers (2026-03-03T02:54:58Z)
- Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities.
In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench.
We will open-source CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z)
- UniCog: Uncovering Cognitive Abilities of LLMs through Latent Mind Space Analysis [69.50752734049985]
A growing body of research suggests that the cognitive processes of large language models (LLMs) differ fundamentally from those of humans.
We propose UniCog, a unified framework that analyzes LLM cognition via a latent mind space.
arXiv Detail & Related papers (2026-01-25T16:19:00Z)
- Identifying Good and Bad Neurons for Task-Level Controllable LLMs [43.20582224913806]
Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks.
The complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs.
We propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification.
arXiv Detail & Related papers (2026-01-08T03:24:18Z)
- Cognitive Foundations for Reasoning and Their Manifestation in LLMs [63.12951576410617]
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning.
We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations.
We develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems.
arXiv Detail & Related papers (2025-11-20T18:59:00Z)
- Evidence for Limited Metacognition in LLMs [2.538209532048867]
We introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs.
Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states.
arXiv Detail & Related papers (2025-09-25T20:30:15Z)
- Why are LLMs' abilities emergent? [0.0]
I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena.
This perspective shifts the focus to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual definitions.
arXiv Detail & Related papers (2025-08-06T12:43:04Z)
- Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization [17.101290138120564]
Current methods rely on dictionary learning with sparse autoencoders (SAEs).
Here, we tackle these limitations by directly decomposing activations with semi-nonnegative matrix factorization (SNMF).
Experiments on Llama 3.1, Gemma 2, and GPT-2 show that SNMF-derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering.
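As a rough illustration of the technique this summary names, here is a toy semi-NMF on synthetic stand-in activations. It follows the classic multiplicative-update scheme of Ding et al. (2010) rather than the authors' code, and every name in it is our own.
```python
# Toy semi-nonnegative matrix factorization: X ~= F @ G.T with G >= 0.
# Illustrative sketch only; the data is a stand-in for collected MLP activations.
import numpy as np

def _pos(A): return (np.abs(A) + A) / 2.0
def _neg(A): return (np.abs(A) - A) / 2.0

def semi_nmf(X, k, iters=200, eps=1e-9):
    """Factor X (n_samples x d_neurons); F is unconstrained, G is nonnegative."""
    rng = np.random.default_rng(0)
    G = np.abs(rng.standard_normal((X.shape[1], k)))
    for _ in range(iters):
        F = X @ G @ np.linalg.pinv(G.T @ G)          # least-squares F given G
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((_pos(XtF) + G @ _neg(FtF)) /   # nonnegativity-preserving update
                     (_neg(XtF) + G @ _pos(FtF) + eps))
    return F, G

X = np.random.default_rng(1).standard_normal((512, 64))  # stand-in activations
F, G = semi_nmf(X, k=8)
print(np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))   # relative reconstruction error
```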
arXiv Detail & Related papers (2025-06-12T17:33:29Z)
- Concept-Guided Interpretability via Neural Chunking [54.73787666584143]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data.
We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality.
Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z)
- Meta-Representational Predictive Coding: Biomimetic Self-Supervised Learning [51.22185316175418]
We present a new form of predictive coding that we call meta-representational predictive coding (MPC).
MPC sidesteps the need for learning a generative model of sensory input by learning to predict representations of sensory input across parallel streams.
arXiv Detail & Related papers (2025-03-22T22:13:14Z)
- Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution [16.460751105639623]
We introduce NeuronLens, a novel range-based interpretation and manipulation framework.
It provides a finer view of neuron activation distributions to localize concept attribution within a neuron.
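One hedged reading of the range-based idea: rather than attributing a concept to a neuron as a whole, find the activation interval in which concept inputs dominate. The function, binning scheme, and precision threshold below are our assumptions, not the NeuronLens implementation.
```python
# Sketch of range-based neuron attribution on synthetic activations.
import numpy as np

def concept_range(acts_concept, acts_other, n_bins=20, min_precision=0.8):
    """Return (lo, hi) activation bounds where concept inputs dominate a neuron."""
    all_acts = np.concatenate([acts_concept, acts_other])
    edges = np.linspace(all_acts.min(), all_acts.max(), n_bins + 1)
    c_hist, _ = np.histogram(acts_concept, bins=edges)
    o_hist, _ = np.histogram(acts_other, bins=edges)
    precision = c_hist / np.maximum(c_hist + o_hist, 1)  # concept purity per bin
    idx = np.flatnonzero(precision >= min_precision)
    if idx.size == 0:
        return None
    # simplification: span from the first to the last qualifying bin
    return edges[idx[0]], edges[idx[-1] + 1]

rng = np.random.default_rng(1)
on = rng.normal(2.0, 0.5, 300)     # neuron activations on concept inputs
off = rng.normal(0.0, 1.0, 3000)   # activations on everything else
print(concept_range(on, off))
```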
arXiv Detail & Related papers (2025-02-04T03:33:55Z)
- Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models [0.0]
Recent advances in Large Language Models have incorporated planning and reasoning capabilities.
This has reduced errors in mathematical and logical tasks while improving accuracy.
Our study examines DeepSeek R1, a model trained to output reasoning tokens similar to OpenAI's o1.
arXiv Detail & Related papers (2025-01-27T21:26:37Z)
- Metacognition for Unknown Situations and Environments (MUSE) [3.2020845462590697]
We propose the Metacognition for Unknown Situations and Environments (MUSE) framework.
MUSE integrates metacognitive processes--specifically self-awareness and self-regulation--into autonomous agents.
Agents show significant improvements in self-awareness and self-regulation.
arXiv Detail & Related papers (2024-11-20T18:41:03Z)
- Brain-like Functional Organization within Large Language Models [58.93629121400745]
The human brain has long inspired the pursuit of artificial intelligence (AI).
Recent neuroimaging studies provide compelling evidence of alignment between the computational representation of artificial neural networks (ANNs) and the neural responses of the human brain to stimuli.
In this study, we bridge this gap by directly coupling sub-groups of artificial neurons with functional brain networks (FBNs).
This framework links the artificial-neuron sub-groups to FBNs, enabling the delineation of brain-like functional organization within large language models (LLMs).
arXiv Detail & Related papers (2024-10-25T13:15:17Z)
- Self-Attention Limits Working Memory Capacity of Transformer-Based Models [0.46040036610482665]
Recent work on Transformer-based large language models (LLMs) has revealed striking limits in their working memory capacity.
Specifically, these models' performance drops significantly on N-back tasks as N increases.
Inspired by the executive attention theory from behavioral sciences, we hypothesize that the self-attention mechanism might be responsible for their working memory capacity limits.
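For reference, the N-back task this summary mentions is easy to instantiate; here is a minimal generator and scorer (the letter alphabet and match/no-match format are illustrative choices, not the paper's benchmark).
```python
# Minimal N-back task: generate a letter stream and score match/no-match answers.
import random

def make_nback_trials(n: int, length: int = 20, seed: int = 0):
    rng = random.Random(seed)
    letters = [rng.choice("ABCDEFGH") for _ in range(length)]
    targets = ["match" if i >= n and letters[i] == letters[i - n] else "no-match"
               for i in range(length)]
    return letters, targets

def score(predictions, targets, n):
    scored = list(zip(predictions, targets))[n:]  # first n positions are undefined
    return sum(p == t for p, t in scored) / len(scored)

letters, targets = make_nback_trials(n=2)
print(" ".join(letters))
print(score(targets, targets, n=2))  # sanity check: a perfect responder scores 1.0
```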
arXiv Detail & Related papers (2024-09-16T20:38:35Z)
- Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making [51.737762570776006]
LLM-ACTR is a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making.
Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations.
Our experiments on novel Design for Manufacturing tasks show both improved task performance as well as improved grounded decision-making capability.
arXiv Detail & Related papers (2024-08-17T11:49:53Z)
- Exploring the LLM Journey from Cognition to Expression with Linear Representations [10.92882688742428]
This paper presents an in-depth examination of the evolution and interplay of cognitive and expressive capabilities in large language models (LLMs).
We define and explore the model's cognitive and expressive capabilities through linear representations across three critical phases: Pretraining, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF).
Our findings unveil a sequential development pattern, where cognitive abilities are largely established during Pretraining, whereas expressive abilities predominantly advance during SFT and RLHF.
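A compact sketch of what probing with linear representations typically looks like: fit a linear readout on hidden states and compare its accuracy across training phases. The data below is synthetic and checkpoint loading is assumed away; the probe setup is a generic stand-in, not the authors' exact method.
```python
# Generic linear-probe sketch on synthetic stand-in hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy of a linear probe on (n_examples x d_model) activations."""
    X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels,
                                              test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# e.g. run the same probe on states from each training phase (stand-in data):
rng = np.random.default_rng(0)
for phase in ["pretrain", "sft", "rlhf"]:
    H = rng.standard_normal((500, 128))
    y = (H[:, 0] + 0.5 * rng.standard_normal(500) > 0).astype(int)
    print(phase, probe_accuracy(H, y))
```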
arXiv Detail & Related papers (2024-05-27T08:57:04Z)
- Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z)
- Violation of Expectation via Metacognitive Prompting Reduces Theory of Mind Prediction Error in Large Language Models [0.0]
Large Language Models (LLMs) exhibit a compelling level of proficiency in Theory of Mind (ToM) tasks.
This ability to impute unobservable mental states to others is vital to human social cognition and may prove equally important in principal-agent relations between humans and Artificial Intelligences (AIs).
arXiv Detail & Related papers (2023-10-10T20:05:13Z)
- Probing Large Language Models from A Human Behavioral Perspective [24.109080140701188]
Large Language Models (LLMs) have emerged as dominant foundational models in modern NLP.
The understanding of their prediction processes and internal mechanisms, such as feed-forward networks (FFN) and multi-head self-attention (MHSA), remains largely unexplored.
arXiv Detail & Related papers (2023-10-08T16:16:21Z)
- Backprop-Free Reinforcement Learning with Active Neural Generative Coding [84.11376568625353]
We propose a computational framework for learning action-driven generative models without backpropagation of errors (backprop) in dynamic environments.
We develop an intelligent agent that operates even with sparse rewards, drawing inspiration from the cognitive theory of planning as inference.
The robust performance of our agent offers promising evidence that a backprop-free approach for neural inference and learning can drive goal-directed behavior.
arXiv Detail & Related papers (2021-07-10T19:02:27Z)