From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs
- URL: http://arxiv.org/abs/2602.01999v2
- Date: Thu, 05 Feb 2026 15:27:46 GMT
- Title: From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs
- Authors: Yanrui Du, Yibo Gao, Sendong Zhao, Jiayun Li, Haochun Wang, Qika Lin, Kai He, Bing Qin, Mengling Feng
- Abstract summary: R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. Using the logit lens to read out token-level semantics, we uncover a structured progression. Our findings suggest a human-like meta-cognitive process, progressing from latent monitoring, to discourse-level regulation, and finally to overt self-reflection.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer-wise activation trajectory. Using the logit lens to read out token-level semantics, we uncover a structured progression: (i) Latent-control layers, where an approximately linear direction encodes the semantics of thinking budget; (ii) Semantic-pivot layers, where discourse-level cues, including turning-point and summarization cues, surface and dominate the probability mass; and (iii) Behavior-overt layers, where the likelihood of reflection-behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt-level semantics modulate the projection of activations along latent-control directions, thereby inducing competition between turning-point and summarization cues in semantic-pivot layers, which in turn regulates the sampling likelihood of reflection-behavior tokens in behavior-overt layers. Collectively, our findings suggest a human-like meta-cognitive process: progressing from latent monitoring, to discourse-level regulation, and finally to overt self-reflection. Our analysis code can be found at https://github.com/DYR1/S3-CoT.
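The logit-lens readout the abstract describes can be sketched in a few lines: project each layer's residual-stream activation through the model's final norm and unembedding, then inspect the resulting per-layer token distribution. The sketch below uses random stand-ins for the weights and hidden states (toy dimensions; `VOCAB`, `W_U`, and the RMSNorm parameters are illustrative assumptions, not the authors' code or a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in practice these come from the model itself
VOCAB = ["the", "wait", "but", "so", "in", "summary"]  # includes a reflection cue ("wait")
D, L, V = 8, 5, len(VOCAB)       # hidden size, num layers, vocab size (toy values)
W_U = rng.normal(size=(D, V))    # unembedding matrix (stand-in)
gamma = np.ones(D)               # final RMSNorm scale (stand-in)

def rms_norm(h, gamma, eps=1e-6):
    # Final-layer RMSNorm, as used by Llama-family / R1-style models
    return h / np.sqrt(np.mean(h ** 2) + eps) * gamma

def logit_lens(hidden_states, W_U, gamma):
    """Project each layer's residual-stream state through the final norm and
    unembedding, giving a per-layer probability distribution over tokens."""
    out = []
    for h in hidden_states:
        logits = rms_norm(h, gamma) @ W_U
        p = np.exp(logits - logits.max())  # stable softmax
        out.append(p / p.sum())
    return np.stack(out)         # shape: (num_layers, vocab)

hidden = rng.normal(size=(L, D))  # one residual-stream state per layer (stand-in)
traj = logit_lens(hidden, W_U, gamma)

# Trace how the probability of a reflection cue evolves across depth
wait_idx = VOCAB.index("wait")
for layer, p in enumerate(traj[:, wait_idx]):
    print(f"layer {layer}: P('wait') = {p:.3f}")
```

With a real model, `hidden` would be the per-layer hidden states at the token position of interest, and a rising `P('wait')` curve in late layers would correspond to the behavior-overt stage the paper identifies.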
Related papers
- TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training [53.93696896939915]
Training tool-use agents typically relies on Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T10:38:54Z) - No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs [65.783709850324]
This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT) in Large Language Models (LLMs). LLMs are shown to perform latent planning of subsequent reasoning before the CoT emerges, thereby diminishing the significance of explicit CoT. We investigate the latent planning strength of LLMs through our probing method, Tele-Lens, applied to hidden states across diverse task domains.
arXiv Detail & Related papers (2026-02-02T13:46:56Z) - ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval [64.14282916266998]
Composed Image Retrieval aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. We propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline. Experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-02T04:52:54Z) - Hallucination Begins Where Saliency Drops [18.189047289404325]
Hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token. We introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token. Our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution.
arXiv Detail & Related papers (2026-01-28T05:50:52Z) - LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight [1.1119672724275114]
Emotional coordination is a core property of human interaction that shapes how meaning is constructed in real time. We introduce a probabilistic framework that characterizes emotion not as a static label, but as a continuous latent probability distribution. This work establishes a scalable and deployable pathway for understanding interpersonal dynamics, offering a generalizable solution.
arXiv Detail & Related papers (2026-01-07T06:50:41Z) - CBMAS: Cognitive Behavioral Modeling via Activation Steering [5.131778762865578]
Large language models (LLMs) often encode cognitive behaviors unpredictably across prompts, layers, and contexts. We present CBMAS, a diagnostic framework for continuous activation steering.
arXiv Detail & Related papers (2026-01-03T13:04:14Z) - Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process [66.38541693477181]
We propose an unsupervised framework for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level "steps", we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space.
arXiv Detail & Related papers (2025-12-30T05:09:11Z) - SpatialTree: How Spatial Abilities Branch Out in MLLMs [109.32057088014942]
We introduce a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels.
arXiv Detail & Related papers (2025-12-23T18:59:46Z) - Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization [56.083511902353365]
Reinforcement learning (RL) typically applies uniform credit across an entire generation from a large language model. This work positions attention as a privileged substrate that renders the internal logic of LLMs as a mechanistic blueprint of reasoning itself. We introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes.
arXiv Detail & Related papers (2025-10-15T13:49:51Z) - REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering [26.428347164111926]
Inference-time steering aims to alter a large language model's responses without changing its parameters. Existing approaches often rely on simplistic cues or ad hoc generalizations. We introduce REAL, a framework for identifying behavior-relevant modules in Transformer models.
arXiv Detail & Related papers (2025-06-10T02:16:50Z) - Universal Response and Emergence of Induction in LLMs [0.0]
We study the emergence of induction behavior within LLMs by probing their response to weak single-token perturbations of the residual stream.
We find that LLMs exhibit a robust, universal regime in which their response remains scale-invariant under changes in perturbation strength.
Our results provide insights into the collective interplay of components within LLMs and serve as a benchmark for large-scale circuit analysis.
arXiv Detail & Related papers (2024-11-11T15:47:15Z) - Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.