Related papers: Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

URL: http://arxiv.org/abs/2512.23988v1
Date: Tue, 30 Dec 2025 05:09:11 GMT
Title: Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Authors: Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Hard, Rajiv Mathews, Lun Wang
Abstract summary: We propose an unsupervised framework for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps', we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space.
Score: 66.38541693477181
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
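The pipeline the abstract describes (split a trace into sentence-level steps, encode step-level activations with a sparse auto-encoder, then intervene along a decoder direction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the regex splitter, the ReLU encoder, the random weights, the random stand-in activations, and the names `split_steps`, `sae_encode`, `sae_decode`, and `steer` are all assumptions made for illustration. Decoder directions are stored here as rows of `W_dec`, which corresponds to the decoder columns in the abstract's convention.

```python
import re
import numpy as np

def split_steps(trace):
    """Split a chain-of-thought trace into sentence-level steps.
    Assumption: a naive split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace) if s.strip()]

def sae_encode(x, W_enc, b_enc):
    """Map activations to non-negative feature activations (ReLU encoder)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Reconstruct activations as a weighted sum of decoder directions."""
    return f @ W_dec + b_dec

def steer(x, W_dec, feature, alpha):
    """Shift activation x by alpha along the unit-norm direction of one feature.
    Positive alpha amplifies the feature's direction, negative alpha suppresses it."""
    direction = W_dec[feature]
    return x + alpha * direction / np.linalg.norm(direction)

# Hypothetical sizes and untrained random weights, for shape checking only.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

steps = split_steps("Let x = 3. Wait, let me check that again. So the answer is 9.")
# Stand-in for real step-level activations (one vector per step).
acts = rng.normal(size=(len(steps), d_model))

features = sae_encode(acts, W_enc, b_enc)       # shape: (n_steps, n_features)
recon = sae_decode(features, W_dec, b_dec)      # shape: (n_steps, d_model)
steered = steer(acts[1], W_dec, feature=5, alpha=4.0)
```

In a real setting the activations would come from a chosen layer of the language model, the SAE weights would be trained with a reconstruction loss plus a sparsity penalty, and the steered vector would be written back into the forward pass. None of those details are specified in the abstract, so they are left out of the sketch.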

Related papers

LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval [74.72139580745511]
LaSER is a novel self-distillation framework that internalizes explicit reasoning into the latent space of retrievers. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
arXiv Detail & Related papers (2026-03-02T04:11:18Z)
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision? [45.11635323173876]
We conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representation in the process. We find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search. Our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses.
arXiv Detail & Related papers (2026-02-25T22:00:59Z)
From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs [48.33546389897804]
R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. Using the logit lens to read out token-level semantics, we uncover a structured progression. Our findings suggest a human-like metacognitive process, progressing from latent monitoring, to discourse-level regulation, and finally to overt self-reflection.
arXiv Detail & Related papers (2026-02-02T11:58:24Z)
Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection [4.514361164656055]
We introduce a taxonomy of ten categories of hidden intentions, organised by intent, mechanism, context, and impact. We systematically assess detection methods, including reasoning and non-reasoning LLM judges. We find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions.
arXiv Detail & Related papers (2026-01-26T14:59:17Z)
ActivationReasoning: Logical Reasoning in Latent Activation Spaces [43.17973499652433]
Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. We introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones.
arXiv Detail & Related papers (2025-10-21T00:21:04Z)
LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LATENTGUARD is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z)
From "Aha Moments" to Controllable Thinking: Toward Meta-Cognitive Reasoning in Large Reasoning Models via Decoupled Reasoning and Control [11.321315058502215]
Large Reasoning Models (LRMs) have demonstrated a latent capacity for complex reasoning by spontaneously exhibiting cognitive behaviors such as step-by-step reasoning, reflection, and backtracking, commonly referred to as "Aha Moments." However, such emergent behaviors remain unregulated and uncontrolled, often resulting in overthinking, where the model continues generating redundant reasoning content even after reaching reliable conclusions. Current models are unable to monitor and adaptively manage their reasoning process to determine when to continue, backtrack, or terminate. We propose the Meta-cognitive Reasoning Framework (MERA), which explicitly decouples the thinking process into distinct
arXiv Detail & Related papers (2025-08-06T13:59:17Z)
A Survey on Latent Reasoning [100.54120559169735]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities. CoT reasoning that verbalizes intermediate steps limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state.
arXiv Detail & Related papers (2025-07-08T17:29:07Z)
On Reasoning Strength Planning in Large Reasoning Models [50.61816666920207]
We find evidence that LRMs pre-plan the reasoning strengths in their activations even before generation. We then uncover that LRMs encode this reasoning strength through a pre-allocated directional vector embedded in the activations of the model. Our work provides new insights into the internal mechanisms of reasoning in LRMs and offers practical tools for controlling their reasoning behaviors.
arXiv Detail & Related papers (2025-06-10T02:55:13Z)
How do Transformers Learn Implicit Reasoning? [67.02072851088637]
We study how implicit multi-hop reasoning emerges by training transformers from scratch in a controlled symbolic environment. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures.
arXiv Detail & Related papers (2025-05-29T17:02:49Z)
Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations [2.759846687681801]
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize those strategies that govern their behavior. This suggests a limited degree of metacognition - the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. We introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify metacognitive abilities of LLMs to report and control their activation patterns.
arXiv Detail & Related papers (2025-05-19T22:32:25Z)
Behaviour Discovery and Attribution for Explainable Reinforcement Learning [6.123880364445758]
Building trust in reinforcement learning (RL) agents requires understanding why they make certain decisions. Existing explainability methods often focus on single states or entire trajectories. We propose a fully offline, reward-free framework for behavior discovery and segmentation.
arXiv Detail & Related papers (2025-03-19T08:06:00Z)
Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space. This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which poses a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, it fails to provide the critical supervision signals needed to learn discriminative representations for segmentation tasks.
arXiv Detail & Related papers (2021-05-23T01:50:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.