Reasoning-Finetuning Repurposes Latent Representations in Base Models
- URL: http://arxiv.org/abs/2507.12638v1
- Date: Wed, 16 Jul 2025 21:21:03 GMT
- Title: Reasoning-Finetuning Repurposes Latent Representations in Base Models
- Authors: Jake Ward, Chuqiao Lin, Constantin Venhoff, Neel Nanda
- Abstract summary: Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. We show that the emergence of backtracking is in part driven by a repurposed direction already present in base model activations.
- Score: 1.3286418032136589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. Prior work has succeeded in manipulating this behavior via steering vectors, but the underlying mechanism remains poorly understood. In this work, we show that the emergence of backtracking in DeepSeek-R1-Distill-Llama-8B is in part driven by a repurposed direction already present in base model activations. Specifically, we identify a direction in base Llama-3.1-8B's residual stream which systematically induces backtracking when used to steer the distilled reasoning model, and find that the effects of steering with this direction cannot be trivially explained by token-level attributes. We further find that this direction does not induce backtracking in the base model, suggesting that the reasoning finetuning process repurposes pre-existing representations to form new behavioral circuits. Additionally, we hypothesize that this direction is one of several which may work together to mediate backtracking. Our findings offer a compelling picture that reasoning-finetuned models repurpose pre-existing base model representations, rather than learn new capabilities from scratch.
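As an illustration of the steering setup the abstract describes, below is a minimal sketch using PyTorch and transformers: a forward hook adds a fixed direction to the residual stream of one decoder layer during generation. The layer index, steering coefficient, and the random placeholder direction are assumptions for illustration only; the paper's actual direction is extracted from base Llama-3.1-8B activations.

```python
# Minimal activation-steering sketch: add alpha * direction to the residual
# stream of one decoder layer while the reasoning model generates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # reasoning-finetuned model being steered
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

LAYER = 12   # hypothetical layer at which to intervene
ALPHA = 8.0  # hypothetical steering strength
d_model = model.config.hidden_size

# Placeholder: in practice this would be the direction identified in base
# Llama-3.1-8B's residual stream, not a random vector.
direction = torch.randn(d_model)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # Decoder layers may return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Solve step by step: what is 17 * 24?"
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations run unsteered
```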
Related papers
- Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps. We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
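A hedged sketch of the selection strategy this entry describes: sample several candidate first steps, score each with a reward model, and continue only from the best one. The generator and reward model below are placeholder stubs, not the paper's actual models.

```python
# Best-of-N selection for the *first* reasoning step (toy stand-ins throughout).
import random

def sample_first_step(question: str) -> str:
    # Placeholder for one sampled first reasoning step from an LLM.
    return f"First, restate the problem {question!r} (candidate {random.randint(0, 999)})."

def reward_score(question: str, step: str) -> float:
    # Placeholder for a learned reward model scoring the candidate step.
    return random.random()

def select_first_step(question: str, n_candidates: int = 8) -> str:
    candidates = [sample_first_step(question) for _ in range(n_candidates)]
    # Keep only the highest-reward first step; later steps continue from it.
    return max(candidates, key=lambda step: reward_score(question, step))

print(select_first_step("What is the 10th Fibonacci number?"))
```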
arXiv Detail & Related papers (2025-06-27T09:53:57Z) - Understanding Reasoning in Thinking Language Models via Steering Vectors [9.417134634193074]
We analyze and manipulate specific reasoning behaviors in DeepSeek-R1-Distill models. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner.
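One common recipe for obtaining such a steering vector (an assumption here, since the entry does not state the extraction method) is a difference of mean residual-stream activations between positions that precede the target behavior and ordinary positions. A small sketch with placeholder activation caches:

```python
# Difference-of-means steering vector from cached activations (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
d_model = 4096

# Rows are token positions, columns are residual-stream dimensions at one layer.
acts_with_behavior = rng.normal(size=(500, d_model))     # e.g. positions before backtracking
acts_without_behavior = rng.normal(size=(500, d_model))  # ordinary positions

steering_vector = acts_with_behavior.mean(axis=0) - acts_without_behavior.mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)  # unit-normalise; scale at use time

print(steering_vector.shape)  # (4096,)
```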
arXiv Detail & Related papers (2025-06-22T20:45:26Z) - From Emergence to Control: Probing and Modulating Self-Reflection in Language Models [23.176641726866105]
Self-reflection is a powerful behavior enabled by reinforcement learning with verifiable rewards. We show that self-reflection is not exclusive to fine-tuned models.
arXiv Detail & Related papers (2025-06-13T20:40:13Z) - On Reasoning Strength Planning in Large Reasoning Models [50.61816666920207]
We find evidence that LRMs pre-plan the reasoning strengths in their activations even before generation. We then uncover that LRMs encode this reasoning strength through a pre-allocated directional vector embedded in the activations of the model. Our work provides new insights into the internal mechanisms of reasoning in LRMs and offers practical tools for controlling their reasoning behaviors.
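A hedged sketch of what a "pre-allocated directional vector" could look like operationally: if reasoning strength is linearly encoded before generation, a single direction fit on prompt-level activations should predict chain-of-thought length via projection. Data below are synthetic stand-ins, not the paper's procedure.

```python
# Fit a direction by least squares, read "reasoning strength" as a projection.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompts = 64, 500

acts = rng.normal(size=(n_prompts, d_model))  # last-token activations per prompt
true_direction = rng.normal(size=d_model)
lengths = acts @ true_direction + rng.normal(scale=0.1, size=n_prompts)  # synthetic CoT lengths

direction, *_ = np.linalg.lstsq(acts, lengths, rcond=None)
predicted = acts @ direction
print("in-sample correlation:", np.corrcoef(predicted, lengths)[0, 1])
```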
arXiv Detail & Related papers (2025-06-10T02:55:13Z) - Mitigating Overthinking in Large Reasoning Models via Manifold Steering [32.666911833023526]
Large Reasoning Models (LRMs) exhibit a phenomenon known as overthinking during inference. We propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold. Our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks.
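A minimal sketch of the projection step this summary describes, under the assumption that the low-dimensional activation manifold is approximated by a principal subspace of cached activations; the rank and the synthetic data are placeholders.

```python
# Project a raw steering direction onto a PCA subspace of activations.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_samples, rank = 512, 2000, 32

activations = rng.normal(size=(n_samples, d_model))  # cached residual-stream activations
raw_direction = rng.normal(size=d_model)

# Top-`rank` right singular vectors of the centered activations span the manifold.
centered = activations - activations.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:rank]  # (rank, d_model)

# Orthogonal projection of the steering direction onto that subspace.
projected_direction = basis.T @ (basis @ raw_direction)
projected_direction /= np.linalg.norm(projected_direction)
print(projected_direction.shape)
```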
arXiv Detail & Related papers (2025-05-28T14:39:26Z) - Steering LLM Reasoning Through Bias-Only Adaptation [4.486093197820339]
Reinforcement-learning finetuning does not create new capabilities but strengthens reasoning patterns already latent in the pretrained network. We test this claim by training steering vectors: layer-wise biases that additively amplify selected hidden features. Experiments on four base models across the GSM8K and MATH benchmarks show that steering vectors recover, and in several cases exceed, the accuracy of fully-tuned counterparts.
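A toy sketch of bias-only adaptation as described here: all pretrained weights stay frozen and only one additive bias per block is trained. The two-block MLP stands in for a transformer; in practice the biases would be added to each block's residual stream.

```python
# Bias-only (steering-vector) adaptation on a toy frozen network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasSteeredBlock(nn.Module):
    """A frozen 'pretrained' block plus one trainable additive bias."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)        # stands in for pretrained weights
        self.steer = nn.Parameter(torch.zeros(d_model))  # the only part we will train

    def forward(self, x):
        return F.relu(self.linear(x)) + self.steer

d_model = 64
model = nn.Sequential(BiasSteeredBlock(d_model), BiasSteeredBlock(d_model), nn.Linear(d_model, 1))

# Freeze everything, then re-enable gradients only for the steering biases.
for p in model.parameters():
    p.requires_grad = False
for m in model.modules():
    if isinstance(m, BiasSteeredBlock):
        m.steer.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-2)

x, y = torch.randn(128, d_model), torch.randn(128, 1)
for _ in range(100):
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    opt.step()
print("trainable parameters:", sum(p.numel() for p in trainable))  # 2 * d_model
```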
arXiv Detail & Related papers (2025-05-24T13:55:38Z) - SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z) - The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning [31.8260779160424]
We investigate how popular algorithms perform as the learned dynamics model is improved. We propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem.
arXiv Detail & Related papers (2024-02-19T20:38:00Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Log-linear Guardedness and its Implications [116.87322784046926]
Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful.
This work formally defines the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation.
We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept.
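A small synthetic sketch of the setting studied here: erase a binary concept from representations with a linear projection, then measure how well a log-linear adversary (logistic regression) can still recover it. The single-direction mean-difference eraser and the data are illustrative assumptions, not the paper's construction.

```python
# Linear concept erasure followed by a log-linear adversary check.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64
z = rng.integers(0, 2, size=n)                                  # protected binary concept
X = rng.normal(size=(n, d)) + np.outer(z, rng.normal(size=d))   # concept leaks into X

# Erase along the mean-difference direction between the two concept groups.
direction = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
direction /= np.linalg.norm(direction)
X_erased = X - np.outer(X @ direction, direction)

for name, data in [("before", X), ("after", X_erased)]:
    acc = LogisticRegression(max_iter=1000).fit(data, z).score(data, z)
    print(f"adversary accuracy {name} erasure: {acc:.2f}")
```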
arXiv Detail & Related papers (2022-10-18T17:30:02Z) - Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting [100.75479161884935]
We propose a novel training paradigm called Remembering for the Right Reasons (RRR).
RRR stores visual model explanations for each example in the buffer and ensures the model has "the right reasons" for its predictions.
We demonstrate how RRR can be easily added to any memory or regularization-based approach and results in reduced forgetting.
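A hedged sketch of the explanation-replay idea: store an input-gradient saliency map alongside each buffered example, and when replaying, penalize drift of the current saliency from the stored one. The model, data, and loss weight are toy stand-ins, not the paper's configuration.

```python
# Replay with a "right reasons" penalty on input-gradient explanations.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

def saliency(model, x, y):
    # Gradient of the target-class score w.r.t. the input (kept differentiable).
    x = x.clone().requires_grad_(True)
    score = model(x).gather(1, y[:, None]).sum()
    (grad,) = torch.autograd.grad(score, x, create_graph=True)
    return grad

# Fill the buffer at the end of a task: inputs, labels, and their explanations.
buf_x, buf_y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
buf_expl = saliency(model, buf_x, buf_y).detach()

opt = torch.optim.SGD(model.parameters(), lr=0.1)
lam = 1.0  # assumed weight of the explanation-consistency penalty

for _ in range(20):  # later-task training with replay
    new_x, new_y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
    loss = F.cross_entropy(model(new_x), new_y)
    loss = loss + F.cross_entropy(model(buf_x), buf_y)                            # standard replay
    loss = loss + lam * (saliency(model, buf_x, buf_y) - buf_expl).abs().mean()   # RRR-style term
    opt.zero_grad()
    loss.backward()
    opt.step()
print("done")
```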
arXiv Detail & Related papers (2020-10-04T10:05:27Z)