Understanding Reasoning in Thinking Language Models via Steering Vectors
- URL: http://arxiv.org/abs/2506.18167v3
- Date: Thu, 17 Jul 2025 23:27:34 GMT
- Title: Understanding Reasoning in Thinking Language Models via Steering Vectors
- Authors: Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
- Abstract summary: We analyze and manipulate specific reasoning behaviors in DeepSeek-R1-Distill models. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner.
- Score: 9.417134634193074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model's reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using three DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.
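A common recipe for this kind of intervention is to compute a direction as the difference of mean residual-stream activations between prompts that do and do not exhibit a behavior (e.g., backtracking), then add a scaled copy of that direction back into the residual stream during generation. The sketch below illustrates that recipe under stated assumptions: the model ID is one of the DeepSeek-R1-Distill checkpoints the paper evaluates, but the layer index, steering scale, contrast prompts, and the difference-of-means extraction itself are illustrative placeholders, not the authors' exact pipeline.

```python
# Minimal sketch of difference-of-means activation steering.
# Assumptions: LAYER, SCALE, and the two tiny contrast sets are illustrative;
# the paper's own extraction and evaluation procedure may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # one of the evaluated models
LAYER = 12   # assumed intervention layer
SCALE = 4.0  # assumed steering strength

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
model.eval()

def mean_residual(prompts, layer):
    """Average the residual-stream activation at the last token over a prompt set."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])  # shape: [hidden_dim]
    return torch.stack(acts).mean(dim=0)

# Illustrative contrast sets: text that shows backtracking vs. text that does not.
backtracking = ["Wait, that can't be right, let me reconsider the previous step."]
baseline = ["The answer follows directly from the first equation."]

steering_vec = mean_residual(backtracking, LAYER) - mean_residual(baseline, LAYER)
steering_vec = steering_vec / steering_vec.norm()

def add_steering(module, inputs, output):
    """Add the scaled steering direction to this layer's residual-stream output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steering_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
ids = tok("Solve: 17 * 24 = ?", return_tensors="pt")
with torch.no_grad():
    steered = model.generate(**ids, max_new_tokens=64)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))
```

Negating or rescaling SCALE would suppress rather than amplify the targeted behavior, which mirrors the kind of controlled, bidirectional modulation the abstract describes.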
Related papers
- Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models [49.598776427454176]
Large Reasoning Models (LRMs) have gradually become a research hotspot due to their outstanding performance in handling complex tasks. However, with the widespread application of these models, the problem of overthinking has gradually emerged. Various efficient reasoning methods have been proposed, aiming to reduce the length of reasoning paths without compromising model performance and reasoning capability.
arXiv Detail & Related papers (2025-08-04T06:54:31Z)
- Reasoning-Finetuning Repurposes Latent Representations in Base Models [1.3286418032136589]
Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. We show that the emergence of backtracking is in part driven by a repurposed direction already present in base model activations.
arXiv Detail & Related papers (2025-07-16T21:21:03Z)
- Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.08396797526657]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought. This performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process. We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking by allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z)
- ExpertSteer: Intervening in LLMs through Expert Knowledge [71.12193680015622]
Activation steering offers a promising method to control the generation process of Large Language Models. We propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains.
arXiv Detail & Related papers (2025-05-18T08:55:46Z)
- Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z)
- The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think [81.38614558541772]
We introduce the CoT Encyclopedia, a framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs. We show that this framework produces more interpretable and comprehensive analyses than existing methods.
arXiv Detail & Related papers (2025-05-15T11:31:02Z)
- Improving Reasoning Performance in Large Language Models via Representation Engineering [2.0099933815960256]
We propose a representation engineering approach for large language models (LLMs). Model activations are read from the residual stream of an LLM when processing a reasoning task. We show that an LLM can, to a certain degree, be controlled to improve its perceived reasoning ability by modulating activations.
arXiv Detail & Related papers (2025-04-28T04:58:43Z)
- Towards Understanding Distilled Reasoning Models: A Representational Approach [6.563993791037387]
We train a crosscoder on Qwen-series models and their fine-tuned variants. Our results suggest that the crosscoder learns features corresponding to various types of reasoning, including self-reflection and verification.
arXiv Detail & Related papers (2025-03-05T18:40:19Z)
- A Tutorial on LLM Reasoning: Relevant Methods behind ChatGPT o1 [6.527607790666018]
OpenAI o1 has shown that applying reinforcement learning to integrate reasoning steps directly during inference can significantly improve a model's reasoning capabilities. We present a comprehensive formulation of reasoning problems and investigate the use of both model-based and model-free approaches to better support this slow-thinking framework.
arXiv Detail & Related papers (2025-02-15T17:52:11Z)
- Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning [9.795934690403374]
It is still unclear which multi-step reasoning mechanisms language models use to solve such tasks. We employ circuit analysis and self-influence functions to evaluate the changing importance of each token throughout the reasoning process. We demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
arXiv Detail & Related papers (2025-02-13T07:19:05Z)
- Self-supervised Analogical Learning using Language Models [59.64260218737556]
We propose SAL, a self-supervised analogical learning framework. SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions. We show that the resulting models outperform base language models on a wide range of reasoning benchmarks.
arXiv Detail & Related papers (2025-02-03T02:31:26Z)
- Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion. Our findings show that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z)
- The Buffer Mechanism for Multi-Step Information Reasoning in Language Models [52.77133661679439]
Investigating internal reasoning mechanisms of large language models can help us design better model architectures and training strategies.
In this study, we constructed a symbolic dataset to investigate the mechanisms by which Transformer models employ vertical thinking strategy.
We proposed a random matrix-based algorithm to enhance the model's reasoning ability, resulting in a 75% reduction in the training time required for the GPT-2 model.
arXiv Detail & Related papers (2024-05-24T07:41:26Z)