ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization
- URL: http://arxiv.org/abs/2510.12063v1
- Date: Tue, 14 Oct 2025 02:02:19 GMT
- Title: ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization
- Authors: Sunzhu Li, Zhiyu Lin, Shuling Yang, Jiale Zhao, Wei Chen,
- Abstract summary: Large Reasoning Models (LRMs) are powerful, but they still suffer from inefficient and off-target reasoning. In this paper, we introduce ThinkPilot, a training-free framework that automatically optimizes LRMs' reasoning. It uses an evolutionary process to generate think-prefixes, instructions whose evolution is driven by a taxonomy of reasoning behaviors.
- Score: 8.765548346606218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Reasoning Models (LRMs) are powerful, but they still suffer from inefficient and off-target reasoning. Existing training-free methods are limited to either rigid heuristics or descriptive, non-actionable analyses. In this paper, we introduce ThinkPilot, a training-free framework that automatically optimizes LRMs' reasoning. It uses an evolutionary process to generate think-prefixes: instructions, evolved under the guidance of a taxonomy of reasoning behaviors, that steer models toward superior performance. Extensive experiments demonstrate ThinkPilot's broad effectiveness: it significantly improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (for example, cutting the StrongREJECT score of DeepSeek-R1-Distill-Qwen-32B from 27.0% to 0.7%), and enhances instruction following. It also synergizes with existing training-based methods. Our analysis reveals that think-prefixes can reliably control LRMs' reasoning behaviors, and that different tasks have strong preferences for specific behavioral distributions. By automatically identifying and eliciting these behaviors, ThinkPilot provides a generalizable framework for aligning LRMs' reasoning with task demands. Data and code are available at https://github.com/teqkilla/ThinkPilot
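The mechanism the abstract describes is straightforward to sketch. Below is a minimal, hypothetical Python illustration of think-prefix steering combined with an evolutionary search over prefixes; the seed prefixes, the `generate`/`grade`/`mutate` callables, the length penalty, and all hyperparameters are assumptions for illustration, not ThinkPilot's actual implementation.

```python
import random

# Illustrative seed prefixes; ThinkPilot's real seeds and behavior taxonomy differ.
SEED_PREFIXES = [
    "Let me keep the reasoning concise and avoid restating the problem.",
    "I will verify each intermediate step before moving on.",
    "First decompose the problem, then solve each part in order.",
]

def apply_prefix(question: str, prefix: str) -> str:
    # Inject the prefix at the start of the model's thinking segment,
    # so decoding continues from the steered opening.
    return f"{question}\n<think>\n{prefix}\n"

def fitness(prefix, dev_set, generate, grade):
    # Score a prefix by accuracy with a mild length penalty, a stand-in
    # for the accuracy-length trade-off the paper optimizes.
    outputs = [generate(apply_prefix(q, prefix)) for q, _ in dev_set]
    accuracy = sum(grade(o, a) for o, (_, a) in zip(outputs, dev_set)) / len(dev_set)
    avg_len = sum(len(o) for o in outputs) / len(outputs)
    return accuracy - 1e-4 * avg_len

def evolve_prefixes(dev_set, generate, grade, mutate, generations=5, pool_size=8):
    # Keep the top half of the pool each generation and refill it with
    # mutations; `mutate` would rewrite a prefix toward a target reasoning
    # behavior, e.g. via an LLM prompted with the behavior taxonomy.
    pool = list(SEED_PREFIXES)
    for _ in range(generations):
        pool.sort(key=lambda p: fitness(p, dev_set, generate, grade), reverse=True)
        survivors = pool[: max(1, pool_size // 2)]
        pool = survivors + [mutate(random.choice(survivors))
                            for _ in range(pool_size - len(survivors))]
    return max(pool, key=lambda p: fitness(p, dev_set, generate, grade))
```

In this sketch the behavior taxonomy enters only through `mutate`; the paper's claim is that evolving prefixes under such a taxonomy matches a task's preferred distribution of reasoning behaviors without any training.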
Related papers
- ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration [68.89572566071575]
ET-Agent is a training framework for calibrating an agent's tool-use behavior. It is designed to progressively calibrate erroneous behavioral patterns toward optimal behaviors.
arXiv Detail & Related papers (2026-01-11T11:05:26Z)
- Gold-Switch: Training-Free Superposition of Slow- and Fast-Thinking LLMs [36.84838904299283]
Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking. We propose a superposed deployment strategy with lightweight, training-free regulation that optimizes inference by switching one model on and off.
arXiv Detail & Related papers (2025-10-08T08:17:57Z)
- SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression [15.87106741558898]
Post-training methods incur substantial computational overhead due to auxiliary models and overthinking. We propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework. SSPO uses step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression.
arXiv Detail & Related papers (2025-08-18T04:02:15Z)
- Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner [31.033131727230277]
Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). We propose a novel intrinsic-signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines.
arXiv Detail & Related papers (2025-07-31T07:54:58Z)
- BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit [12.189197763012409]
Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence. In this paper, we identify an unexplored attack vector against LRMs, which we term "overthinking tunables". We propose a novel tunable backdoor that moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model's reasoning verbosity.
arXiv Detail & Related papers (2025-07-24T11:24:35Z)
- ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [75.72672339168092]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM designed to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data (a minimal sketch of blending these two signal levels appears after this list). Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z)
- Enhancing Training Data Attribution with Representational Optimization [57.61977909113113]
Training data attribution (TDA) methods aim to measure how training data impacts a model's predictions. We propose AirRep, a representation-based approach that closes this gap by learning task-specific and model-aligned representations explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence.
arXiv Detail & Related papers (2025-05-24T05:17:53Z)
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [67.87579664988199]
TON is a two-stage training strategy for vision-language models (VLMs). It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z)
- Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.08396797526657]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought. This performance gain comes at the cost of a substantial increase in redundant reasoning during generation. We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking by allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z)
- Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL [36.40577746211243]
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers, even when a question does not require such deliberation. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities. We propose AutoThink, a multi-stage reinforcement learning framework that progressively optimizes reasoning policies.
arXiv Detail & Related papers (2025-05-16T04:01:57Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
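As a rough illustration of the step-level plus trajectory-level supervision described in the ReasonFlux-PRM entry above, here is a hedged Python sketch of one way such signals could be blended into a single reward; the weighting `alpha`, the plain averaging, and all names are assumptions, not the paper's actual formulation.

```python
def combined_reward(step_scores: list[float], trajectory_score: float,
                    alpha: float = 0.5) -> float:
    # Blend per-step PRM scores with a whole-trajectory score; `alpha`
    # and the simple average are illustrative assumptions.
    step_term = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return alpha * step_term + (1.0 - alpha) * trajectory_score

# Example: three graded reasoning steps plus an overall trajectory grade.
print(combined_reward([0.9, 0.6, 0.8], trajectory_score=0.7))  # -> 0.7333...
```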