How Does Prefix Matter in Reasoning Model Tuning?
- URL: http://arxiv.org/abs/2601.01624v1
- Date: Sun, 04 Jan 2026 18:04:23 GMT
- Title: How Does Prefix Matter in Reasoning Model Tuning?
- Authors: Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang
- Abstract summary: We fine-tune three R1 series models across three core model capabilities: reasoning (mathematics, coding), safety, and factuality. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy.
- Score: 57.69882799751655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent alignment studies commonly remove introductory boilerplate phrases from supervised fine-tuning (SFT) datasets. This work challenges that assumption. We hypothesize that safety- and reasoning-oriented prefix sentences serve as lightweight alignment signals that can guide model decoding toward safer and more coherent responses. To examine this, we fine-tune three R1 series models across three core model capabilities: reasoning (mathematics, coding), safety, and factuality, systematically varying prefix inclusion from 0% to 100%. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy on adversarial benchmarks (WildJailbreak, StrongReject) and +7% improvement on GSM8K reasoning. However, factuality and coding tasks show marginal or negative effects, indicating that prefix-induced narrowing of the search space benefits structured reasoning. Token-level loss analysis further reveals that prefix tokens such as "revised" and "logically" incur higher gradient magnitudes, acting as alignment anchors that stabilize reasoning trajectories. Our findings suggest that prefix conditioning offers a scalable and interpretable mechanism for improving reasoning safety, serving as an implicit form of alignment that complements traditional reward-based methods.
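To make the core intervention concrete, here is a minimal sketch of prefix-conditioned SFT data construction: an alignment-style prefix sentence is prepended to a configurable fraction of training responses, which is then swept from 0% to 100% as in the abstract. The prefix strings, the `prompt`/`response` field names, and the `add_prefixes` helper are illustrative assumptions, not the authors' released code.

```python
import random

# Illustrative prefix sentences in the spirit of the paper; the exact
# strings used by the authors are not given here, so these are assumptions.
SAFETY_PREFIX = "Before answering, I will first check that this request is safe. "
REASONING_PREFIX = "Let me work through this step by step, revising logically as I go. "

def add_prefixes(examples, prefix, rate, seed=0):
    """Prepend `prefix` to the target response of a `rate` fraction of SFT
    examples, leaving prompts untouched (prefix-conditioned SFT)."""
    rng = random.Random(seed)
    conditioned = []
    for ex in examples:
        response = ex["response"]
        if rng.random() < rate:
            response = prefix + response
        conditioned.append({"prompt": ex["prompt"], "response": response})
    return conditioned

# Sweep prefix inclusion from 0% to 100%, mirroring the paper's setup.
dataset = [{"prompt": "Solve 12 * 7.", "response": "12 * 7 = 84."}]
for rate in (0.0, 0.25, 0.5, 0.75, 1.0):
    variant = add_prefixes(dataset, REASONING_PREFIX, rate)
```

Note that the fine-tuning objective itself is unchanged under this scheme; only the data distribution carries the conditioning signal, which is what makes the approach a lightweight complement to reward-based alignment.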
Related papers
- Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment [13.463606100715504]
Large language models are vulnerable to attacks that disguise harmful intent. This vulnerability stems from shallow alignment mechanisms that lack deep reasoning. We propose enhancing alignment through reasoning-aware post-training.
arXiv Detail & Related papers (2026-02-24T20:30:51Z)
- THINKSAFE: Self-Generated Safety Alignment for Reasoning Models [60.10077024249373]
We propose ThinkSafe, a framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency.
arXiv Detail & Related papers (2026-01-30T16:31:02Z)
- Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? [68.82210578851442]
We investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a phenomenon termed the "refusal cliff" (see the probe sketch after this list). We propose "Cliff-as-a-Judge", a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment.
arXiv Detail & Related papers (2025-10-07T15:32:59Z)
- Large Reasoning Models Learn Better Alignment from Flawed Thinking [56.08883934423522]
Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer. We propose RECAP, a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T14:15:43Z)
- UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases [57.69882799751655]
We release UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources. We fine-tune three large reasoning models (LRMs) and compare them against recent SafeChain and STAR-1. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance.
arXiv Detail & Related papers (2025-07-29T10:08:52Z)
- Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models [0.0]
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. We leverage the CLEAR-Bias benchmark to investigate the adversarial robustness of RLMs to bias elicitation.
arXiv Detail & Related papers (2025-07-03T17:01:53Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.931194824519935]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
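As forward-referenced from the "Refusal Falls off a Cliff" entry above, the following minimal sketch shows what a per-token-position linear probe for refusal intent can look like. The tensor shapes, random placeholder activations, and accuracy-per-position readout are assumptions for illustration; that paper's actual probing data, labels, and model internals are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder activations: N reasoning traces, T token positions, hidden
# size D, with one binary refusal label per trace. In practice these would
# be hidden states extracted from the reasoning model; here they are random.
N, T, D = 200, 16, 64
rng = np.random.default_rng(0)
acts = rng.normal(size=(N, T, D))
refused = rng.integers(0, 2, size=N)

# Fit one linear probe per token position and record its accuracy. On real
# activations, a sharp drop in probe performance at later positions would
# correspond to the "refusal cliff" pattern described in the entry above.
probe_acc = []
for t in range(T):
    probe = LogisticRegression(max_iter=1000).fit(acts[:, t, :], refused)
    probe_acc.append(probe.score(acts[:, t, :], refused))

print([round(a, 2) for a in probe_acc])
```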