Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills
- URL: http://arxiv.org/abs/2506.12963v2
- Date: Fri, 10 Oct 2025 19:19:51 GMT
- Title: Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills
- Authors: Changsheng Wang, Chongyu Fan, Yihua Zhang, Jinghan Jia, Dennis Wei, Parikshit Ram, Nathalie Baracaldo, Sijia Liu,
- Abstract summary: Large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation.<n>We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs.<n>We propose Reasoning-aware Representation Misdirection for Unlearning ($R2MU$), a novel method that effectively suppresses sensitive reasoning traces.
- Score: 42.1825027925353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. While these multi-step reasoning capabilities represent a major milestone in language model performance, they also introduce new safety risks. In this work, we present the first systematic study to revisit the problem of machine unlearning in the context of LRMs. Machine unlearning refers to the process of removing the influence of sensitive, harmful, or undesired data or knowledge from a trained model without full retraining. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. In particular, even when final answers are successfully erased, sensitive information often persists within the intermediate reasoning steps, i.e., CoT trajectories. To address this challenge, we extend conventional unlearning and propose Reasoning-aware Representation Misdirection for Unlearning ($R^2MU$), a novel method that effectively suppresses sensitive reasoning traces and prevents the generation of associated final answers, while preserving the model's reasoning ability. Our experiments demonstrate that $R^2MU$ significantly reduces sensitive information leakage within reasoning traces and achieves strong performance across both safety and reasoning benchmarks, evaluated on state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B.
Related papers
- Native Reasoning Models: Training Language Models to Reason on Unverifiable Data [16.065264121785294]
We introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning.<n>NRT reframes the training problem by treating the reasoning process as a latent variable.<n>NRT achieves state-of-the-art performance among verifier-free methods.
arXiv Detail & Related papers (2026-02-12T04:15:46Z) - Identifying and Transferring Reasoning-Critical Neurons: Improving LLM Inference Reliability via Activation Steering [50.63386303357225]
We propose AdaRAS, a lightweight test-time framework that improves reasoning reliability by selectively intervening on neuron activations.<n>AdaRAS identifies Reasoning-Critical Neurons (RCNs) via a polarity-aware mean-difference criterion and adaptively steers their activations during inference.<n> Experiments on 10 mathematics and coding benchmarks demonstrate consistent improvements, including over 13% gains on AIME-24 and AIME-25.
arXiv Detail & Related papers (2026-01-27T17:53:01Z) - Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models [17.184948937224142]
Machine unlearning aims to erase requested data from trained models without full retraining.<n> intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten.<n>We propose R-MUSE, a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces.
arXiv Detail & Related papers (2025-11-26T13:45:52Z) - Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces [2.0789230137053014]
Test-time scaling has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems.<n>We compare the performance of medium-sized LLMs on Math problems after post-training on two kinds of reasoning traces.
arXiv Detail & Related papers (2025-11-24T17:26:58Z) - Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction.<n>We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps.<n>We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
arXiv Detail & Related papers (2025-06-27T09:53:57Z) - Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement [101.77467538102924]
Large reasoning models (LRMs) exhibit overthinking, which hinders efficiency and inflates inference cost.<n>We propose two lightweight methods to enhance LRM efficiency.<n>First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction.<n>Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity.
arXiv Detail & Related papers (2025-06-18T17:18:12Z) - Excessive Reasoning Attack on Reasoning LLMs [26.52688123765127]
In this work, we expose a novel threat: adversarial inputs can be crafted to exploit excessive reasoning behaviors.<n>Our results demonstrate a 3x to 9x increase in reasoning length with comparable utility performance.<n>Our crafted adversarial inputs exhibit transferability, inducing computational overhead in o3-mini, o1-mini, DeepSeek-R1, and QWQ models.
arXiv Detail & Related papers (2025-06-17T10:16:52Z) - Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models [27.142703756752997]
We introduce MathIF, a benchmark for evaluating instruction-following in mathematical reasoning tasks.<n>Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability.<n>We show that even simple interventions can partially recover obedience, though at the cost of reasoning performance.
arXiv Detail & Related papers (2025-05-20T18:18:01Z) - Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.08396797526657]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought.<n>This performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process.<n>We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z) - RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task.<n>We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1.<n>Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z) - Concise Reasoning via Reinforcement Learning [13.657506042120167]
We revisit the core principles of reinforcement learning (RL)<n>We uncover a natural correlation between conciseness and accuracy that has been largely overlooked.<n>We show that introducing a secondary phase of RL training, using a very small set of problems, can significantly reduce chains of thought.
arXiv Detail & Related papers (2025-04-07T15:35:54Z) - ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation [38.64751082999587]
Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy.<n>We propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations.<n>Our study enhances LRMs' factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG)
arXiv Detail & Related papers (2025-03-27T17:44:18Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification.
We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations.
Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system with respect to such perturbations.
arXiv Detail & Related papers (2024-02-07T21:58:40Z) - A Deep Marginal-Contrastive Defense against Adversarial Attacks on 1D
Models [3.9962751777898955]
Deep learning algorithms have been recently targeted by attackers due to their vulnerability.
Non-continuous deep models are still not robust against adversarial attacks.
We propose a novel objective/loss function, which enforces the features to lie under a specified margin to facilitate their prediction.
arXiv Detail & Related papers (2020-12-08T20:51:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.