STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models
- URL: http://arxiv.org/abs/2601.09281v1
- Date: Wed, 14 Jan 2026 08:35:23 GMT
- Title: STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models
- Authors: Jingjing Zhou, Gaoxiang Cong, Li Su, Liang Li,
- Abstract summary: We propose a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process. Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss.
- Score: 12.133996629992318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Reasoning Models (LRMs) have advanced automated multi-step reasoning, but their ability to generate complex Chain-of-Thought (CoT) trajectories introduces severe privacy risks, as sensitive information may be deeply embedded throughout the reasoning process. Existing Large Language Model (LLM) unlearning approaches, which typically focus on modifying only final answers, are insufficient for LRMs: they fail to remove sensitive content from intermediate steps, leading to persistent privacy leakage and degraded security. To address these challenges, we propose Sensitive Trajectory Regulation (STaR), a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process. Specifically, we first identify sensitive content via semantic-aware detection. Then, we inject global safety constraints through a secure prompt prefix. Next, we perform trajectory-aware suppression to dynamically block sensitive content across the entire reasoning chain. Finally, we apply token-level adaptive filtering to prevent both exact and paraphrased sensitive tokens during generation. Furthermore, to overcome the inadequacies of existing evaluation protocols, we introduce two metrics: Multi-Decoding Consistency Assessment (MCS), which measures the consistency of unlearning across diverse decoding strategies, and Multi-Granularity Membership Inference Attack (MIA) Evaluation, which quantifies privacy protection at both answer and reasoning-chain levels. Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss, setting a new standard for privacy-preserving reasoning in LRMs.
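The token-level filtering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mask_sensitive_logits`, `greedy_pick`), the toy vocabulary, and the exact-match blocklist are all assumptions made for illustration, and the paraphrase handling the abstract describes is deliberately left out. The idea shown is only the generic one of lowering the scores of blocked tokens before the next token is chosen.

```python
import math


def mask_sensitive_logits(logits, vocab, sensitive_terms, penalty=math.inf):
    """Return a copy of `logits` with blocklisted tokens penalized.

    Hypothetical helper: matching is exact (case-insensitive, whitespace
    stripped); a real system would also need to catch paraphrases.
    """
    blocked = {term.lower() for term in sensitive_terms}
    return [
        score - penalty if token.strip().lower() in blocked else score
        for token, score in zip(vocab, logits)
    ]


def greedy_pick(logits, vocab):
    """Pick the highest-scoring token (greedy decoding)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


# Toy vocabulary and next-token scores, invented for this example.
vocab = ["Alice", " alice", "the", "author", "[REDACTED]"]
logits = [5.0, 4.0, 1.0, 2.0, 0.5]

filtered = mask_sensitive_logits(logits, vocab, {"alice"})
print(greedy_pick(logits, vocab))    # unfiltered choice
print(greedy_pick(filtered, vocab))  # choice after suppression
```

Passing a finite `penalty` instead of the default turns the hard block into a soft down-weighting, which is one plausible reading of "adaptive" filtering, though the abstract does not specify the mechanism.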
Related papers
- SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end. We pioneer the incorporation of risk categories and levels into input design. We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z)
- Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies [4.435429537888066]
Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern. We propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency. Our findings highlight that enhancing the quality of user prompts constitutes a critical and effective strategy for strengthening the security of AI-generated code.
arXiv Detail & Related papers (2025-10-27T02:59:17Z)
- Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense [16.519353449118814]
We analyze a critical vulnerability we term reasoning distraction, where LRMs are diverted from their primary objective by irrelevant yet complex tasks maliciously embedded in the prompt. We show that even state-of-the-art LRMs are highly susceptible, with injected distractors reducing task accuracy by up to 60%. We propose a training-based defense that combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on synthetic adversarial data, improving robustness by over 50 points on challenging distractor attacks.
arXiv Detail & Related papers (2025-10-17T23:16:34Z)
- One Token Embedding Is Enough to Deadlock Your Large Reasoning Model [91.48868589442837]
We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow. Our method achieves a 100% attack success rate across four advanced LRMs.
arXiv Detail & Related papers (2025-10-12T07:42:57Z)
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
- LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LATENTGUARD is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z)
- Steering Multimodal Large Language Models Decoding for Context-Aware Safety [40.668741064553025]
Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications. Existing methods fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks). We introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context.
arXiv Detail & Related papers (2025-09-23T16:32:25Z)
- Inducing Faithfulness in Structured Reasoning via Counterfactual Sensitivity [6.908972852063454]
Large language models often generate a correct answer while relying on a flawed or irrelevant reasoning trace. This paper introduces Counterfactual Sensitivity Regularization (CSR), a novel training objective. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points.
arXiv Detail & Related papers (2025-09-01T15:18:46Z)
- IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement [35.904652937034136]
We introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning. We show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios.
arXiv Detail & Related papers (2025-08-27T16:47:31Z)
- Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation [52.626086874715284]
We introduce a novel problem formulation called Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs. By leveraging abstract interpretation and reasoning about output reachable sets, our approach enables assessing multiple safety levels during the formal verification process. Our contributions include a theoretical exploration of the relationship between our novel abstract safety formulation and existing approaches.
arXiv Detail & Related papers (2025-05-08T13:29:46Z)
- Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Reasoning [58.57194301645823]
Large language models (LLMs) are increasingly integrated into real-world personalized applications. The valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods, which can be generalized as watermarking techniques to protect these knowledge bases, typically involve poisoning or backdoor attacks. We propose a method for "harmless" copyright protection of knowledge bases.
arXiv Detail & Related papers (2025-02-10T09:15:56Z)
- Online Safety Property Collection and Refinement for Safe Deep Reinforcement Learning in Mapless Navigation [79.89605349842569]
We introduce the Collection and Refinement of Online Properties (CROP) framework to design properties at training time.
CROP employs a cost signal to identify unsafe interactions and use them to shape safety properties.
We evaluate our approach in several robotic mapless navigation tasks and demonstrate that the violation metric computed with CROP allows higher returns and lower violations over previous Safe DRL approaches.
arXiv Detail & Related papers (2023-02-13T21:19:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.