Controllable Reasoning Models Are Private Thinkers
- URL: http://arxiv.org/abs/2602.24210v1
- Date: Fri, 27 Feb 2026 17:39:10 GMT
- Title: Controllable Reasoning Models Are Private Thinkers
- Authors: Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych
- Abstract summary: We propose training models to follow instructions not only in the final answer, but also in reasoning traces. We fine-tune models on an instruction-following dataset with explicit restrictions on reasoning traces. Our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy.
- Score: 74.40231123523115
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction-following abilities in reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models
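The decoupled generation strategy can be pictured as two LoRA adapters sharing one base model: a reasoning adapter produces the constrained trace, then an answer adapter, conditioned on that trace, produces the reply. Below is a minimal sketch using Hugging Face peft; the base checkpoint, adapter paths, prompt, and `<think>` delimiters are illustrative assumptions, not the paper's released artifacts.

```python
# Minimal sketch of decoupled reasoning/answer generation, assuming two LoRA
# adapters trained separately for the reasoning trace and for the answer.
# Checkpoint name, adapter paths, and <think> delimiters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-1.7B"  # any backbone in the evaluated 1.7B-14B range
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Attach both adapters to the same base model; set_adapter() switches which
# one is active, so reasoning and answering run through different weights.
model = PeftModel.from_pretrained(model, "path/to/reasoning-lora",
                                  adapter_name="reasoning")
model.load_adapter("path/to/answer-lora", adapter_name="answer")

def generate(text: str, adapter: str, max_new_tokens: int = 512) -> str:
    model.set_adapter(adapter)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

prompt = "Summarize this medical note for billing without naming the diagnosis: ..."

# Stage 1: the reasoning adapter produces the constrained trace.
trace = generate(prompt + "\n<think>", adapter="reasoning")
# Stage 2: the answer adapter is conditioned on the prompt plus the trace.
answer = generate(prompt + "\n<think>" + trace + "</think>\n", adapter="answer")
print(answer)
```

Keeping a single base model and swapping adapters avoids loading two full models, at the cost of one extra forward pass per query for the two-stage decoding.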
Related papers
- Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning [22.45731787625021]
Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models. Current methods rely on rigid templates to specify reasoning primitives. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge.
arXiv Detail & Related papers (2026-02-09T00:10:17Z) - A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior [11.616524876789624]
LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood. We introduce Normalized Simulatability Gain (NSG), a metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria. We find self-explanations substantially improve prediction of model behavior (11-37% NSG).
arXiv Detail & Related papers (2026-02-02T18:54:51Z) - When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents [2.689316553293938]
Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools and the final answer generation for conversational agents.
arXiv Detail & Related papers (2025-12-12T04:44:40Z) - UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning [51.54456545661045]
We introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives. To achieve this, we propose a two-stage training framework: supervised fine-tuning and reinforcement learning. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks.
arXiv Detail & Related papers (2025-10-23T07:18:32Z) - Evaluating Language Model Reasoning about Confidential Information [95.64687778185703]
We study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. We develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized. We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. A minimal sketch of this password-gated setup appears after this list.
arXiv Detail & Related papers (2025-08-27T15:39:46Z) - Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following [37.69688837528397]
Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision.
arXiv Detail & Related papers (2025-08-04T07:48:59Z) - Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability [70.4107059502882]
Training language models with rationale augmentation has been shown to be beneficial in many existing works. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance.
arXiv Detail & Related papers (2025-05-30T02:39:37Z) - Can Large Reasoning Models Self-Train? [51.0277533541394]
We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning. We find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration. Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse. A minimal sketch of this majority-voting reward appears after this list.
arXiv Detail & Related papers (2025-05-27T17:16:00Z) - ARMOR: Shielding Unlearnable Examples against Data Augmentation [25.289775916629505]
We propose a framework, dubbed ARMOR, to protect data privacy from breaches enabled by data augmentation. ARMOR reduces the test accuracy of a model trained on augmented protected samples by as much as 60% more than baselines.
arXiv Detail & Related papers (2025-01-15T15:22:57Z) - Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data in a privacy-preserving manner.
Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z) - Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z)