SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
- URL: http://arxiv.org/abs/2511.07772v1
- Date: Wed, 12 Nov 2025 01:17:12 GMT
- Title: SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
- Authors: Shourya Batra, Pierce Tillman, Samarth Gaggar, Shashank Kesineni, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma, Maheep Chaudhary
- Abstract summary: Large Language Models (LLMs) are evolving into personal assistants with access to sensitive user data. Recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. We introduce Steering Activations towards Leakage-free Thinking (SALT), a lightweight test-time intervention that mitigates privacy leakage in a model's Chain of Thought.
- Score: 8.165127822088499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) evolve into personal assistants with access to sensitive user data, they face a critical privacy challenge: while prior work has addressed output-level privacy, recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. These leaky thoughts occur when models inadvertently expose sensitive details in their reasoning traces, even when final outputs appear safe. The challenge lies in preventing such leakage without compromising the model's reasoning capabilities, requiring a delicate balance between privacy and utility. We introduce Steering Activations towards Leakage-free Thinking (SALT), a lightweight test-time intervention that mitigates privacy leakage in a model's Chain of Thought (CoT) by injecting targeted steering vectors into the hidden states of the high-leakage layers we identify as responsible for this behavior. Through experiments across multiple LLMs on the contextual privacy leakage dataset AirGapAgent-R, we demonstrate that SALT reduces contextual privacy leakage (CPL) by $18.2\%$ on QwQ-32B, $17.9\%$ on Llama-3.1-8B, and $31.2\%$ on DeepSeek, while maintaining comparable task performance and utility. Our work establishes SALT as a practical approach for test-time privacy protection in reasoning-capable language models, offering a path toward safer deployment of LLM-based personal agents.
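To make the intervention concrete, here is a minimal sketch of test-time activation steering in the spirit of SALT, assuming a HuggingFace-style PyTorch decoder. The model name, layer index, steering strength, and the random stand-in for the learned steering vector are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of test-time activation steering (SALT-style).
# Assumptions (not from the paper): a HuggingFace causal LM, one
# "high-leakage" layer, and a precomputed steering vector v.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 18   # hypothetical high-leakage layer, not the paper's value
alpha = 4.0      # hypothetical steering strength
v = torch.randn(model.config.hidden_size)  # stand-in for a learned leakage direction
v = v / v.norm()

def steer(module, inputs, output):
    # Decoder layers return a tuple; the hidden states are its first element.
    hidden = output[0] + alpha * v.to(output[0].device, output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steer)
try:
    ids = tok("Summarize my medical appointment for my manager.",
              return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=128)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

In SALT proper, the vector is derived to point away from leakage-correlated directions and injected only at the identified high-leakage layers; the forward hook above is simply one standard way to apply such an offset at inference time.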
Related papers
- NeuroFilter: Privacy Guardrails for Conversational LLM Agents [50.75206727081996]
This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs). NeuroFilter is a guardrail framework that operationalizes contextual integrity by mapping norm violations to simple directions in the model's activation space. A comprehensive evaluation across over 150,000 interactions, covering models from 7B to 70B parameters, illustrates the strong performance of NeuroFilter.
arXiv Detail & Related papers (2026-01-21T05:16:50Z)
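A toy sketch of NeuroFilter's direction-based idea, under the assumption that a "norm violation" direction has already been fit from labeled activations; the direction, threshold, and dimensionality below are placeholders rather than anything from the paper.

```python
# Toy sketch: flag an activation whose projection onto a learned
# "norm violation" direction exceeds a threshold. w and tau are
# placeholders; in practice they would be fit from labeled activations.
import torch

hidden_size = 4096        # placeholder model width
w = torch.randn(hidden_size)
w = w / w.norm()          # unit violation direction (placeholder)
tau = 2.5                 # decision threshold (placeholder)

def violates_norm(h: torch.Tensor) -> bool:
    # Score the activation by its projection onto the violation direction.
    return (h @ w).item() > tau

print(violates_norm(torch.randn(hidden_size)))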
- HearSay Benchmark: Do Audio LLMs Leak What They Hear? [71.05839007164776]
This paper investigates whether Audio Large Language Models inadvertently leak user privacy solely through acoustic voiceprints. HearSay is a comprehensive benchmark constructed from over 22,000 real-world audio clips. Experiments reveal significant privacy leakage and yield three critical findings.
arXiv Detail & Related papers (2026-01-07T10:33:44Z)
- SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking [58.475471437150674]
We propose sequential watermarking for soft prompts (SWAP), which encodes watermarks through a specific order of defender-specified out-of-distribution classes. Experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential adaptive attacks.
arXiv Detail & Related papers (2025-11-05T13:48:48Z)
- Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models [15.90085929279269]
Self-Sanitize is a novel LLM-driven mitigation framework inspired by cognitive psychology. It emulates human self-monitoring and self-repair behaviors during conversations, achieving superior mitigation performance with minimal overhead and without degrading the utility of LLMs.
arXiv Detail & Related papers (2025-09-29T08:59:44Z)
- Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation [26.573578326262307]
Privacy-Aware Decoding (PAD) is a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains.
arXiv Detail & Related papers (2025-08-05T05:22:13Z)
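As a rough illustration of PAD's decoding-time mechanism, the sketch below adds Gaussian noise to the logits at every generation step via a HuggingFace LogitsProcessor. PAD's confidence-based screening and sensitivity estimation are collapsed here into a single fixed noise scale, so this is a simplification of the idea rather than the paper's algorithm.

```python
# Simplified sketch of privacy-aware decoding: perturb token logits
# at each step. A fixed sigma stands in for PAD's calibrated,
# token-selective noise.
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class GaussianNoiseLogitsProcessor(LogitsProcessor):
    """Adds i.i.d. Gaussian noise to every token logit at each decoding step."""

    def __init__(self, sigma: float):
        self.sigma = sigma

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        return scores + self.sigma * torch.randn_like(scores)

processors = LogitsProcessorList([GaussianNoiseLogitsProcessor(sigma=0.5)])
# Usage with any HuggingFace causal LM (model/ids as in the earlier sketch):
# out = model.generate(ids, do_sample=True, logits_processor=processors)
```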
- PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage [9.594287563250349]
This paper focuses on whether manipulating activations can bypass LLM alignment and alter response behaviors to privacy-related queries. We identify attention heads predictive of refusal behavior for private attributes using lightweight linear probes trained with privacy-evaluator labels, then steer the activations of a small subset of these heads, guided by the trained probes, to induce the model to generate non-refusal responses.
arXiv Detail & Related papers (2025-07-03T05:50:50Z)
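A minimal sketch of the probing step described above, assuming activations for each attention head have already been cached and labeled by a privacy evaluator; the array shapes and random data are placeholders.

```python
# Toy sketch: fit one logistic-regression probe per attention head on
# cached activations, then rank heads by how well they predict refusal.
# Random data stands in for real cached activations and evaluator labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_samples, n_heads, head_dim = 512, 32, 128           # placeholder shapes
acts = np.random.randn(n_samples, n_heads, head_dim)  # cached per-head activations
labels = np.random.randint(0, 2, size=n_samples)      # evaluator labels (refuse=1)

scores = []
for h in range(n_heads):
    probe = LogisticRegression(max_iter=1000).fit(acts[:, h, :], labels)
    scores.append(probe.score(acts[:, h, :], labels))

top_heads = np.argsort(scores)[::-1][:8]  # heads most predictive of refusal
print(top_heads)
```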
- PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders [8.483679748399037]
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing but pose privacy risks by memorizing and leaking Personally Identifiable Information (PII). Existing mitigation strategies, such as differential privacy and neuron-level interventions, often degrade model utility or fail to effectively prevent leakage. We introduce PrivacyScalpel, a novel privacy-preserving framework that leverages interpretability techniques to identify and mitigate PII leakage while maintaining performance.
arXiv Detail & Related papers (2025-03-14T09:31:01Z)
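A toy sketch of the sparse-autoencoder intervention pattern: encode an activation into sparse features, zero out the features associated with PII, and decode back. The SAE weights and the PII feature indices below are placeholders; in PrivacyScalpel they come from trained SAEs and a probing analysis.

```python
# Toy sketch of SAE-based feature intervention on one activation vector.
# Untrained Linear layers stand in for a trained SAE encoder/decoder.
import torch
import torch.nn as nn

d_model, d_sae = 4096, 16384      # placeholder widths
enc = nn.Linear(d_model, d_sae)   # stand-in for a trained SAE encoder
dec = nn.Linear(d_sae, d_model)   # stand-in for a trained SAE decoder
pii_features = [12, 345, 678]     # hypothetical PII-linked feature indices

def scrub(h: torch.Tensor) -> torch.Tensor:
    f = torch.relu(enc(h))        # sparse feature activations
    f[..., pii_features] = 0.0    # ablate the PII-associated features
    return dec(f)                 # reconstruct the scrubbed activation

print(scrub(torch.randn(d_model)).shape)
```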
- Differentially Private Steering for Large Language Model Alignment [55.30573701583768]
We present the first study of aligning Large Language Models with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit activations with differential privacy guarantees. Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance.
arXiv Detail & Related papers (2025-01-30T17:58:36Z)
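One generic way such a guarantee can be obtained is the Gaussian mechanism: clip each example's activation, average, and add noise scaled to the clipped sensitivity. The sketch below illustrates that recipe; the clipping bound, noise multiplier, and sensitivity accounting are placeholder choices, not PSA's actual algorithm.

```python
# Toy sketch of a DP steering vector via the Gaussian mechanism:
# clipped mean difference of activations plus calibrated noise.
import torch

def dp_steering_vector(pos: torch.Tensor, neg: torch.Tensor,
                       clip: float = 1.0, noise_mult: float = 1.0) -> torch.Tensor:
    """pos, neg: (n, d) activations from private positive/negative examples."""
    def clipped_mean(x: torch.Tensor) -> torch.Tensor:
        norms = x.norm(dim=1, keepdim=True).clamp(min=1e-8)
        return (x * (clip / norms).clamp(max=1.0)).mean(dim=0)  # per-example clipping

    v = clipped_mean(pos) - clipped_mean(neg)
    # Gaussian noise scaled to the (clipped) per-example sensitivity.
    sigma = noise_mult * 2.0 * clip / min(pos.shape[0], neg.shape[0])
    return v + sigma * torch.randn_like(v)

v = dp_steering_vector(torch.randn(64, 4096), torch.randn(64, 4096))
print(v.shape)
```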
- PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories. We instantiate PrivacyLens with a collection of privacy norms grounded in the privacy literature and crowdsourced seeds. State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z)
- Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory [82.7042006247124]
We show that even the most capable AI models, GPT-4 and ChatGPT, reveal private information in contexts that humans would not, 39% and 57% of the time, respectively.
Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
arXiv Detail & Related papers (2023-10-27T04:15:30Z)