Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models
- URL: http://arxiv.org/abs/2509.24488v1
- Date: Mon, 29 Sep 2025 08:59:44 GMT
- Title: Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models
- Authors: Wenjie Fu, Huandong Wang, Junyao Gao, Guoan Wan, Tao Jiang,
- Abstract summary: Self-Sanitize is a novel LLM-driven mitigation framework inspired by cognitive psychology. It emulates human self-monitor and self-repair behaviors during conversations. It achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs.
- Score: 15.90085929279269
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) achieve remarkable success across a wide range of applications, such as chatbots and code copilots, concerns surrounding the generation of harmful content have come increasingly into focus. Despite significant advances in aligning LLMs with safety and ethical standards, adversarial prompts can still be crafted to elicit undesirable responses. Existing mitigation strategies are predominantly based on post-hoc filtering, which introduces substantial latency or computational overhead, and is incompatible with token-level streaming generation. In this work, we introduce Self-Sanitize, a novel LLM-driven mitigation framework inspired by cognitive psychology, which emulates human self-monitor and self-repair behaviors during conversations. Self-Sanitize comprises a lightweight Self-Monitor module that continuously inspects high-level intentions within the LLM at the token level via representation engineering, and a Self-Repair module that performs in-place correction of harmful content without initiating separate review dialogues. This design allows for real-time streaming monitoring and seamless repair, with negligible impact on latency and resource utilization. Given that privacy-invasive content has often been insufficiently focused in previous studies, we perform extensive experiments on four LLMs across three privacy leakage scenarios. The results demonstrate that Self-Sanitize achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs, offering a practical and robust solution for safer LLM deployments. Our code is available at the following link: https://github.com/wjfu99/LLM_Self_Sanitize
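The monitor-then-repair loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (their code is at the repository linked above); it assumes a simple logistic probe over per-token hidden states as a stand-in for the representation-engineering monitor, and a caller-supplied `repair` callback as a stand-in for Self-Repair. All names here (`probe_score`, `stream_with_monitor`, `window`, `threshold`) are hypothetical.

```python
import math
from typing import Callable, Iterable, Iterator, List, Tuple

Vector = List[float]


def probe_score(hidden: Vector, weights: Vector, bias: float) -> float:
    """Logistic probe: estimated probability that a hidden state signals leakage."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def stream_with_monitor(
    steps: Iterable[Tuple[str, Vector]],
    weights: Vector,
    bias: float,
    repair: Callable[[List[str]], List[str]],
    threshold: float = 0.5,
    window: int = 2,
) -> Iterator[str]:
    """Stream tokens while checking each step's hidden state.

    The last `window` tokens are held back so that, if the probe fires,
    they can be replaced by the repair output before the user sees them.
    Assumption: generation stops after a repair in this toy version.
    """
    withheld: List[str] = []
    for token, hidden in steps:
        withheld.append(token)
        if probe_score(hidden, weights, bias) >= threshold:
            # Probe fired: swap the unreleased tokens for the repaired text.
            yield from repair(withheld)
            return
        if len(withheld) > window:
            # Oldest token has cleared the look-back window; release it.
            yield withheld.pop(0)
    # No flag raised: flush whatever is still held back.
    yield from withheld
```

Because only a dot product and a sigmoid run per token, the check adds little work to each decoding step, which is the property the abstract emphasizes; the size of the hold-back window trades a small emission delay for the ability to correct text before it is shown.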
Related papers
- From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation [3.75886080255807]
We introduce two innovative attack frameworks designed to generate dynamic and adaptive adversarial examples.
We produce subtle and natural-looking adversarial inputs that preserve semantic similarity to the original text.
Our attacks evolve with the advancements in LLMs and demonstrate strong transferability across models unknown to the attacker.
arXiv Detail & Related papers (2025-11-05T02:27:56Z)
- Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs [72.08224879435762]
Learn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents.
Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service.
arXiv Detail & Related papers (2025-10-29T12:08:07Z)
- From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring [12.505882642773829]
Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output.
Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected.
We propose the streaming content monitor, which is trained with dual supervision of response- and token-level labels.
arXiv Detail & Related papers (2025-06-11T17:59:58Z)
- Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs).
We provide evidence supporting the presence of LLMs' internal compass that dictates the correctness of factual recall at the time of generation.
Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z)
- What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We investigate the vulnerability of large language models (LLMs) to imperceptible attacks, where hidden character manipulation in source code misleads LLMs' behaviour while remaining undetectable to human reviewers.
These attacks include coding reordering, invisible coding characters, code deletions, and code homoglyphs.
Our findings confirm the susceptibility of LLMs to imperceptible coding character attacks, while different LLMs present different negative correlations between perturbation magnitude and performance.
arXiv Detail & Related papers (2024-12-11T04:52:41Z)
- Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks.
This vulnerability poses significant risks to real-world applications.
We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
- DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs [23.441711206966914]
Diesel is a lightweight inference-guidance technique that can be seamlessly integrated into any autoregressive LLM.
It semantically filters undesired concepts from the response.
Our evaluation demonstrates Diesel's effectiveness on state-of-the-art conversational models.
arXiv Detail & Related papers (2024-11-28T10:33:11Z)
- From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning [91.79567270986901]
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses.
Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue.
We propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective.
arXiv Detail & Related papers (2024-09-03T07:01:37Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models [12.45822383965784]
We introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method.
Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens.
arXiv Detail & Related papers (2024-02-15T16:21:14Z)
- Machine Unlearning in Large Language Models [8.14992136443131]
This paper introduces a novel machine unlearning framework into large language models.
Our objectives are to make LLMs not produce harmful, hallucinatory, or privacy-compromising responses.
Experimental results show that our approach effectively meets unlearning objectives without substantially compromising model performance.
arXiv Detail & Related papers (2024-02-03T05:14:56Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)