Related papers: PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

URL: http://arxiv.org/abs/2507.02332v2
Date: Tue, 19 Aug 2025 04:18:58 GMT
Title: PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Authors: Krishna Kanth Nakka, Xue Jiang, Dmitrii Usynin, Xuebing Zhou,
Abstract summary: This paper focuses on whether manipulating activations can bypass LLM alignment and alter response behaviors to privacy related queries.<n>We identify attention heads of predictive refusal behavior for private attributes using lightweight linear probes trained with privacy evaluator labels.<n>We steer the activations of a small subset of these attention heads guided by the trained probes to induce the model to generate non-refusal responses.
Score: 9.594287563250349
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper investigates privacy jailbreaking in LLMs via steering, focusing on whether manipulating activations can bypass LLM alignment and alter response behaviors to privacy related queries (e.g., a certain public figure's sexual orientation). We begin by identifying attention heads predictive of refusal behavior for private attributes (e.g., sexual orientation) using lightweight linear probes trained with privacy evaluator labels. Next, we steer the activations of a small subset of these attention heads guided by the trained probes to induce the model to generate non-refusal responses. Our experiments show that these steered responses often disclose sensitive attribute details, along with other private information about data subjects such as life events, relationships, and personal histories that the models would typically refuse to produce. Evaluations across four LLMs reveal jailbreaking disclosure rates of at least 95%, with more than 50% on average of these responses revealing true personal information. Our controlled study demonstrates that private information memorized in LLMs can be extracted through targeted manipulation of internal activations.

Related papers

Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models [0.34998703934432673]
We propose the concept of an "LLM gatekeeper", a lightweight, locally run model that filters out sensitive information from user queries before they are sent to the potentially untrustworthy, though highly capable, cloud-based LLM.<n>Through experiments with human subjects, we demonstrate that this dual-model approach introduces minimal overhead while significantly enhancing user privacy, without compromising the quality of LLM responses.
arXiv Detail & Related papers (2025-08-22T19:49:03Z)
MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation [54.410825977390274]
Existing benchmarks to evaluate contextual privacy in LLM-agents primarily assess single-turn, low-complexity tasks.<n>We first present a benchmark - MAGPIE comprising 158 real-life high-stakes scenarios across 15 domains.<n>We then evaluate the current state-of-the-art LLMs on their understanding of contextually private data and their ability to collaborate without violating user privacy.
arXiv Detail & Related papers (2025-06-25T18:04:25Z)
Automated Privacy Information Annotation in Large Language Model Interactions [40.87806981624453]
Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information.<n>Existing privacy detection methods were designed for different objectives and application scenarios.<n>We construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases.
arXiv Detail & Related papers (2025-05-27T09:00:12Z)
Differentially Private Steering for Large Language Model Alignment [55.30573701583768]
We present the first study of aligning Large Language Models with private datasets.<n>Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit activations with differential privacy guarantees.<n>Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance.
arXiv Detail & Related papers (2025-01-30T17:58:36Z)
Investigating Privacy Bias in Training Data of Language Models [1.3167450470598043]
A privacy bias refers to the skew in the appropriateness of information flows within a given context.<n>This skew may either align with existing expectations or signal a symptom of systemic issues.<n>We present a novel approach to assess the privacy biases using a contextual integrity-based methodology.
arXiv Detail & Related papers (2024-09-05T17:50:31Z)
PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories.<n>We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds.<n>State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z)
LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis. Despite the critical nature of this issue, there has been no existing literature to offer a comprehensive assessment of data privacy risks in LLMs. Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
arXiv Detail & Related papers (2024-08-23T01:37:29Z)
Evaluating LLM-based Personal Information Extraction and Countermeasures [63.91918057570824]
Large language model (LLM) based personal information extraction can be benchmarked.<n>LLM can be misused by attackers to accurately extract various personal information from personal profiles.<n> prompt injection can defend against strong LLM-based attacks, reducing the attack to less effective traditional ones.
arXiv Detail & Related papers (2024-08-14T04:49:30Z)
Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory [82.7042006247124]
We show that even the most capable AI models reveal private information in contexts that humans would not, 39% and 57% of the time, respectively. Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
arXiv Detail & Related papers (2023-10-27T04:15:30Z)
Beyond Memorization: Violating Privacy Via Inference with Large Language Models [2.9373912230684565]
We present the first comprehensive study on the capabilities of pretrained language models to infer personal attributes from text. Our findings highlight that current LLMs can infer personal data at a previously unattainable scale.
arXiv Detail & Related papers (2023-10-11T08:32:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.