LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage
- URL: http://arxiv.org/abs/2412.05734v2
- Date: Fri, 08 Aug 2025 09:27:21 GMT
- Title: LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage
- Authors: Yuzhou Nie, Zhun Wang, Ye Yu, Xian Wu, Xuandong Zhao, Wenbo Guo, Dawn Song,
- Abstract summary: LeakAgent is a novel black-box red-teaming framework for privacy leakage.<n>Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts.<n>We show that LeakAgent significantly outperforms existing rule-based approaches in training data extraction and automated methods in system prompt leakage.
- Score: 78.33839735526769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have discovered that large language models (LLM) may be ``fooled'' to output private information, including training data, system prompts, and personally identifiable information, under carefully crafted adversarial prompts. Existing red-teaming approaches for privacy leakage either rely on manual efforts or focus solely on system prompt extraction, making them ineffective for severe risks of training data leakage. We propose LeakAgent, a novel black-box red-teaming framework for LLM privacy leakage. Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts for both training data extraction and system prompt extraction. To achieve this, we propose a novel reward function to provide effective and fine-grained rewards and design novel mechanisms to balance exploration and exploitation during learning and enhance the diversity of adversarial prompts. Through extensive evaluations, we first show that LeakAgent significantly outperforms existing rule-based approaches in training data extraction and automated methods in system prompt leakage. We also demonstrate the effectiveness of LeakAgent in extracting system prompts from real-world applications in OpenAI's GPT Store. We further demonstrate LeakAgent's effectiveness in evading the existing guardrail defense and its helpfulness in enabling better safety alignment. Finally, we validate our customized designs through a detailed ablation study. We release our code here https://github.com/rucnyz/LeakAgent.
Related papers
- The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - Adversarial Reinforcement Learning for Large Language Model Agent Safety [20.704989548285372]
Large Language Model (LLM) agents can leverage tools like Google Search to complete complex tasks.<n>Current defense strategies rely on fine-tuning LLM agents on datasets of known attacks.<n>We propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game.
arXiv Detail & Related papers (2025-10-06T23:09:18Z) - WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning [73.91893534088798]
WebSailor is a complete post-training methodology designed to instill this crucial capability.<n>Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation.<n>WebSailor significantly outperforms all open-source agents in complex information-seeking tasks.
arXiv Detail & Related papers (2025-09-16T17:57:03Z) - Enhancing Jailbreak Attacks on LLMs via Persona Prompts [39.73624426612256]
Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities.<n>Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts.<n>We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLM's safety mechanisms.
arXiv Detail & Related papers (2025-07-28T12:03:22Z) - System Prompt Extraction Attacks and Defenses in Large Language Models [2.6986500640871482]
System prompt in Large Language Models (LLMs) plays a pivotal role in guiding model behavior and response generation.<n>Recent studies have shown that LLM system prompts are highly susceptible to extraction attacks through meticulously designed queries.<n>Despite the growing threat, there is a lack of systematic studies of system prompt extraction attacks and defenses.
arXiv Detail & Related papers (2025-05-27T21:36:27Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - Exploring Backdoor Attack and Defense for LLM-empowered Recommendations [15.098844020816552]
We propose a new attack framework termed Backdoor Injection Poisoning for RecSys (BadRec)
BadRec perturbs the items' titles with triggers and employs several fake users to interact with these items, effectively poisoning the training set and injecting backdoors into RecSys.
We propose a universal defense strategy called Poison Scanner (P-Scanner) to mitigate such a security threat.
arXiv Detail & Related papers (2025-04-15T13:37:38Z) - CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent [32.958798200220286]
Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience.
We propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs.
Our method first identifies the insertion position for maximum impact with minimal input modification.
arXiv Detail & Related papers (2025-04-13T05:31:37Z) - Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach [9.483655213280738]
This paper presents a novel approach to evaluating the security of large language models (LLMs)<n>We define prompt leakage as a critical threat to secure LLM deployment.<n>We implement a multi-agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt.
arXiv Detail & Related papers (2025-02-18T08:17:32Z) - Unveiling Privacy Risks in LLM Agent Memory [40.26158509307175]
Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications.
They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations.
We propose a Memory EXTRaction Attack (MEXTRA) to extract private information from memory.
arXiv Detail & Related papers (2025-02-17T19:55:53Z) - Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors [15.861833242429228]
We investigate data extraction attacks targeting RAG's knowledge databases.
We show that previous prompt injection-based extraction attacks largely rely on the instruction-following capabilities of LLMs.
We propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM.
arXiv Detail & Related papers (2024-11-03T22:27:40Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.<n>We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.<n>We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - Evaluating Large Language Model based Personal Information Extraction and Countermeasures [63.91918057570824]
Large language model (LLM) can be misused by attackers to accurately extract various personal information from personal profiles.
LLM outperforms conventional methods at such extraction.
prompt injection can mitigate such risk to a large extent and outperforms conventional countermeasures.
arXiv Detail & Related papers (2024-08-14T04:49:30Z) - MaPPing Your Model: Assessing the Impact of Adversarial Attacks on LLM-based Programming Assistants [14.947665219536708]
We introduce the Malicious Programming Prompt (MaPP) attack, in which an attacker adds a small amount of text to a prompt for a programming task.
We show that our prompt strategy can cause an LLM to add vulnerabilities while continuing to write otherwise correct code.
arXiv Detail & Related papers (2024-07-12T22:30:35Z) - BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents [26.057916556444333]
We show that such methods are vulnerable to our proposed backdoor attacks named BadAgent.
Our proposed attack methods are extremely robust even after fine-tuning on trustworthy data.
arXiv Detail & Related papers (2024-06-05T07:14:28Z) - Get my drift? Catching LLM Task Drift with Activation Deltas [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.<n>We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.<n>We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker.
We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting.
We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z) - CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks.
We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities.
We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z) - The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented
Generation (RAG) [56.67603627046346]
Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model with proprietary and private data.
In this work, we conduct empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database.
arXiv Detail & Related papers (2024-02-23T18:35:15Z) - Learning to Poison Large Language Models During Instruction Tuning [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process.
We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently.
We propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL)
arXiv Detail & Related papers (2024-02-21T01:30:03Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on
Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks.
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks.
We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.