PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage
- URL: http://arxiv.org/abs/2412.05734v1
- Date: Sat, 07 Dec 2024 20:09:01 GMT
- Title: PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage
- Authors: Yuzhou Nie, Zhun Wang, Ye Yu, Xian Wu, Xuandong Zhao, Wenbo Guo, Dawn Song,
- Abstract summary: LLMs may be fooled into outputting private information under carefully crafted adversarial prompts.
PrivAgent is a novel black-box red-teaming framework for privacy leakage.
- Score: 78.33839735526769
- License:
- Abstract: Recent studies have discovered that LLMs have serious privacy leakage concerns, where an LLM may be fooled into outputting private information under carefully crafted adversarial prompts. These risks include leaking system prompts, personally identifiable information, training data, and model parameters. Most existing red-teaming approaches for privacy leakage rely on humans to craft the adversarial prompts. A few automated methods are proposed for system prompt extraction, but they cannot be applied to more severe risks (e.g., training data extraction) and have limited effectiveness even for system prompt extraction. In this paper, we propose PrivAgent, a novel black-box red-teaming framework for LLM privacy leakage. We formulate different risks as a search problem with a unified attack goal. Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts for different target models under different risks. We propose a novel reward function to provide effective and fine-grained rewards for the attack agent. Finally, we introduce customizations to better fit our general framework to system prompt extraction and training data extraction. Through extensive evaluations, we first show that PrivAgent outperforms existing automated methods in system prompt leakage against six popular LLMs. Notably, our approach achieves a 100% success rate in extracting system prompts from real-world applications in OpenAI's GPT Store. We also show PrivAgent's effectiveness in extracting training data from an open-source LLM with a success rate of 5.9%. We further demonstrate PrivAgent's effectiveness in evading the existing guardrail defense and its helpfulness in enabling better safety alignment. Finally, we validate our customized designs through a detailed ablation study. We release our code here https://github.com/rucnyz/RedAgent.
Related papers
- Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach [9.483655213280738]
This paper presents a novel approach to evaluating the security of large language models (LLMs)
We define prompt leakage as a critical threat to secure LLM deployment.
We implement a multi-agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt.
arXiv Detail & Related papers (2025-02-18T08:17:32Z) - Unveiling Privacy Risks in LLM Agent Memory [40.26158509307175]
Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications.
They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations.
We propose a Memory EXTRaction Attack (MEXTRA) to extract private information from memory.
arXiv Detail & Related papers (2025-02-17T19:55:53Z) - SecAlign: Defending Against Prompt Injection with Preference Optimization [52.48001255555192]
Adrial prompts can be injected into external data sources to override the system's intended instruction and execute a malicious instruction.
We propose a new defense called SecAlign based on the technique of preference optimization.
Our method reduces the success rates of various prompt injections to around 0%, even against attacks much more sophisticated than ones seen during training.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning [79.07152553060601]
Existing methods for enhancing the safety of large language models (LLMs) are not directly transferable to LLM-powered agents.
We propose GuardAgent, the first LLM agent as a guardrail to other LLM agents.
GuardAgent comprises two steps: 1) creating a task plan by analyzing the provided guard requests, and 2) generating guardrail code based on the task plan and executing the code by calling APIs or using external engines.
arXiv Detail & Related papers (2024-06-13T14:49:26Z) - BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents [26.057916556444333]
We show that such methods are vulnerable to our proposed backdoor attacks named BadAgent.
Our proposed attack methods are extremely robust even after fine-tuning on trustworthy data.
arXiv Detail & Related papers (2024-06-05T07:14:28Z) - Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker.
We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting.
We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z) - CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks.
We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities.
We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z) - The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented
Generation (RAG) [56.67603627046346]
Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model with proprietary and private data.
In this work, we conduct empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database.
arXiv Detail & Related papers (2024-02-23T18:35:15Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.