Related papers: PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage

PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage

URL: http://arxiv.org/abs/2412.05734v1
Date: Sat, 07 Dec 2024 20:09:01 GMT
Title: PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage
Authors: Yuzhou Nie, Zhun Wang, Ye Yu, Xian Wu, Xuandong Zhao, Wenbo Guo, Dawn Song,
Abstract summary: LLMs may be fooled into outputting private information under carefully crafted adversarial prompts.<n>PrivAgent is a novel black-box red-teaming framework for privacy leakage.
Score: 78.33839735526769
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies have discovered that LLMs have serious privacy leakage concerns, where an LLM may be fooled into outputting private information under carefully crafted adversarial prompts. These risks include leaking system prompts, personally identifiable information, training data, and model parameters. Most existing red-teaming approaches for privacy leakage rely on humans to craft the adversarial prompts. A few automated methods are proposed for system prompt extraction, but they cannot be applied to more severe risks (e.g., training data extraction) and have limited effectiveness even for system prompt extraction. In this paper, we propose PrivAgent, a novel black-box red-teaming framework for LLM privacy leakage. We formulate different risks as a search problem with a unified attack goal. Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts for different target models under different risks. We propose a novel reward function to provide effective and fine-grained rewards for the attack agent. Finally, we introduce customizations to better fit our general framework to system prompt extraction and training data extraction. Through extensive evaluations, we first show that PrivAgent outperforms existing automated methods in system prompt leakage against six popular LLMs. Notably, our approach achieves a 100% success rate in extracting system prompts from real-world applications in OpenAI's GPT Store. We also show PrivAgent's effectiveness in extracting training data from an open-source LLM with a success rate of 5.9%. We further demonstrate PrivAgent's effectiveness in evading the existing guardrail defense and its helpfulness in enabling better safety alignment. Finally, we validate our customized designs through a detailed ablation study. We release our code here https://github.com/rucnyz/RedAgent.

Related papers

Enhancing Jailbreak Attacks on LLMs via Persona Prompts [39.73624426612256]
Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities.<n>Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts.<n>We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLM's safety mechanisms.
arXiv Detail & Related papers (2025-07-28T12:03:22Z)
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
Exploring Backdoor Attack and Defense for LLM-empowered Recommendations [15.098844020816552]
We propose a new attack framework termed Backdoor Injection Poisoning for RecSys (BadRec) BadRec perturbs the items' titles with triggers and employs several fake users to interact with these items, effectively poisoning the training set and injecting backdoors into RecSys. We propose a universal defense strategy called Poison Scanner (P-Scanner) to mitigate such a security threat.
arXiv Detail & Related papers (2025-04-15T13:37:38Z)
CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent [32.958798200220286]
Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience. We propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs. Our method first identifies the insertion position for maximum impact with minimal input modification.
arXiv Detail & Related papers (2025-04-13T05:31:37Z)
Unveiling Privacy Risks in LLM Agent Memory [40.26158509307175]
Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations. We propose a Memory EXTRaction Attack (MEXTRA) to extract private information from memory.
arXiv Detail & Related papers (2025-02-17T19:55:53Z)
Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors [15.861833242429228]
We investigate data extraction attacks targeting RAG's knowledge databases. We show that previous prompt injection-based extraction attacks largely rely on the instruction-following capabilities of LLMs. We propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM.
arXiv Detail & Related papers (2024-11-03T22:27:40Z)
Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks. Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks. Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
Evaluating Large Language Model based Personal Information Extraction and Countermeasures [63.91918057570824]
Large language model (LLM) can be misused by attackers to accurately extract various personal information from personal profiles. LLM outperforms conventional methods at such extraction. prompt injection can mitigate such risk to a large extent and outperforms conventional countermeasures.
arXiv Detail & Related papers (2024-08-14T04:49:30Z)
MaPPing Your Model: Assessing the Impact of Adversarial Attacks on LLM-based Programming Assistants [14.947665219536708]
We introduce the Malicious Programming Prompt (MaPP) attack, in which an attacker adds a small amount of text to a prompt for a programming task. We show that our prompt strategy can cause an LLM to add vulnerabilities while continuing to write otherwise correct code.
arXiv Detail & Related papers (2024-07-12T22:30:35Z)
BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents [26.057916556444333]
We show that such methods are vulnerable to our proposed backdoor attacks named BadAgent. Our proposed attack methods are extremely robust even after fine-tuning on trustworthy data.
arXiv Detail & Related papers (2024-06-05T07:14:28Z)
Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker. We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting. We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z)
CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z)
The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) [56.67603627046346]
Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model with proprietary and private data. In this work, we conduct empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database.
arXiv Detail & Related papers (2024-02-23T18:35:15Z)
Learning to Poison Large Language Models During Instruction Tuning [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently. We propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL)
arXiv Detail & Related papers (2024-02-21T01:30:03Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks. We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks. We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z)
Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.