PRSA: PRompt Stealing Attacks against Large Language Models
- URL: http://arxiv.org/abs/2402.19200v2
- Date: Sat, 8 Jun 2024 03:43:12 GMT
- Title: PRSA: PRompt Stealing Attacks against Large Language Models
- Authors: Yong Yang, Changjiang Li, Yi Jiang, Xi Chen, Haoyu Wang, Xuhong Zhang, Zonghui Wang, Shouling Ji,
- Abstract summary: "prompt as a service" has greatly enhanced the utility of large language models (LLMs)
We introduce a novel attack framework, PRSA, designed for prompt stealing attacks against LLMs.
PRSA mainly consists of two key phases: prompt mutation and prompt pruning.
- Score: 42.07328505384544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, "prompt as a service" has greatly enhanced the utility of large language models (LLMs) by enabling them to perform various downstream tasks efficiently without fine-tuning. This has also increased the commercial value of prompts. However, the potential risk of leakage in these commercialized prompts remains largely underexplored. In this paper, we introduce a novel attack framework, PRSA, designed for prompt stealing attacks against LLMs. The main idea of PRSA is to infer the intent behind a prompt by analyzing its input-output content, enabling the generation of a surrogate prompt that replicates the original's functionality. Specifically, PRSA mainly consists of two key phases: prompt mutation and prompt pruning. In the mutation phase, we propose a prompt attention algorithm based on output difference. The algorithm facilitates the generation of effective surrogate prompts by learning key factors that influence the accurate inference of prompt intent. During the pruning phase, we employ a two-step related word identification strategy to detect and mask words that are highly related to the input, thus improving the generalizability of the surrogate prompts. We verify the actual threat of PRSA through evaluation in both real-world settings, non-interactive and interactive prompt services. The results strongly confirm the PRSA's effectiveness and generalizability. We have reported these findings to prompt service providers and actively collaborate with them to implement defensive measures.
Related papers
- Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers [61.57691030102618]
We propose a novel jailbreaking method, Paper Summary Attack (llmnamePSA)<n>It synthesizes content from either attack-focused or defense-focused LLM safety paper to construct an adversarial prompt template.<n>Experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning model like Deepseek-R1.
arXiv Detail & Related papers (2025-07-17T18:33:50Z) - Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.<n>We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs [28.75283403986172]
Large Language Models (LLMs) are vulnerable to prompt-based attacks, generating harmful content or sensitive information.<n>This paper studies effective prompt injection attacks against the $mathbf14$ most popular open-source LLMs on five attack benchmarks.
arXiv Detail & Related papers (2025-05-20T13:50:43Z) - Practical Reasoning Interruption Attacks on Reasoning Large Language Models [0.24963930962128378]
Reasoning large language models (RLLMs) have demonstrated outstanding performance across a variety of tasks, yet they also expose numerous security vulnerabilities.<n>Recent work has identified a distinct "thinking-stopped" vulnerability in DeepSeek-R1 under adversarial prompts.<n>We develop a novel prompt injection attack, termed reasoning interruption attack, and offer an initial analysis of its root cause.
arXiv Detail & Related papers (2025-05-10T13:36:01Z) - Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression [12.215295420714787]
"Reasoning Interruption Attack" is a prompt injection attack based on adaptive token compression.
We develop a systematic approach to efficiently collect attack prompts and an adaptive token compression framework.
Experiments show our compression framework significantly reduces prompt length while maintaining effective attack capabilities.
arXiv Detail & Related papers (2025-04-29T07:34:22Z) - REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective [57.57786477441956]
We propose an adaptive and semantic optimization problem over the population of responses.
Our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
arXiv Detail & Related papers (2025-02-24T15:34:48Z) - Towards Action Hijacking of Large Language Model-based Agent [23.13653350521422]
We introduce AI$mathbf2$, a novel attack to manipulate the action plans of LLM-based applications.<n>It first collects action-aware knowledge from the victim application.<n>Based on such knowledge, the attacker can generate misleading input, which can mislead the LLM to generate harmful action plans.
arXiv Detail & Related papers (2024-12-14T12:11:26Z) - ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs [72.13489820420726]
ProSA is a framework designed to evaluate and comprehend prompt sensitivity in large language models.
Our study uncovers that prompt sensitivity fluctuates across datasets and models, with larger models exhibiting enhanced robustness.
arXiv Detail & Related papers (2024-10-16T09:38:13Z) - Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection [5.78117257526028]
Large language models (LLMs) are renowned for their exceptional capabilities, and applying to a wide range of applications.
This work focuses the impact of malicious prompt injection attacks which is one of most dangerous vulnerability on real LLMs applications.
It examines to apply various BERT (Bidirectional Representations from Transformers) like multilingual BERT, DistilBert for classifying malicious prompts from legitimate prompts.
arXiv Detail & Related papers (2024-09-20T08:48:51Z) - Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models [15.764672596793352]
We analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies.
We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks.
arXiv Detail & Related papers (2024-08-05T12:20:39Z) - Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models [13.225041704917905]
This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from large language models.
Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses.
arXiv Detail & Related papers (2024-07-22T06:04:29Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift [104.76588209308666]
This paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains.<n>We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness.<n>We propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas.
arXiv Detail & Related papers (2024-06-27T02:31:03Z) - Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications [8.51254190797079]
We introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks.
Our novel evaluation method assesses models under both defenseless and defended scenarios.
Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected.
arXiv Detail & Related papers (2024-06-10T18:57:22Z) - Defending Against Indirect Prompt Injection Attacks With Spotlighting [11.127479817618692]
In common applications, multiple inputs can be processed by concatenating them together into a single stream of text.
Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands.
We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input.
arXiv Detail & Related papers (2024-03-20T15:26:23Z) - Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models [9.688626139309013]
Retrieval-Augmented Generation is considered as a means to improve the trustworthiness of text generation from large language models.
In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers.
We introduce a novel optimization technique called Gradient Guided Prompt Perturbation.
arXiv Detail & Related papers (2024-02-11T12:25:41Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.<n>Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.<n>We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z) - On the Security Risks of Knowledge Graph Reasoning [71.64027889145261]
We systematize the security threats to KGR according to the adversary's objectives, knowledge, and attack vectors.
We present ROAR, a new class of attacks that instantiate a variety of such threats.
We explore potential countermeasures against ROAR, including filtering of potentially poisoning knowledge and training with adversarially augmented queries.
arXiv Detail & Related papers (2023-05-03T18:47:42Z) - Exhaustivity and anti-exhaustivity in the RSA framework: Testing the
effect of prior beliefs [68.8204255655161]
We focus on cases when sensitivity to priors leads to counterintuitive predictions of the Rational Speech Act (RSA) framework.
We show that in the baseline RSA model, under certain conditions, anti-exhaustive readings are predicted.
We find no anti-exhaustivity effects, but observed that message choice is sensitive to priors, as predicted by the RSA framework overall.
arXiv Detail & Related papers (2022-02-14T20:35:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.