PRSA: PRompt Stealing Attacks against Large Language Models
- URL: http://arxiv.org/abs/2402.19200v2
- Date: Sat, 8 Jun 2024 03:43:12 GMT
- Title: PRSA: PRompt Stealing Attacks against Large Language Models
- Authors: Yong Yang, Changjiang Li, Yi Jiang, Xi Chen, Haoyu Wang, Xuhong Zhang, Zonghui Wang, Shouling Ji,
- Abstract summary: "prompt as a service" has greatly enhanced the utility of large language models (LLMs)
We introduce a novel attack framework, PRSA, designed for prompt stealing attacks against LLMs.
PRSA mainly consists of two key phases: prompt mutation and prompt pruning.
- Score: 42.07328505384544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, "prompt as a service" has greatly enhanced the utility of large language models (LLMs) by enabling them to perform various downstream tasks efficiently without fine-tuning. This has also increased the commercial value of prompts. However, the potential risk of leakage in these commercialized prompts remains largely underexplored. In this paper, we introduce a novel attack framework, PRSA, designed for prompt stealing attacks against LLMs. The main idea of PRSA is to infer the intent behind a prompt by analyzing its input-output content, enabling the generation of a surrogate prompt that replicates the original's functionality. Specifically, PRSA mainly consists of two key phases: prompt mutation and prompt pruning. In the mutation phase, we propose a prompt attention algorithm based on output difference. The algorithm facilitates the generation of effective surrogate prompts by learning key factors that influence the accurate inference of prompt intent. During the pruning phase, we employ a two-step related word identification strategy to detect and mask words that are highly related to the input, thus improving the generalizability of the surrogate prompts. We verify the actual threat of PRSA through evaluation in both real-world settings, non-interactive and interactive prompt services. The results strongly confirm the PRSA's effectiveness and generalizability. We have reported these findings to prompt service providers and actively collaborate with them to implement defensive measures.
Related papers
- ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs [72.13489820420726]
ProSA is a framework designed to evaluate and comprehend prompt sensitivity in large language models.
Our study uncovers that prompt sensitivity fluctuates across datasets and models, with larger models exhibiting enhanced robustness.
arXiv Detail & Related papers (2024-10-16T09:38:13Z) - Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection [5.78117257526028]
Large language models (LLMs) are renowned for their exceptional capabilities, and applying to a wide range of applications.
This work focuses the impact of malicious prompt injection attacks which is one of most dangerous vulnerability on real LLMs applications.
It examines to apply various BERT (Bidirectional Representations from Transformers) like multilingual BERT, DistilBert for classifying malicious prompts from legitimate prompts.
arXiv Detail & Related papers (2024-09-20T08:48:51Z) - Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models [15.764672596793352]
We analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies.
We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks.
arXiv Detail & Related papers (2024-08-05T12:20:39Z) - Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models [13.225041704917905]
This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from large language models.
Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses.
arXiv Detail & Related papers (2024-07-22T06:04:29Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - Defending Against Indirect Prompt Injection Attacks With Spotlighting [11.127479817618692]
In common applications, multiple inputs can be processed by concatenating them together into a single stream of text.
Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands.
We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input.
arXiv Detail & Related papers (2024-03-20T15:26:23Z) - Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models [9.688626139309013]
Retrieval-Augmented Generation is considered as a means to improve the trustworthiness of text generation from large language models.
In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers.
We introduce a novel optimization technique called Gradient Guided Prompt Perturbation.
arXiv Detail & Related papers (2024-02-11T12:25:41Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z) - Exhaustivity and anti-exhaustivity in the RSA framework: Testing the
effect of prior beliefs [68.8204255655161]
We focus on cases when sensitivity to priors leads to counterintuitive predictions of the Rational Speech Act (RSA) framework.
We show that in the baseline RSA model, under certain conditions, anti-exhaustive readings are predicted.
We find no anti-exhaustivity effects, but observed that message choice is sensitive to priors, as predicted by the RSA framework overall.
arXiv Detail & Related papers (2022-02-14T20:35:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.