Prompt Stealing Attacks Against Large Language Models
- URL: http://arxiv.org/abs/2402.12959v1
- Date: Tue, 20 Feb 2024 12:25:26 GMT
- Title: Prompt Stealing Attacks Against Large Language Models
- Authors: Zeyang Sha and Yang Zhang
- Abstract summary: We propose a novel attack against large language models (LLMs)
Our proposed prompt stealing attack aims to steal these well-designed prompts based on the generated answers.
Our experimental results show the remarkable performance of our proposed attacks.
- Score: 5.421974542780941
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing reliance on large language models (LLMs) such as ChatGPT in
various fields emphasizes the importance of ``prompt engineering,'' a
technology to improve the quality of model outputs. With companies investing
significantly in expert prompt engineers and educational resources rising to
meet market demand, designing high-quality prompts has become an intriguing
challenge. In this paper, we propose a novel attack against LLMs, named prompt
stealing attacks. Our proposed prompt stealing attack aims to steal these
well-designed prompts based on the generated answers. The prompt stealing
attack contains two primary modules: the parameter extractor and the prompt
reconstruction. The goal of the parameter extractor is to figure out the
properties of the original prompts. We first observe that most prompts fall
into one of three categories: direct prompt, role-based prompt, and in-context
prompt. Our parameter extractor first tries to distinguish the type of prompts
based on the generated answers. Then, it can further predict which role or how
many contexts are used based on the types of prompts. Following the parameter
extractor, the prompt reconstructor can be used to reconstruct the original
prompts based on the generated answers and the extracted features. The final
goal of the prompt reconstructor is to generate the reversed prompts, which are
similar to the original prompts. Our experimental results show the remarkable
performance of our proposed attacks. Our proposed attacks add a new dimension
to the study of prompt engineering and call for more attention to the security
issues on LLMs.
Related papers
- Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models [15.764672596793352]
We analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies.
We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks.
arXiv Detail & Related papers (2024-08-05T12:20:39Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs [51.217126257318924]
We present a novel method that uses another Large Language Models, called the AdvPrompter, to generate human-readable adversarial prompts in seconds.
We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM.
The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response.
arXiv Detail & Related papers (2024-04-21T22:18:13Z) - PRSA: PRompt Stealing Attacks against Large Language Models [42.07328505384544]
"prompt as a service" has greatly enhanced the utility of large language models (LLMs)
We introduce a novel attack framework, PRSA, designed for prompt stealing attacks against LLMs.
PRSA mainly consists of two key phases: prompt mutation and prompt pruning.
arXiv Detail & Related papers (2024-02-29T14:30:28Z) - Defending LLMs against Jailbreaking Attacks via Backtranslation [61.878363293735624]
We propose a new method for defending LLMs against jailbreaking attacks by backtranslation''
The inferred prompt is called the backtranslated prompt which tends to reveal the actual intent of the original prompt.
We empirically demonstrate that our defense significantly outperforms the baselines.
arXiv Detail & Related papers (2024-02-26T10:03:33Z) - DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM
Jailbreakers [80.18953043605696]
We introduce an automatic prompt textbfDecomposition and textbfReconstruction framework for jailbreak textbfAttack (DrAttack)
DrAttack includes three key components: (a) Decomposition' of the original prompt into sub-prompts, (b) Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while
arXiv Detail & Related papers (2024-02-25T17:43:29Z) - Gradient-Based Language Model Red Teaming [9.972783485792885]
Red teaming is a strategy for identifying weaknesses in generative language models (LMs)
Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans.
We present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses.
arXiv Detail & Related papers (2024-01-30T01:19:25Z) - Effective Prompt Extraction from Language Models [70.00099540536382]
We present a framework for measuring the effectiveness of prompt extraction attacks.
In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability.
Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination.
arXiv Detail & Related papers (2023-07-13T16:15:08Z) - Prompt Stealing Attacks Against Text-to-Image Generation Models [27.7826502104361]
A trend of trading high-quality prompts on specialized marketplaces has emerged.
Successful prompt stealing attacks directly violate the intellectual property of prompt engineers.
We propose a simple yet effective prompt stealing attack, PromptStealer.
arXiv Detail & Related papers (2023-02-20T11:37:28Z) - Demystifying Prompts in Language Models via Perplexity Estimation [109.59105230163041]
Performance of a prompt is coupled with the extent to which the model is familiar with the language it contains.
We show that the lower the perplexity of the prompt is, the better the prompt is able to perform the task.
arXiv Detail & Related papers (2022-12-08T02:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.