Related papers: Prompt Stealing Attacks Against Large Language Models

Prompt Stealing Attacks Against Large Language Models

URL: http://arxiv.org/abs/2402.12959v1
Date: Tue, 20 Feb 2024 12:25:26 GMT
Title: Prompt Stealing Attacks Against Large Language Models
Authors: Zeyang Sha and Yang Zhang
Abstract summary: We propose a novel attack against large language models (LLMs) Our proposed prompt stealing attack aims to steal these well-designed prompts based on the generated answers. Our experimental results show the remarkable performance of our proposed attacks.
Score: 5.421974542780941
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The increasing reliance on large language models (LLMs) such as ChatGPT in various fields emphasizes the importance of ``prompt engineering,'' a technology to improve the quality of model outputs. With companies investing significantly in expert prompt engineers and educational resources rising to meet market demand, designing high-quality prompts has become an intriguing challenge. In this paper, we propose a novel attack against LLMs, named prompt stealing attacks. Our proposed prompt stealing attack aims to steal these well-designed prompts based on the generated answers. The prompt stealing attack contains two primary modules: the parameter extractor and the prompt reconstruction. The goal of the parameter extractor is to figure out the properties of the original prompts. We first observe that most prompts fall into one of three categories: direct prompt, role-based prompt, and in-context prompt. Our parameter extractor first tries to distinguish the type of prompts based on the generated answers. Then, it can further predict which role or how many contexts are used based on the types of prompts. Following the parameter extractor, the prompt reconstructor can be used to reconstruct the original prompts based on the generated answers and the extracted features. The final goal of the prompt reconstructor is to generate the reversed prompts, which are similar to the original prompts. Our experimental results show the remarkable performance of our proposed attacks. Our proposed attacks add a new dimension to the study of prompt engineering and call for more attention to the security issues on LLMs.

Related papers

The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation [40.73687553764341]
Text-to-video (T2V) generative models, trained on large-scale datasets, are sensitive to input prompts. We introduce textbfRAPO, a novel textbfRetrieval-textbfAugmented textbfPrompt textbfOptimization framework.
arXiv Detail & Related papers (2025-04-16T03:33:25Z)
UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models [30.139590566956077]
Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks. We propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs.
arXiv Detail & Related papers (2025-02-18T18:59:00Z)
Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models [15.764672596793352]
We analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies. We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks.
arXiv Detail & Related papers (2024-08-05T12:20:39Z)
Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing. We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM. Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z)
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs [51.217126257318924]
We present a novel method that uses another Large Language Models, called the AdvPrompter, to generate human-readable adversarial prompts in seconds. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response.
arXiv Detail & Related papers (2024-04-21T22:18:13Z)
PRSA: PRompt Stealing Attacks against Large Language Models [42.07328505384544]
"prompt as a service" has greatly enhanced the utility of large language models (LLMs) We introduce a novel attack framework, PRSA, designed for prompt stealing attacks against LLMs. PRSA mainly consists of two key phases: prompt mutation and prompt pruning.
arXiv Detail & Related papers (2024-02-29T14:30:28Z)
Defending LLMs against Jailbreaking Attacks via Backtranslation [61.878363293735624]
We propose a new method for defending LLMs against jailbreaking attacks by backtranslation'' The inferred prompt is called the backtranslated prompt which tends to reveal the actual intent of the original prompt. We empirically demonstrate that our defense significantly outperforms the baselines.
arXiv Detail & Related papers (2024-02-26T10:03:33Z)
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers [74.7446827091938]
We introduce an automatic prompt textbfDecomposition and textbfReconstruction framework for jailbreak textbfAttack (DrAttack) DrAttack includes three key components: (a) Decomposition' of the original prompt into sub-prompts, (b) Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while
arXiv Detail & Related papers (2024-02-25T17:43:29Z)
Gradient-Based Language Model Red Teaming [9.972783485792885]
Red teaming is a strategy for identifying weaknesses in generative language models (LMs) Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans. We present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses.
arXiv Detail & Related papers (2024-01-30T01:19:25Z)
Effective Prompt Extraction from Language Models [70.00099540536382]
We present a framework for measuring the effectiveness of prompt extraction attacks. In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability. Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination.
arXiv Detail & Related papers (2023-07-13T16:15:08Z)
Prompt Stealing Attacks Against Text-to-Image Generation Models [27.7826502104361]
A trend of trading high-quality prompts on specialized marketplaces has emerged. Successful prompt stealing attacks directly violate the intellectual property of prompt engineers. We propose a simple yet effective prompt stealing attack, PromptStealer.
arXiv Detail & Related papers (2023-02-20T11:37:28Z)
Demystifying Prompts in Language Models via Perplexity Estimation [109.59105230163041]
Performance of a prompt is coupled with the extent to which the model is familiar with the language it contains. We show that the lower the perplexity of the prompt is, the better the prompt is able to perform the task.
arXiv Detail & Related papers (2022-12-08T02:21:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.