Related papers: Effective Prompt Extraction from Language Models

Effective Prompt Extraction from Language Models

URL: http://arxiv.org/abs/2307.06865v2
Date: Sat, 17 Feb 2024 23:44:05 GMT
Title: Effective Prompt Extraction from Language Models
Authors: Yiming Zhang and Nicholas Carlini and Daphne Ippolito
Abstract summary: We present a framework for measuring the effectiveness of prompt extraction attacks. In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability. Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination.
Score: 78.67410369494623
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The text generated by large language models is commonly controlled by prompting, where a prompt prepended to a user's query guides the model's output. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold. However, anecdotal reports have shown adversarial users employing prompt extraction attacks to recover these prompts. In this paper, we present a framework for systematically measuring the effectiveness of these attacks. In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability. Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination. Prompt extraction experiments on real systems such as Bing Chat and ChatGPT suggest that system prompts can be revealed by an adversary despite existing defenses in place.

Related papers

Has My System Prompt Been Used? Large Language Model Prompt Membership Inference [56.20586932251531]
We develop Prompt Detective, a statistical method to reliably determine whether a given system prompt was used by a third-party language model. Our work reveals that even minor changes in system prompts manifest in distinct response distributions, enabling us to verify prompt usage with statistical significance.
arXiv Detail & Related papers (2025-02-14T08:00:42Z)
Remote Timing Attacks on Efficient Language Model Inference [63.79839291641793]
We show it is possible to exploit timing differences to mount a timing attack. We show how it is possible to learn the topic of a user's conversation with 90%+ precision. An adversary can leverage a boosting attack to recover PII placed in messages for open source systems.
arXiv Detail & Related papers (2024-10-22T16:51:36Z)
Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models [15.764672596793352]
We analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies. We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks.
arXiv Detail & Related papers (2024-08-05T12:20:39Z)
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers [74.7446827091938]
We introduce an automatic prompt textbfDecomposition and textbfReconstruction framework for jailbreak textbfAttack (DrAttack) DrAttack includes three key components: (a) Decomposition' of the original prompt into sub-prompts, (b) Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while
arXiv Detail & Related papers (2024-02-25T17:43:29Z)
Prompt Stealing Attacks Against Large Language Models [5.421974542780941]
We propose a novel attack against large language models (LLMs) Our proposed prompt stealing attack aims to steal these well-designed prompts based on the generated answers. Our experimental results show the remarkable performance of our proposed attacks.
arXiv Detail & Related papers (2024-02-20T12:25:26Z)
Can discrete information extraction prompts generalize across language models? [36.85568212975316]
We study whether automatically-induced prompts can also be used, out-of-the-box, to probe other language models for the same information. We introduce a way to induce prompts by mixing language models at training time that results in prompts that generalize well across models.
arXiv Detail & Related papers (2023-02-20T09:56:51Z)
Toward Human Readable Prompt Tuning: Kubrick's The Shining is a good movie, and a good prompt too? [84.91689960190054]
Large language models can perform new tasks in a zero-shot fashion, given natural language prompts. It is underexplored what factors make the prompts effective, especially when the prompts are natural language.
arXiv Detail & Related papers (2022-12-20T18:47:13Z)
Optimizing Prompts for Text-to-Image Generation [97.61295501273288]
Well-designed prompts can guide text-to-image models to generate amazing images. But the performant prompts are often model-specific and misaligned with user input. We propose prompt adaptation, a framework that automatically adapts original user input to model-preferred prompts.
arXiv Detail & Related papers (2022-12-19T16:50:41Z)
Demystifying Prompts in Language Models via Perplexity Estimation [109.59105230163041]
Performance of a prompt is coupled with the extent to which the model is familiar with the language it contains. We show that the lower the perplexity of the prompt is, the better the prompt is able to perform the task.
arXiv Detail & Related papers (2022-12-08T02:21:47Z)
Ignore Previous Prompt: Attack Techniques For Language Models [0.0]
We propose PromptInject, a framework for mask-based adversarial prompt composition. We show how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs.
arXiv Detail & Related papers (2022-11-17T13:43:20Z)
Do Prompts Solve NLP Tasks Using Natural Language? [18.611748762251494]
In this work, we empirically compare the three types of prompts under both few-shot and fully-supervised settings. Our experimental results show that schema prompts are the most effective in general.
arXiv Detail & Related papers (2022-03-02T07:20:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.