Prompts have evil twins
- URL: http://arxiv.org/abs/2311.07064v3
- Date: Sun, 06 Oct 2024 23:53:34 GMT
- Title: Prompts have evil twins
- Authors: Rimon Melamed, Lucas H. McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, Enric Boix-Adsera
- Abstract summary: We call these prompts "evil twins" because they are obfuscated and uninterpretable (evil), but at the same time mimic the functionality of the original natural-language prompts (twins).
We find these prompts by solving a maximum-likelihood problem which has applications of independent interest.
- Score: 3.043247652016184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We discover that many natural-language prompts can be replaced by corresponding prompts that are unintelligible to humans but that provably elicit similar behavior in language models. We call these prompts "evil twins" because they are obfuscated and uninterpretable (evil), but at the same time mimic the functionality of the original natural-language prompts (twins). Remarkably, evil twins transfer between models. We find these prompts by solving a maximum-likelihood problem which has applications of independent interest.
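As a rough illustration of the maximum-likelihood problem mentioned in the abstract, the sketch below samples continuations from the model under the original prompt and then hill-climbs over the tokens of a candidate prompt so that those continuations become likely under it. The model choice, hyperparameters, and random-swap search are illustrative assumptions; the paper's actual optimizer is a more sophisticated gradient-guided discrete search, and none of the function names here come from the authors' code.

```python
# Hedged sketch of a maximum-likelihood "evil twin" search.
# Assumes a HuggingFace-style causal LM; everything below is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # assumption: any causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def continuation_nll(prompt_ids, cont_ids):
    """Mean negative log-likelihood of a continuation given a candidate prompt."""
    ids = torch.cat([prompt_ids, cont_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[0, :prompt_ids.numel()] = -100  # score only the continuation tokens
    return model(ids, labels=labels).loss

@torch.no_grad()
def evil_twin(original_prompt, twin_len=10, steps=200, n_cands=32, n_samples=8):
    # 1) Sample continuations from the model under the ORIGINAL prompt.
    orig = tok(original_prompt, return_tensors="pt").input_ids[0]
    conts = [model.generate(orig.unsqueeze(0), do_sample=True, max_new_tokens=32,
                            pad_token_id=tok.eos_token_id)[0][orig.numel():]
             for _ in range(n_samples)]
    # 2) Hill-climb over the candidate prompt's tokens: propose single-token
    #    swaps and keep any swap that makes the sampled continuations more likely.
    twin = torch.randint(len(tok), (twin_len,))
    best = sum(continuation_nll(twin, c) for c in conts)
    for _ in range(steps):
        pos = torch.randint(twin_len, (1,)).item()
        for cand in torch.randint(len(tok), (n_cands,)):
            trial = twin.clone()
            trial[pos] = cand
            loss = sum(continuation_nll(trial, c) for c in conts)
            if loss < best:
                best, twin = loss, trial
    return tok.decode(twin)

print(evil_twin("Write a short poem about the ocean."))
```

A prompt found this way is a "twin" only to the extent that the model's behavior conditioned on it matches its behavior under the original prompt, which is exactly what the likelihood objective measures.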
Related papers
- Evil twins are not that evil: Qualitative insights into machine-generated prompts [11.42957674201616]
We present the first thorough analysis of opaque machine-generated prompts, or autoprompts.
We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation.
Human experts can reliably identify the most influential tokens in an autoprompt a posteriori, suggesting these prompts are not entirely opaque.
arXiv Detail & Related papers (2024-12-11T06:22:44Z) - Models Can and Should Embrace the Communicative Nature of Human-Generated Math [13.491107542643839]
We argue that math data that models are trained on reflects not just idealized mathematical entities but rich communicative intentions.
We advocate for AI systems that learn from and represent the communicative intentions latent in human-generated math.
arXiv Detail & Related papers (2024-09-25T15:08:08Z) - ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts [33.774939728834156]
We propose a reinforcement learning formulation of the red-teaming task that allows us to discover prompts that trigger toxic outputs from a frozen defender.
We show that our policy is capable of generating likely (low-perplexity) prompts that also trigger toxicity from GPT-2, GPT-2 XL, and TinyLlama defenders.
arXiv Detail & Related papers (2024-07-12T17:33:34Z) - Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs [28.58726732808416]
We employ the Greedy Coordinate Gradient to craft prompts that compel large language models to generate coherent responses from seemingly nonsensical inputs.
We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima.
Notably, we find that guiding the model to generate harmful text is no more difficult than guiding it to generate benign text, suggesting a lack of alignment for out-of-distribution prompts.
arXiv Detail & Related papers (2024-04-26T02:29:26Z) - An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models [99.31449616860291]
Modern language models (LMs) can learn to perform new tasks in different ways.
In instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly.
In instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description.
arXiv Detail & Related papers (2024-04-03T19:31:56Z) - Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?" [88.59136033348378]
We study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment.
This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete.
We show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops.
arXiv Detail & Related papers (2023-11-08T19:07:10Z) - Effective Prompt Extraction from Language Models [70.00099540536382]
We present a framework for measuring the effectiveness of prompt extraction attacks.
In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability.
Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination.
arXiv Detail & Related papers (2023-07-13T16:15:08Z) - Demystifying Prompts in Language Models via Perplexity Estimation [109.59105230163041]
Performance of a prompt is coupled with the extent to which the model is familiar with the language it contains.
We show that the lower a prompt's perplexity, the better it performs the task (a minimal perplexity sketch appears after this list).
arXiv Detail & Related papers (2022-12-08T02:21:47Z) - Is the Elephant Flying? Resolving Ambiguities in Text-to-Image Generative Models [64.58271886337826]
We study ambiguities that arise in text-to-image generative models.
We propose a framework to mitigate ambiguities in the prompts given to the systems by soliciting clarifications from the user.
arXiv Detail & Related papers (2022-11-17T17:12:43Z) - Discovering the Hidden Vocabulary of DALLE-2 [96.19666636109729]
We find that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts.
For example, it seems that "Apoploe vesrreaitais" means birds and "Contarra ccetnxniams luryca tanniounons" (sometimes) means bugs or pests.
arXiv Detail & Related papers (2022-06-01T01:14:48Z)
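Related to the perplexity finding in the "Demystifying Prompts" entry above, here is a minimal sketch of how one might score candidate prompts by perplexity under a causal LM. The model name and the example prompts are assumptions for illustration, not that paper's setup.

```python
# Minimal sketch: score a prompt's perplexity under a causal LM, so that
# candidate prompts can be compared (lower perplexity is expected to
# correlate with better task performance). Illustrative assumptions only.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def prompt_perplexity(prompt: str) -> float:
    ids = tok(prompt, return_tensors="pt").input_ids
    # With labels == input_ids, the HF causal-LM loss is the mean per-token NLL.
    loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

for p in ["Translate the sentence into French:",
          "Frenchify sentence now translate do:"]:
    print(p, prompt_perplexity(p))
```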
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.