Evil twins are not that evil: Qualitative insights into machine-generated prompts
- URL: http://arxiv.org/abs/2412.08127v3
- Date: Mon, 31 Mar 2025 16:33:26 GMT
- Title: Evil twins are not that evil: Qualitative insights into machine-generated prompts
- Authors: Nathanaël Carraz Rakotonirina, Corentin Kervadec, Francesca Franzon, Marco Baroni
- Abstract summary: We present the first thorough analysis of opaque machine-generated prompts, or autoprompts. We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation. Human experts can reliably identify the most influential tokens in an autoprompt a posteriori, suggesting these prompts are not entirely opaque.
- Score: 11.42957674201616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been widely observed that language models (LMs) respond in predictable ways to algorithmically generated prompts that are seemingly unintelligible. This is both a sign that we lack a full understanding of how LMs work and a practical challenge, because opaqueness can be exploited for harmful uses of LMs, such as jailbreaking. We present the first thorough analysis of opaque machine-generated prompts, or autoprompts, pertaining to 6 LMs of different sizes and families. We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation. A small but consistent proportion of the preceding tokens are prunable, probably appearing in the prompt as a by-product of the fact that the optimization process fixes the number of tokens. The remaining tokens fall into two categories: filler tokens, which can be replaced with semantically unrelated substitutes, and keywords, which tend to have at least a loose semantic relation with the generation, although they do not engage in well-formed syntactic relations with it. Additionally, human experts can reliably identify the most influential tokens in an autoprompt a posteriori, suggesting these prompts are not entirely opaque. Finally, some of the ablations we applied to autoprompts yield similar effects in natural language inputs, suggesting that autoprompts emerge naturally from the way LMs process linguistic inputs in general.
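To make the pruning analysis concrete, here is a minimal sketch of the kind of ablation loop it implies: delete each autoprompt token in turn and test whether the model's greedy continuation is preserved. The model name and the autoprompt string below are placeholders, not the paper's materials.

```python
# Sketch: token-level pruning ablation for an autoprompt (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper covers 6 LMs of different sizes and families
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def greedy_continuation(ids, n_new=10):
    out = model.generate(ids, max_new_tokens=n_new, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    return out[0, ids.shape[1]:].tolist()

autoprompt = "irst Responder dipole Besides Screenshot certainly"  # hypothetical autoprompt
ids = tok(autoprompt, return_tensors="pt").input_ids
reference = greedy_continuation(ids)

# A token counts as "prunable" if removing it leaves the continuation unchanged.
for i in range(ids.shape[1]):
    ablated = torch.cat([ids[:, :i], ids[:, i + 1:]], dim=1)
    same = greedy_continuation(ablated) == reference
    print(f"token {i} ({tok.decode(ids[0, i])!r}): prunable={same}")
```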
Related papers
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive.
Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
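As a rough illustration of the rejection idea (not the paper's actual adaptive weighted algorithm), one can sample from the unconstrained distribution and evaluate the constraint only on the sampled candidate, instead of masking the whole vocabulary at every step:

```python
# Sketch: constraint checked only on sampled candidates, not the whole vocabulary.
import torch

def sample_with_rejection(logits, allowed, max_tries=100):
    """logits: [vocab]; allowed(token_id) -> bool, assumed expensive to call."""
    probs = torch.softmax(logits, dim=-1)
    for _ in range(max_tries):
        tok = torch.multinomial(probs, 1).item()
        if allowed(tok):            # constraint evaluated on this one candidate only
            return tok
        probs[tok] = 0.0            # rejected: rule it out and renormalize
        probs = probs / probs.sum()
    raise RuntimeError("no allowed token found")
```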
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Prεεmpt: Sanitizing Sensitive Prompts for LLMs [49.84954577111077]
Prεεmpt is a novel system that implements a prompt sanitizer.
We show that Prεεmpt is a practical method to achieve meaningful privacy guarantees.
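The summary does not spell out the mechanism, so the following is only a toy redaction sketch of what a client-side prompt sanitizer can look like: mask sensitive spans before the prompt leaves the client and restore them in the reply. The regex and placeholder scheme are invented for illustration; Prεεmpt's actual privacy guarantees come from a more principled mechanism.

```python
# Toy sketch: mask sensitive spans before sending a prompt, restore them in the reply.
import re

SENSITIVE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # assumed SSN-like pattern, for illustration

def sanitize(prompt):
    mapping = {}
    def repl(match):
        key = f"<REDACTED_{len(mapping)}>"
        mapping[key] = match.group(0)
        return key
    return SENSITIVE.sub(repl, prompt), mapping

def restore(text, mapping):
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

clean, mapping = sanitize("My SSN is 123-45-6789, what benefits apply?")
print(clean)  # "My SSN is <REDACTED_0>, what benefits apply?"
# later: restore(llm_reply, mapping)
```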
arXiv Detail & Related papers (2025-04-07T14:52:40Z) - Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models [12.866627382118768]
We study the mechanisms underlying garden path sentence processing in LMs. We find that while many important features relate to syntactic structure, some reflect syntactically irrelevant properties. While most active features correspond to one reading of the sentence, some correspond to the other, suggesting that LMs assign weight to both possibilities simultaneously.
arXiv Detail & Related papers (2024-12-06T18:54:54Z) - Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers [32.274579719726546]
Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens.
Recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors.
We investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization.
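A small sketch of how such incomplete tokens can be enumerated, assuming the GPT-2 byte-to-unicode map exposed by the transformers library: a token is undecodable if its raw bytes are not valid UTF-8 on their own.

```python
# Sketch: enumerate incomplete (undecodable) tokens in a byte-level BPE vocabulary.
from transformers import AutoTokenizer
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

tok = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE
byte_decoder = {c: b for b, c in bytes_to_unicode().items()}  # unicode char -> raw byte

incomplete = []
for token, idx in tok.get_vocab().items():
    raw = bytes(byte_decoder[c] for c in token if c in byte_decoder)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        incomplete.append((idx, token))  # stray bytes: not valid UTF-8 on its own

print(f"{len(incomplete)} undecodable tokens, e.g. {incomplete[:5]}")
```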
arXiv Detail & Related papers (2024-10-31T07:19:44Z) - SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs [28.58726732808416]
We employ the Greedy Coordinate Gradient to craft prompts that compel large language models to generate coherent responses from seemingly nonsensical inputs.
We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima.
Notably, we find that guiding the model to generate harmful text is no more difficult than guiding it to generate benign text, suggesting a lack of alignment for out-of-distribution prompts.
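For context, one simplified Greedy Coordinate Gradient (GCG) step might look as follows: gradients with respect to the one-hot prompt tokens rank candidate substitutions at each position, and the best-scoring swap is kept. Real GCG samples candidate batches and iterates many times; this sketch, with gpt2 as a placeholder model, is illustrative only. `prompt_ids` and `target_ids` are [1, n] token-id tensors.

```python
# Sketch: one simplified GCG step (illustrative, not the original algorithm).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight     # [vocab, dim]

def target_loss(prompt_ids, target_ids):
    ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(ids).logits
    pred = logits[:, prompt_ids.shape[1] - 1:-1, :]  # positions predicting the target
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1))

def gcg_step(prompt_ids, target_ids, topk=8):
    onehot = F.one_hot(prompt_ids[0], emb.shape[0]).float()
    onehot.requires_grad_(True)
    inputs = (onehot @ emb).unsqueeze(0)              # differentiable embedding lookup
    tgt = model.get_input_embeddings()(target_ids)
    logits = model(inputs_embeds=torch.cat([inputs, tgt], dim=1)).logits
    pred = logits[:, prompt_ids.shape[1] - 1:-1, :]
    F.cross_entropy(pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1)).backward()
    best_ids, best_loss = prompt_ids, target_loss(prompt_ids, target_ids).item()
    for pos in range(prompt_ids.shape[1]):
        for cand in (-onehot.grad[pos]).topk(topk).indices:  # promising substitutions
            trial = prompt_ids.clone()
            trial[0, pos] = cand
            with torch.no_grad():
                loss = target_loss(trial, target_ids).item()
            if loss < best_loss:
                best_ids, best_loss = trial, loss
    return best_ids, best_loss
```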
arXiv Detail & Related papers (2024-04-26T02:29:26Z) - Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks.
We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks.
We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models [99.31449616860291]
Modern language models (LMs) can learn to perform new tasks in different ways.
In instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly.
In instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description.
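A minimal illustration of the instruction-inference setup; the prompt wording here is ours, not the paper's:

```python
# Sketch: elicit a task description from in-context examples.
examples = [("cat", "tac"), ("drive", "evird"), ("lemon", "nomel")]
shots = "\n".join(f"Input: {x} -> Output: {y}" for x, y in examples)
prompt = shots + "\nDescribe, in one sentence, the rule that maps inputs to outputs:"
print(prompt)  # feed to any instruction-tuned LM
```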
arXiv Detail & Related papers (2024-04-03T19:31:56Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
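A common perplexity-based baseline for this setting, sketched below, scores each token by its negative log-likelihood under a reference LM and flags high-surprisal spans; the paper's detector additionally uses contextual information. The model, threshold, and example are arbitrary.

```python
# Sketch: per-token surprisal under a reference LM as a crude adversarial-token flag.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def token_surprisal(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -logprobs.gather(2, ids[:, 1:, None]).squeeze(-1)[0]
    return list(zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), nll.tolist()))

for token, s in token_surprisal("Please summarize this memo !! describing similarlyNow"):
    flag = "  <-- suspicious" if s > 8.0 else ""   # arbitrary threshold
    print(f"{token:>15} {s:6.2f}{flag}")
```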
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation [60.77990074569754]
We present a computation-efficient framework that steers a frozen Pre-Trained Language Model towards more commonsensical generation.
Specifically, we first construct a reference-free evaluator that assigns each sentence a commonsense score.
We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head.
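As a schematic of the scorer-as-oracle idea (NADO itself trains an auxiliary head rather than reranking), one could use the commonsense scorer to pick among candidate generations; `commonsense_score` below is a dummy stand-in for the paper's learned evaluator.

```python
# Schematic: a sentence-level scorer as an oracle for choosing among candidates.
def commonsense_score(sentence: str) -> float:
    # placeholder heuristic standing in for the learned reference-free evaluator
    return -abs(len(sentence.split()) - 12)

def pick_most_commonsensical(candidates):
    return max(candidates, key=commonsense_score)

print(pick_most_commonsensical([
    "The chef baked the oven in a cake.",
    "The chef baked a cake in the oven for tonight's party at home.",
]))
```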
arXiv Detail & Related papers (2023-10-25T23:32:12Z) - AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts via a carefully designed hierarchical genetic algorithm, sketched generically below.
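For orientation, a generic genetic loop over prompt strings combines selection by an attack score with crossover and mutation. AutoDAN's hierarchical GA is considerably more elaborate; `attack_score` and `mutate` here are stand-ins, and a population of at least four strings longer than one character is assumed.

```python
# Sketch: a generic genetic loop over prompt strings (not AutoDAN's actual GA).
import random

def evolve(population, attack_score, mutate, generations=10):
    for _ in range(generations):
        scored = sorted(population, key=attack_score, reverse=True)
        parents = scored[:len(scored) // 2]             # selection
        children = []
        for _ in range(len(population) - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, min(len(a), len(b)))
            children.append(mutate(a[:cut] + b[cut:]))  # crossover + mutation
        population = parents + children
    return max(population, key=attack_score)
```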
arXiv Detail & Related papers (2023-10-03T19:44:37Z) - Extend and Explain: Interpreting Very Long Language Models [0.0]
We introduce a novel Masked Sampling Procedure (MSP) to identify the text blocks that contribute to a prediction.
MSP identifies 1.7x more clinically informative text blocks than the previous state-of-the-art, runs up to 100x faster, and is tractable for generating important phrase pairs.
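A bare-bones occlusion analogue of this idea, with `classify` as a stand-in for the prediction model (MSP itself samples many masks jointly rather than masking one block at a time):

```python
# Sketch: occlusion-style importance — mask one block, record the probability drop.
def block_importance(blocks, classify):
    base = classify(" ".join(blocks))
    scores = []
    for i in range(len(blocks)):
        masked = blocks[:i] + ["[MASK]"] + blocks[i + 1:]
        scores.append(base - classify(" ".join(masked)))
    return scores  # larger drop = more informative block
```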
arXiv Detail & Related papers (2022-09-02T17:15:43Z) - Position-based Prompting for Health Outcome Generation [0.0]
We explore the idea of using a position-attention mechanism to capture the positional information of each word in a prompt relative to the mask to be filled.
Our approach consistently outperforms a baseline in which the default mask language model (MLM) representation is used to predict masked tokens.
arXiv Detail & Related papers (2022-03-30T16:44:04Z) - Reflective Decoding: Beyond Unidirectional Generation with Off-the-Shelf Language Models [63.808843089941405]
Large pretrained Language Models (LMs) generate text with remarkable quality, but only sequentially from left to right.
We present Reflective Decoding, a novel unsupervised algorithm that allows for direct application of unidirectional LMs to non-sequential tasks.
Our 2-step approach requires no supervision or even parallel corpora, only two off-the-shelf pretrained LMs in opposite directions.
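Schematically, the two steps might be wired as below, with `generate_r2l` and `generate_l2r` as stand-ins for the two off-the-shelf LMs; this is one reading of the abstract, not the paper's algorithm.

```python
# Schematic: a right-to-left LM proposes contexts preceding the input;
# a left-to-right LM then generates fresh text from those contexts.
def reflective_decode(text, generate_r2l, generate_l2r, n=5):
    contexts = [generate_r2l(text) for _ in range(n)]  # step 1: text -> left contexts
    return [generate_l2r(ctx) for ctx in contexts]     # step 2: contexts -> new text
```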
arXiv Detail & Related papers (2020-10-16T18:02:07Z)