Related papers: Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

URL: http://arxiv.org/abs/2511.15304v2
Date: Thu, 20 Nov 2025 03:34:44 GMT
Title: Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Authors: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi,
Abstract summary: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs)<n>Across 25 proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%.
Score: 1.5401871453629499
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

Related papers

Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents [0.0]
We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs)<n>RLM-JB treats detection as a procedure rather than a one-shot classification.<n>On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends.
arXiv Detail & Related papers (2026-02-18T15:07:09Z)
Adversarial versification in portuguese as a jailbreak operator in LLMs [0.0]
Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs.<n>The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, constitutes a critical gap.
arXiv Detail & Related papers (2025-12-17T11:55:45Z)
RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation [0.0]
This paper presents an automated jailbreak attack that converts a disallowed user query into a self reconstructing prompt.<n>We instantiate RoguePrompt against GPT 4o and evaluate it on 2 448 prompts that a production moderation system previously marked as strongly rejected.<n>Under an evaluation protocol that separates three security relevant outcomes bypass, reconstruction, and execution the attack attains 84.7 percent bypass, 80.2 percent reconstruction, and 71.5 percent full execution.
arXiv Detail & Related papers (2025-11-24T05:42:54Z)
Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations [0.0]
Multimodal large language models (MLLMs) have achieved remarkable progress, yet remain critically vulnerable to adversarial attacks.<n>We present a systematic study of multimodal jailbreaks targeting both vision-language and audio-language models.<n>Our evaluation spans 1,900 adversarial prompts across three high-risk safety categories.
arXiv Detail & Related papers (2025-10-23T05:16:33Z)
Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models [0.0]
Multimodal large language models (MLLMs) are increasingly used in real world applications, yet their safety under adversarial conditions remains underexplored.<n>This study evaluates the harmlessness of four leading MLLMs when exposed to adversarial prompts across text-only and multimodal formats.
arXiv Detail & Related papers (2025-09-18T22:51:06Z)
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses [4.706534644850809]
Two primary inference-phase threats are token-level and prompt-level jailbreaks.<n>We propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs.
arXiv Detail & Related papers (2025-06-27T07:26:33Z)
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities [76.9327488986162]
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images.<n>We exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction.<n>Our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B)
arXiv Detail & Related papers (2025-05-31T13:11:14Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs [8.91993614197627]
We introduce a novel framework for consolidating multi-turn adversarial jailbreak'' prompts into single-turn queries.<n>Our multi-turn-to-single-turn (M2S) methods systematically reformat multi-turn dialogues into structured single-turn prompts.<n>Remarkably, the single-turn prompts outperform the original multi-turn attacks by as much as 17.5 percentage points.
arXiv Detail & Related papers (2025-03-06T07:34:51Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.