Related papers: Adversarial versification in portuguese as a jailbreak operator in LLMs

URL: http://arxiv.org/abs/2512.15353v1
Date: Wed, 17 Dec 2025 11:55:45 GMT
Title: Adversarial versification in portuguese as a jailbreak operator in LLMs
Authors: Joao Queiroz
Abstract summary: Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, constitutes a critical gap.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study 'Adversarial poetry as a universal single-turn jailbreak mechanism in large language models' demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18 times more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.

Related papers

Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents [0.0]
We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs). RLM-JB treats detection as a procedure rather than a one-shot classification. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends.
arXiv Detail & Related papers (2026-02-18T15:07:09Z)
STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model [71.35577462669856]
We propose a robust, provably secure linguistic steganography with diffusion language models (DLMs). We introduce error correction strategies, including pseudo-random error correction and neighborhood search correction, during steganographic extraction.
arXiv Detail & Related papers (2026-01-21T08:58:12Z)
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models [1.5401871453629499]
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%.
arXiv Detail & Related papers (2025-11-19T10:14:08Z)
Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks [0.31984926651189866]
Sentra-Guard is a real-time modular defense system for large language models (LLMs). The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts. It identifies adversarial prompts in both direct and obfuscated attack vectors.
arXiv Detail & Related papers (2025-10-26T11:19:47Z)
Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models [0.0]
Camouflaged jailbreaking embeds malicious intent within seemingly benign language to evade existing safety mechanisms. This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods.
arXiv Detail & Related papers (2025-09-05T19:57:38Z)
HAMSA: Hijacking Aligned Compact Models via Stealthy Automation [3.7898376145698744]
Large Language Models (LLMs) are susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. We evaluate our method on benchmarks in English (In-The-Wild Jailbreak Prompts on LLMs), and a newly curated Arabic one derived from In-The-Wild Jailbreak Prompts on LLMs and annotated by native Arabic linguists.
arXiv Detail & Related papers (2025-08-22T15:57:57Z)
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities [76.9327488986162]
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. We exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction. Our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B)
arXiv Detail & Related papers (2025-05-31T13:11:14Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary. Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning [63.481446315733145]
Cross-lingual backdoor attacks against multilingual large language models (LLMs) are under-explored. Our research focuses on how poisoning the instruction-tuning data for one or two languages can affect the outputs for languages whose instruction-tuning data were not poisoned. Our method exhibits remarkable efficacy in models like mT5 and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages.
arXiv Detail & Related papers (2024-04-30T14:43:57Z)
