Related papers: Say It Differently: Linguistic Styles as Jailbreak Vectors

URL: http://arxiv.org/abs/2511.10519v1
Date: Fri, 14 Nov 2025 01:55:59 GMT
Title: Say It Differently: Linguistic Styles as Jailbreak Vectors
Authors: Srikant Panda, Avinash Rai
Abstract summary: We study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles. Styles such as fearful, curious and compassionate are most effective, and contextualized rewrites outperform templated variants.
Score: 0.763334557068953
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious and compassionate are most effective, and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
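
The mitigation described in the abstract (a secondary LLM that strips stylistic cues before the target model sees the input) and the "percentage points" metric can be sketched as follows. This is a minimal illustration, not the authors' implementation: `NEUTRALIZE_INSTRUCTION`, `neutralize_then_answer`, `attack_success_rate`, and `delta_pp` are hypothetical names, the instruction text is an assumption (the paper's actual prompt is not given here), and both models are passed in as plain callables so that no particular LLM API is implied.

```python
from typing import Callable, List

# Hypothetical instruction for the secondary model (an assumption, not the paper's prompt).
NEUTRALIZE_INSTRUCTION = (
    "Rewrite the user message in plain, neutral language. "
    "Keep the literal request unchanged; remove emotional framing and appeals."
)


def neutralize_then_answer(
    user_input: str,
    rewriter: Callable[[str, str], str],  # secondary model: (instruction, text) -> rewritten text
    target: Callable[[str], str],         # target model: prompt -> response
) -> str:
    """Run the style-neutralization step, then query the target model."""
    neutral_input = rewriter(NEUTRALIZE_INSTRUCTION, user_input)
    return target(neutral_input)


def attack_success_rate(judgements: List[bool]) -> float:
    """Percentage of prompts judged to have elicited an unsafe response."""
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(judgements) / len(judgements)


def delta_pp(styled: List[bool], baseline: List[bool]) -> float:
    """Change in attack success rate, in percentage points (styled minus baseline)."""
    return attack_success_rate(styled) - attack_success_rate(baseline)
```

With stub callables in place of real models, `neutralize_then_answer("x", lambda i, t: t.upper(), lambda p: "got:" + p)` simply routes the rewritten text to the target. A figure such as "+57 percentage points" corresponds to `delta_pp` returning 57.0, that is, an absolute difference between two rates rather than a relative increase.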

Related papers

Imperceptible Jailbreaking against Large Language Models [107.76039200173528]
We introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses.
arXiv Detail & Related papers (2025-10-06T17:03:50Z)
Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is [8.214994509812724]
Large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms. We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models.
arXiv Detail & Related papers (2025-07-29T13:55:23Z)
InfoFlood: Jailbreaking Large Language Models with Information Overload [16.626185161464164]
We identify a new vulnerability in which excessive linguistic complexity can disrupt built-in safety mechanisms. We propose InfoFlood, a jailbreak attack that transforms malicious queries into complex, information-overloaded queries. We empirically validate the effectiveness of InfoFlood on four widely used LLMs: GPT-4o, GPT-3.5-turbo, Gemini 2.0, and LLaMA 3.1.
arXiv Detail & Related papers (2025-06-13T23:03:11Z)
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment [21.638179430757116]
Large language models (LLMs) can be prompted with specific styles, including in malicious queries. The impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. We propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data.
arXiv Detail & Related papers (2025-06-09T05:57:39Z)
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs [1.2891210250935148]
We introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success.
arXiv Detail & Related papers (2025-05-20T11:35:25Z)
CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models [18.06388944779541]
"Jailbreaking" is the use of large language models to trigger unintended behaviors. We propose a novel method to balance the jailbreak attack success rate with semantic coherence. Our method is superior to state-of-the-art baselines in attack effectiveness.
arXiv Detail & Related papers (2025-02-17T02:49:26Z)
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects an LLM's ethical decision boundary. Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models. AutoDAN can automatically generate stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models [28.37026309925163]
Large language models (LLMs) are designed to align with human values and generate safe text. Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models. This paper assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach.
arXiv Detail & Related papers (2023-07-17T13:49:52Z)
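
The imperceptible-jailbreak entry in the list above relies on Unicode variation selectors, which render as nothing on screen. As a defensive illustration only (not a method from that paper), the sketch below detects and removes code points in the two main variation selector blocks, U+FE00 to U+FE0F and U+E0100 to U+E01EF. The function names are hypothetical, and the sketch assumes that dropping these characters is acceptable for the input at hand; it does not cover other invisible characters.

```python
from typing import Tuple


def is_variation_selector(ch: str) -> bool:
    """True for code points in the Variation Selectors block or its Supplement."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> Tuple[str, int]:
    """Return the text without variation selectors and the number removed."""
    kept = [ch for ch in text if not is_variation_selector(ch)]
    return "".join(kept), len(text) - len(kept)


# Two invisible selectors appended to an otherwise ordinary string.
cleaned, removed = strip_variation_selectors("hello\ufe01\U000e0100")
print(cleaned, removed)
```

A non-zero count is itself a useful signal, since ordinary text rarely carries long runs of these characters. Note that U+FE0E and U+FE0F legitimately select text versus emoji presentation, so stripping them can change how some emoji render.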

This list of related papers is automatically generated from the titles and abstracts of the papers on this site.