Related papers: When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

URL: http://arxiv.org/abs/2506.07452v2
Date: Thu, 16 Oct 2025 06:50:23 GMT
Title: When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Authors: Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi,
Abstract summary: Large language models (LLMs) can be prompted with specific styles, including in malicious queries.<n>The impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear.<n>We propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data.
Score: 21.638179430757116
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating 32 LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM's relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs, six fine-tuning style settings, and two real-world instruction-tuning datasets, SafeStyle consistently outperforms baselines in maintaining LLM safety.

Related papers

Say It Differently: Linguistic Styles as Jailbreak Vectors [0.763334557068953]
We study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models.<n>We construct style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles.<n>Styles such as fearful, curious and compassionate are most effective and contextualized rewrites outperform templated variants.
arXiv Detail & Related papers (2025-11-13T17:24:38Z)
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment [24.364891513019444]
In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface.<n>We propose LARF, a Layer-Aware Representation Filtering method.<n> Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features.
arXiv Detail & Related papers (2025-07-24T17:59:24Z)
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region [13.962617572588393]
We show that template-anchored safety alignment is widespread across various aligned large language models (LLMs)<n>Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks.<n>We show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks.
arXiv Detail & Related papers (2025-02-19T18:42:45Z)
StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models [3.0308780927465135]
We present a series of structure transformation attacks on LLM alignment, where we encode natural language intent using diverse syntax spaces.<n>Our simplest attacks can achieve close to a 90% success rate, even on strict LLMs.<n>We develop a benchmark and evaluate existing safety-alignment defenses against it, showing that most of them fail with 100% ASR.
arXiv Detail & Related papers (2025-02-17T14:46:38Z)
RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars [57.6513924960128]
Alignment tuning is crucial for ensuring large language models (LLMs) behave ethically and helpfully.<n>This paper proposes a low-cost, tuning-free method using in-context learning (ICL) to enhance LLM alignment.
arXiv Detail & Related papers (2025-02-17T11:16:19Z)
CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models [18.06388944779541]
"jailbreaking" is the use of large language models to trigger unintended behaviors.<n>We propose a novel method to balance the jailbreak attack success rate with semantic coherence.<n>Our method is superior to state-of-the-art baselines in attack effectiveness.
arXiv Detail & Related papers (2025-02-17T02:49:26Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.<n>We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.<n>We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment [16.5939079098358]
In this paper, we study how simple random augmentations to the input prompt affect safety alignment effectiveness in state-of-the-art LLMs.<n>We show that low-resource and unsophisticated attackers can significantly improve their chances of bypassing alignment with just 25 random augmentations per prompt.
arXiv Detail & Related papers (2024-11-05T03:51:13Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [50.980446687774645]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability.<n>Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs.<n>It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
Data to Defense: The Role of Curation in Customizing LLMs Against Jailbreaking Attacks [13.381678819086469]
Large language models (LLMs) are widely adapted for downstream applications through fine-tuning, a process named customization.<n> malicious samples can compromise the robustness of LLMs and amplify harmful behaviors-an attack commonly referred to as jailbreaking.<n>We propose an adaptive data curation approach allowing any text to be curated to enhance its effectiveness in counteracting harmful samples during customization.
arXiv Detail & Related papers (2024-10-03T05:24:38Z)
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs [34.221522224051846]
We propose an adaptive position pre-fill jailbreak attack approach for executing jailbreak attacks on Large Language Models (LLMs) Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content. Our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches.
arXiv Detail & Related papers (2024-09-11T00:00:58Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)<n>Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.<n> Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
Instruction Backdoor Attacks Against Customized LLMs [37.92008159382539]
We propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs. Our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness. We propose two defense strategies and demonstrate their effectiveness in reducing such attacks.
arXiv Detail & Related papers (2024-02-14T13:47:35Z)
Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning. We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples. We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
Meta-Tuning LLMs to Leverage Lexical Knowledge for Generalizable Language Style Understanding [24.355564722047244]
We show that current large language models struggle to capture some language styles without fine-tuning. We investigate whether LLMs can be meta-trained based on representative lexicons to recognize new styles they have not been fine-tuned on.
arXiv Detail & Related papers (2023-05-24T00:17:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.