Related papers: Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

URL: http://arxiv.org/abs/2509.05471v1
Date: Fri, 05 Sep 2025 19:57:38 GMT
Title: Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models
Authors: Youjia Zheng, Mohammad Zandsalimy, Shanu Sushmita,
Abstract summary: camouflaged jailbreaking embeds malicious intent within seemingly benign language to evade existing safety mechanisms.<n>This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly vulnerable to a sophisticated form of adversarial prompting known as camouflaged jailbreaking. This method embeds malicious intent within seemingly benign language to evade existing safety mechanisms. Unlike overt attacks, these subtle prompts exploit contextual ambiguity and the flexible nature of language, posing significant challenges to current defense systems. This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods. We introduce a novel benchmark dataset, Camouflaged Jailbreak Prompts, containing 500 curated examples (400 harmful and 100 benign prompts) designed to rigorously stress-test LLM safety protocols. In addition, we propose a multi-faceted evaluation framework that measures harmfulness across seven dimensions: Safety Awareness, Technical Feasibility, Implementation Safeguards, Harmful Potential, Educational Value, Content Quality, and Compliance Score. Our findings reveal a stark contrast in LLM behavior: while models demonstrate high safety and content quality with benign inputs, they exhibit a significant decline in performance and safety when confronted with camouflaged jailbreak attempts. This disparity underscores a pervasive vulnerability, highlighting the urgent need for more nuanced and adaptive security strategies to ensure the responsible and robust deployment of LLMs in real-world applications.

Related papers

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness [32.47621091096285]
Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries.<n>In this paper, we introduce HILL, a novel jailbreak approach that transforms imperative harmful requests into learning-style questions.<n> Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness.
arXiv Detail & Related papers (2025-09-17T04:21:20Z)
GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing [13.267217024192535]
Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs)<n>We introduce GuardVal, a new evaluation protocol that generates and refines jailbreak prompts based on the defender LLM's state.<n>We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains.
arXiv Detail & Related papers (2025-07-10T13:15:20Z)
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism [123.54980913741828]
Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning.<n>MLLMs are susceptible to multimodal jailbreak attacks and hindering their safe deployment.<n>We propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers.
arXiv Detail & Related papers (2025-07-02T09:22:03Z)
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models [11.867355323884217]
We present a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments.<n>Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency.
arXiv Detail & Related papers (2025-06-20T05:30:25Z)
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks.<n>We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy [31.03584769307822]
We propose JOOD, a new Jailbreak framework via OOD-ifying inputs beyond the safety alignment.<n>Experiments across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs.
arXiv Detail & Related papers (2025-03-26T01:25:24Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.<n>We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)<n>We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.<n>We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.<n>We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)<n>Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.<n> Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.