When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques
- URL: http://arxiv.org/abs/2505.16765v1
- Date: Thu, 22 May 2025 15:07:34 GMT
- Title: When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques
- Authors: Jianing Geng, Biao Yi, Zekun Fei, Tongxi Wu, Lihai Nie, Zheli Liu
- Abstract summary: Jailbreak attacks pose a serious threat to large language models (LLMs). This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth. We propose StegoAttack, a stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text.
- Score: 5.2431999629987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreak attacks pose a serious threat to large language models (LLMs) by bypassing built-in safety mechanisms and leading to harmful outputs. Studying these attacks is crucial for identifying vulnerabilities and improving model security. This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth. We find that existing attacks struggle to simultaneously achieve toxic stealth (concealing toxic content) and linguistic stealth (maintaining linguistic naturalness). Motivated by this, we propose StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text. The attack then prompts the LLM to extract the hidden query and respond in an encrypted manner. This approach effectively hides malicious intent while preserving naturalness, allowing it to evade both built-in and external safety mechanisms. We evaluate StegoAttack on four safety-aligned LLMs from major providers, benchmarking against eight state-of-the-art methods. StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%. Its ASR drops by less than 1% even under external detection (e.g., Llama Guard). Moreover, it attains the optimal comprehensive scores on stealth detection metrics, demonstrating both high efficacy and exceptional stealth capabilities. The code is available at https://anonymous.4open.science/r/StegoAttack-Jail66
Related papers
- Bag of Tricks for Subverting Reasoning-based Safety Guardrails [62.139297207938036]
We present a bag of jailbreak methods that subvert reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization.
arXiv Detail & Related papers (2025-10-13T16:16:44Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking [3.718606661938873]
We propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs.
arXiv Detail & Related papers (2025-07-29T07:13:56Z) - ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [64.32925552574115]
ARMOR is a large language model that analyzes jailbreak strategies and extracts the core intent. ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks.
arXiv Detail & Related papers (2025-07-14T09:05:54Z) - Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs [61.916827858666906]
We reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. We propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling (see the temperature-scaling sketch after this list).
arXiv Detail & Related papers (2025-07-06T12:19:04Z) - Through the Stealth Lens: Rethinking Attacks and Defenses in RAG [21.420202472493425]
We show that RAG systems are vulnerable to poisoned passages injected into the retrieved set, even at low corruption rates. We also show that such attacks are not designed for stealth, which allows detection and mitigation.
arXiv Detail & Related papers (2025-06-04T19:15:09Z) - Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into harmful question-answer pairs. We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z) - Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models [6.049325292667881]
We present a systematic black-box security analysis of prefill-level jailbreak attacks. Our experiments show that prefill-level attacks achieve high success rates, with adaptive methods exceeding 99% on several models. We find that a detection method focusing on the manipulative relationship between the prompt and the prefill is more effective.
arXiv Detail & Related papers (2025-04-28T07:38:43Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text. We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples. We propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models [8.423787598133972]
This paper uncovers a critical vulnerability in the function calling process of large language models (LLMs). We introduce a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs.
arXiv Detail & Related papers (2024-07-25T10:09:21Z) - LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [20.154877919740322]
Existing jailbreak methods suffer from two main limitations: reliance on complicated prompt engineering and iterative optimization. We propose an efficient jailbreak attack method, Analyzing-based Jailbreak (ABJ), which leverages the advanced reasoning capability of LLMs to autonomously generate harmful content.
arXiv Detail & Related papers (2024-07-23T06:14:41Z) - JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks [34.95274579737075]
JailGuard is a universal detection framework for prompt-based attacks across text and image modalities. It operates on the principle that attack inputs are inherently less robust than benign ones (see the detection sketch after this list). It achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%.
arXiv Detail & Related papers (2023-12-17T17:02:14Z)
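For the Attention Sharpening defense referenced above, the abstract only names the mechanism: sharpening the attention score distribution via temperature scaling. The following is a minimal sketch of that general idea, not the paper's implementation; the function name, the temperature value, and the placement of the scaling step are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sharpened_attention(q, k, v, temperature=0.5):
    """Scaled dot-product attention with an extra temperature on the attention
    logits. A temperature < 1 sharpens the softmax distribution, concentrating
    attention mass on the highest-scoring tokens (the intuition behind the
    Attention Sharpening defense); temperature = 1 recovers standard attention.
    q, k, v have shape (batch, heads, seq, dim)."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5  # standard scaled dot-product logits
    logits = logits / temperature                # hypothetical sharpening step
    weights = F.softmax(logits, dim=-1)
    return weights @ v
```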
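The JailGuard entry likewise states only the principle that attack inputs are less robust to perturbation than benign ones. The sketch below shows one way such a check could look: perturb the prompt a few times, query the model, and flag the input if the responses diverge strongly. The mutation strategy, divergence measure, threshold, and the `query_model` callable are all illustrative assumptions, not JailGuard's actual pipeline.

```python
import random
from typing import Callable, List

def mutate(prompt: str, drop_rate: float = 0.1) -> str:
    """Toy perturbation: randomly drop a fraction of whitespace-separated tokens."""
    tokens = prompt.split()
    kept = [t for t in tokens if random.random() > drop_rate]
    return " ".join(kept) if kept else prompt

def response_divergence(responses: List[str]) -> float:
    """Crude divergence score: 1 minus the mean pairwise Jaccard similarity
    of the responses' token sets."""
    sets = [set(r.lower().split()) for r in responses]
    sims = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            sims.append(len(sets[i] & sets[j]) / len(union) if union else 1.0)
    return 1.0 - (sum(sims) / len(sims) if sims else 1.0)

def looks_like_attack(prompt: str,
                      query_model: Callable[[str], str],
                      n_variants: int = 4,
                      threshold: float = 0.6) -> bool:
    """Flag the prompt if responses to its perturbed variants diverge strongly,
    on the assumption that adversarial prompts are brittle under mutation."""
    responses = [query_model(mutate(prompt)) for _ in range(n_variants)]
    return response_divergence(responses) > threshold
```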