Related papers: Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

URL: http://arxiv.org/abs/2505.18556v1
Date: Sat, 24 May 2025 06:47:32 GMT
Title: Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
Authors: Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang,
Abstract summary: We investigate the vulnerability of intent-aware guardrails and demonstrate that large language models exhibit implicit intent detection capabilities.<n>We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives.<n>Our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses.
Score: 18.37303422539757
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.

Related papers

PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking [3.718606661938873]
We propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security.<n>Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets.<n>Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs.
arXiv Detail & Related papers (2025-07-29T07:13:56Z)
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [49.47193675702453]
Large Language Models (LLMs) have demonstrated remarkable generative capabilities.<n>LLMs remain vulnerable to malicious instructions that can bypass safety constraints.<n>We propose a reasoning-based safety alignment framework, ARMOR, that replaces the ad-hoc chains of thought reasoning process with human-aligned, structured one.
arXiv Detail & Related papers (2025-07-14T09:05:54Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking [15.953888359667497]
jailbreak attacks based on prompt engineering have become a major safety threat.<n>This study introduces the concept of Defense Threshold Decay (DTD), revealing the potential safety impact caused by LLMs' benign generation.<n>We propose the Sugar-Coated Poison attack paradigm, which uses a "semantic reversal" strategy to craft benign inputs that are opposite in meaning to malicious intent.
arXiv Detail & Related papers (2025-04-08T03:57:09Z)
Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security. We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze. Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z)
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs [34.221522224051846]
We propose an adaptive position pre-fill jailbreak attack approach for executing jailbreak attacks on Large Language Models (LLMs) Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content. Our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches.
arXiv Detail & Related papers (2024-09-11T00:00:58Z)
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [20.154877919740322]
Existing jailbreak methods suffer from two main limitations: reliance on complicated prompt engineering and iterative optimization.<n>We propose an efficient jailbreak attack method, Analyzing-based Jailbreak (ABJ), which leverages the advanced reasoning capability of LLMs to autonomously generate harmful content.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process.<n>We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness.<n>We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.<n>We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.<n>Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks [34.95274579737075]
JailGuard is a universal detection framework for prompt-based attacks across text and image modalities.<n>It operates on the principle that attacks are inherently less robust than benign ones.<n>It achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%.
arXiv Detail & Related papers (2023-12-17T17:02:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.