RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
- URL: http://arxiv.org/abs/2510.13901v1
- Date: Tue, 14 Oct 2025 19:33:09 GMT
- Title: RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
- Authors: Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper
- Abstract summary: RAID (Refusal-Aware and Integrated Decoding) is a framework that crafts adversarial suffixes that induce restricted content while preserving fluency. We show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines.
- Score: 17.313975711973374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
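The joint objective and the critic-guided decoding step described in the abstract can be sketched as follows. This is a minimal illustration only: the function names, loss weights, the particular coherence penalty, and the way the refusal direction is estimated are assumptions for the sketch, not the paper's exact formulation.

```python
# Hedged sketch of a RAID-style suffix optimization objective and the
# embedding-to-token decoding step. All weights, shapes, and the specific
# penalties are illustrative assumptions. Requires: torch.
import torch
import torch.nn.functional as F


def raid_objective(target_logits, target_ids, suffix_emb, refusal_dir,
                   lambda_refusal=0.1, lambda_coherence=0.05):
    """Joint objective over a continuous (relaxed) suffix embedding.

    target_logits : (T, V) LM logits at the positions of the desired response
    target_ids    : (T,)   token ids of the desired (restricted) response
    suffix_emb    : (S, D) relaxed suffix embeddings being optimized
    refusal_dir   : (D,)   unit vector assumed to encode a "refusal" direction
    """
    # (i) encourage the restricted response: cross-entropy on the target tokens
    attack_loss = F.cross_entropy(target_logits, target_ids)

    # (ii) refusal-aware regularizer: penalize the projection of the suffix
    # embeddings onto the refusal direction
    refusal_loss = (suffix_emb @ refusal_dir).pow(2).mean()

    # (iii) coherence term: one plausible choice is to discourage redundancy by
    # penalizing high cosine similarity between neighbouring suffix positions
    sim = F.cosine_similarity(suffix_emb[:-1], suffix_emb[1:], dim=-1)
    coherence_loss = sim.clamp(min=0).mean()

    return attack_loss + lambda_refusal * refusal_loss + lambda_coherence * coherence_loss


def critic_guided_decode(suffix_emb, embed_matrix, lm_logprobs, alpha=0.5):
    """Map optimized embeddings back to discrete tokens by balancing
    embedding affinity with language-model likelihood (greedy variant).

    embed_matrix : (V, D) token embedding table
    lm_logprobs  : (S, V) log-probabilities of each candidate token at each
                   suffix position under the language model
    """
    # embedding affinity: cosine similarity between each relaxed position and
    # every vocabulary embedding
    affinity = F.normalize(suffix_emb, dim=-1) @ F.normalize(embed_matrix, dim=-1).T
    score = alpha * affinity + (1 - alpha) * lm_logprobs
    return score.argmax(dim=-1)  # (S,) discrete suffix tokens


if __name__ == "__main__":
    # Dummy tensors only, to show the expected shapes.
    S, T, V, D = 20, 32, 1000, 64
    loss = raid_objective(torch.randn(T, V), torch.randint(0, V, (T,)),
                          torch.randn(S, D, requires_grad=True),
                          F.normalize(torch.randn(D), dim=0))
    tokens = critic_guided_decode(torch.randn(S, D), torch.randn(V, D),
                                  torch.log_softmax(torch.randn(S, V), dim=-1))
```

In this reading, the decoding step trades off staying close to the optimized embeddings (affinity) against producing a fluent, high-likelihood token sequence, which matches the abstract's claim that the resulting suffixes are both effective and natural in form.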
Related papers
- Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility [26.564913442069866]
Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings. Existing defenses rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. We propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration.
arXiv Detail & Related papers (2026-02-03T11:26:05Z)
- Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks [22.52730333160258]
We introduce RAILS, a framework that operates solely on model logits. By eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks. Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.
arXiv Detail & Related papers (2026-01-06T21:14:13Z)
- Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks [29.465042445657947]
New attacks expose large language models' inability to recognize unseen malicious instructions. We propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions. We show significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.
arXiv Detail & Related papers (2025-08-27T16:44:03Z)
- SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism [123.54980913741828]
Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, MLLMs are susceptible to multimodal jailbreak attacks, which hinders their safe deployment. We propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers.
arXiv Detail & Related papers (2025-07-02T09:22:03Z)
- Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models [11.867355323884217]
We present a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency.
arXiv Detail & Related papers (2025-06-20T05:30:25Z)
- Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
- Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression [12.215295420714787]
"Reasoning Interruption Attack" is a prompt injection attack based on adaptive token compression.<n>We develop a systematic approach to efficiently collect attack prompts and an adaptive token compression framework.<n> Experiments show our compression framework significantly reduces prompt length while maintaining effective attack capabilities.
arXiv Detail & Related papers (2025-04-29T07:34:22Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
- Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process. CPO enhances the model's perception of the safety status of given dialogues. Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety while maintaining the original performance.
arXiv Detail & Related papers (2025-02-18T15:48:46Z)
- ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
- Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. We propose two novel defense mechanisms, boundary awareness and explicit reminder, to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)