Related papers: Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

URL: http://arxiv.org/abs/2601.03420v1
Date: Tue, 06 Jan 2026 21:14:13 GMT
Title: Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks
Authors: Zhakshylyk Nurlanov, Frank R. Schmidt, Florian Bernard,
Abstract summary: We introduce RAILS, a framework that operates solely on model logits.<n>By eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks.<n> Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.
Score: 22.52730333160258
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical domains, rigorously evaluating their robustness against adversarial jailbreaks is essential. However, current safety evaluations often overestimate robustness because existing automated attacks are limited by restrictive assumptions. They typically rely on handcrafted priors or require white-box access for gradient propagation. We challenge these constraints by demonstrating that token-level iterative optimization can succeed without gradients or priors. We introduce RAILS (RAndom Iterative Local Search), a framework that operates solely on model logits. RAILS matches the effectiveness of gradient-based methods through two key innovations: a novel auto-regressive loss that enforces exact prefix matching, and a history-based selection strategy that bridges the gap between the proxy optimization objective and the true attack success rate. Crucially, by eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks. This allows for the discovery of shared adversarial patterns that generalize across disjoint vocabularies, significantly enhancing transferability to closed-source systems. Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.

Related papers

RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models [60.201244463046784]
Large language models are vulnerable to jailbreak attacks.<n>This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models.
arXiv Detail & Related papers (2025-12-08T17:42:59Z)
RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs [17.313975711973374]
RAID (Refusal-Aware and Integrated Decoding) is a framework that crafts adversarial suffixes that induce restricted content while preserving fluency.<n>We show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines.
arXiv Detail & Related papers (2025-10-14T19:33:09Z)
Untargeted Jailbreak Attack [42.94437968995701]
gradient-based jailbreak attacks on Large Language Models (LLMs)<n>We propose the first gradient-based untargeted jailbreak attack (UJA), aiming to elicit an unsafe response without enforcing any predefined patterns.<n>Extensive evaluations demonstrate that UJA can achieve over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations.
arXiv Detail & Related papers (2025-10-03T13:38:56Z)
bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs [33.470999703070866]
Existing approaches to embedding jailbreak triggers suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability.<n>We propose bi-GRPO, a novel RL-based framework tailored explicitly for jailbreak backdoor injection.
arXiv Detail & Related papers (2025-09-24T05:56:41Z)
Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent [1.1187085721899017]
Large language models (LLMs) are increasingly deployed in critical applications.<n>LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts.<n>We propose an intrinsic optimization method which directly optimize relaxed one-hot encodings of the adversarial suffix tokens.
arXiv Detail & Related papers (2025-08-20T17:03:32Z)
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses [4.706534644850809]
Two primary inference-phase threats are token-level and prompt-level jailbreaks.<n>We propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs.
arXiv Detail & Related papers (2025-06-27T07:26:33Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
Improving LLM Safety Alignment with Dual-Objective Optimization [81.98466438000086]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective [57.57786477441956]
We propose an adaptive and semantic optimization problem over the population of responses.<n>Our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
arXiv Detail & Related papers (2025-02-24T15:34:48Z)
Jailbreak Attack Initializations as Extractors of Compliance Directions [5.910850302054065]
Safety-aligned LLMs respond to prompts with either compliance or refusal.<n>Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance.<n>We propose CRI, an framework that aims to project unseen prompts further along compliance directions.
arXiv Detail & Related papers (2025-02-13T20:25:40Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [50.980446687774645]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability.<n>Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs.<n>It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation [49.480978190805125]
Transfer attacks generate significant interest for black-box applications. Existing works essentially directly optimize the single-level objective w.r.t. surrogate model. We propose a bilevel optimization paradigm, which explicitly reforms the nested relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker.
arXiv Detail & Related papers (2024-06-04T07:45:27Z)
Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired in transfer-based attacks. We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.