TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
- URL: http://arxiv.org/abs/2603.03081v1
- Date: Tue, 03 Mar 2026 15:25:53 GMT
- Title: TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
- Authors: Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu
- Abstract summary: Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks. We propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.
- Score: 15.495882533240833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.
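The abstract names the two ingredients but gives no formulas. Below is a minimal PyTorch sketch of how a two-stage loss and a direction-priority candidate filter could fit together; the stage-1 refusal term, the positive-alignment cutoff, and all function names are illustrative assumptions, not the paper's actual formulation (the stage-2 pseudo-harmfulness penalty in particular would require a judge model, omitted here).

```python
# Illustrative sketch only -- not the authors' released code.
import torch
import torch.nn.functional as F

def two_stage_loss(logits, refusal_ids, target_ids, stage):
    """Stage 1 drives down the probability of a refusal prefix; stage 2
    pulls generation toward a target completion (a crude stand-in for
    the paper's pseudo-harmfulness penalty)."""
    log_probs = F.log_softmax(logits, dim=-1)              # (seq, vocab)
    if stage == 1:
        # Minimizing the refusal log-likelihood suppresses refusals.
        return log_probs[: len(refusal_ids)].gather(
            1, refusal_ids.unsqueeze(1)).sum()
    # Stage 2: teacher-forced NLL on the target completion.
    return -log_probs[: len(target_ids)].gather(
        1, target_ids.unsqueeze(1)).sum()

def dpto_candidates(grad, emb, cur_ids, k=8):
    """Direction first, magnitude second: keep only replacement tokens
    whose embedding change points along the descent direction, then
    rank the survivors by first-order predicted loss decrease."""
    cands = []
    for pos, tok in enumerate(cur_ids.tolist()):
        delta = emb - emb[tok]                             # (vocab, dim)
        align = F.cosine_similarity(delta, -grad[pos].unsqueeze(0), dim=1)
        decrease = delta @ (-grad[pos])                    # first-order estimate
        idx = torch.nonzero(align > 0).squeeze(1)          # aligned tokens only
        top = idx[decrease[idx].topk(min(k, idx.numel())).indices]
        cands.append(top)
    return cands

# Toy usage with random tensors standing in for a real model's quantities.
seq, vocab, dim = 12, 1000, 64
loss = two_stage_loss(torch.randn(seq, vocab),
                      torch.tensor([0, 1, 2]), torch.tensor([3, 4]), stage=1)
cands = dpto_candidates(torch.randn(seq, dim), torch.randn(vocab, dim),
                        torch.randint(vocab, (seq,)))
```

The direction-first filter is the point of the sketch: a large first-order decrease is only trusted when the token swap actually points along the descent direction, rather than selecting candidates by magnitude alone.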
Related papers
- Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks [22.52730333160258]
We introduce RAILS, a framework that operates solely on model logits. By eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks. Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.
arXiv Detail & Related papers (2026-01-06T21:14:13Z)
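Operating "solely on model logits" means RAILS's scoring loop can run entirely under `torch.no_grad()`. The sketch below is a generic stand-in for that idea, assuming a HuggingFace-style model that returns `.logits`; the random-mutation search is an illustrative placeholder, not RAILS's actual search procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()  # logits only: no backward pass anywhere
def score_suffix(model, prompt_ids, suffix_ids, target_ids):
    """Log-likelihood of the target continuation given prompt + suffix."""
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    logits = model(ids).logits[0]                        # (seq, vocab)
    start = len(prompt_ids) + len(suffix_ids) - 1        # predicts target[0]
    lp = F.log_softmax(logits[start:start + len(target_ids)], dim=-1)
    return lp.gather(1, target_ids.unsqueeze(1)).sum().item()

@torch.no_grad()
def search_step(model, prompt_ids, suffix_ids, target_ids, vocab, n=32):
    """Try n random single-token mutations and keep the best scorer."""
    best = suffix_ids
    best_s = score_suffix(model, prompt_ids, suffix_ids, target_ids)
    for _ in range(n):
        cand = suffix_ids.clone()
        cand[torch.randint(len(cand), (1,))] = torch.randint(vocab, (1,))
        s = score_suffix(model, prompt_ids, cand, target_ids)
        if s > best_s:
            best, best_s = cand, s
    return best, best_s
```

Because nothing here requires gradients, the same scorer can be evaluated on several models at once even when their tokenizers differ, which is what enables the cross-tokenizer ensembling the summary mentions.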
- RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models [60.201244463046784]
Large language models are vulnerable to jailbreak attacks. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models.
arXiv Detail & Related papers (2025-12-08T17:42:59Z)
- Untargeted Jailbreak Attack [42.94437968995701]
Existing gradient-based jailbreak attacks on Large Language Models (LLMs) optimize toward predefined target responses. We propose the first gradient-based untargeted jailbreak attack (UJA), aiming to elicit an unsafe response without enforcing any predefined patterns. Extensive evaluations demonstrate that UJA can achieve over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations.
arXiv Detail & Related papers (2025-10-03T13:38:56Z)
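The untargeted objective is UJA's distinguishing idea: rather than forcing the response to match a fixed target string, the loss rewards whatever a judge scores as unsafe. A toy differentiable rendering follows; `ToyJudge` and the loss form are assumptions for illustration, not the paper's judge or objective.

```python
import torch
import torch.nn.functional as F

class ToyJudge(torch.nn.Module):
    """Stand-in differentiable judge: maps soft tokens to P(unsafe)."""
    def __init__(self, vocab, dim=32):
        super().__init__()
        self.emb = torch.nn.Linear(vocab, dim, bias=False)
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, soft_tokens):              # (seq, vocab), rows on simplex
        return torch.sigmoid(self.head(self.emb(soft_tokens).mean(dim=0)))

def untargeted_loss(response_logits, judge):
    """No predefined target: push the relaxed response toward whatever
    the judge considers unsafe."""
    soft = F.softmax(response_logits, dim=-1)
    return -torch.log(judge(soft) + 1e-9).squeeze()  # minimize => more unsafe

# Toy usage: gradients flow from the judge back into the response logits.
logits = torch.randn(20, 500, requires_grad=True)
untargeted_loss(logits, ToyJudge(500)).backward()
```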
- Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience [36.525169416008886]
Large language models (LLMs) generate human-aligned content under certain safety constraints. The JailExpert framework is the first to achieve a formal representation of experience structure. JailExpert achieves an average 17% increase in attack success rate and a 2.7-fold improvement in attack efficiency.
arXiv Detail & Related papers (2025-08-25T14:16:30Z)
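The summary is terse about what the "experience structure" contains. Purely as illustration of the reuse idea, here is a minimal store-and-retrieve sketch in plain Python; every field and function name is a hypothetical stand-in rather than JailExpert's actual representation.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    goal_embedding: list[float]   # embedding of the harmful goal attacked
    template: str                 # jailbreak pattern that was tried
    success_rate: float           # observed effectiveness of that pattern

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / (den + 1e-9)

def retrieve(store, query_emb, top=3):
    """Rank stored experiences by similarity to the new goal, weighted by
    past success, and return the best candidates to seed the next attack."""
    return sorted(store,
                  key=lambda e: cosine(e.goal_embedding, query_emb) * e.success_rate,
                  reverse=True)[:top]
```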
- Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent [1.1187085721899017]
Large language models (LLMs) are increasingly deployed in critical applications. LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. We propose an intrinsic optimization method which directly optimizes relaxed one-hot encodings of the adversarial suffix tokens.
arXiv Detail & Related papers (2025-08-20T17:03:32Z)
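Exponentiated gradient descent suits relaxed one-hot encodings because its multiplicative update keeps each row on the probability simplex with a simple renormalization, no projection step required. A minimal sketch (step size and shapes are illustrative):

```python
import torch

def egd_step(P, grad, eta=0.5):
    """One exponentiated-gradient step on relaxed one-hot suffix rows.
    Multiply-and-renormalize keeps every row a valid distribution."""
    P = P * torch.exp(-eta * grad)          # move against the gradient
    return P / P.sum(dim=1, keepdim=True)   # renormalize onto the simplex

# Toy usage: rows stay on the simplex after every step.
P = torch.softmax(torch.randn(10, 100), dim=1)   # 10 suffix positions
P = egd_step(P, torch.randn(10, 100))
assert torch.allclose(P.sum(dim=1), torch.ones(10), atol=1e-5)
```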
- T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks [67.91652526657599]
We formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail. We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models. The proposed method improves on the existing state-of-the-art method by 11.4% and 10.0% in attack success rate.
arXiv Detail & Related papers (2025-05-10T16:04:52Z)
- Improving LLM Safety Alignment with Dual-Objective Optimization [81.98466438000086]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
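A toy rendering of the two disentangled components is below; the plain likelihood-ascent unlearning term and the fixed weighting are simplifications of mine, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def dual_objective(refusal_logits, refusal_ids, harm_logits, harm_ids, lam=0.5):
    """(1) Keep refusing even after a partial unsafe generation has been
    produced; (2) push down the likelihood of harmful continuations."""
    refuse = F.cross_entropy(refusal_logits, refusal_ids)        # learn to refuse
    harm_lp = F.log_softmax(harm_logits, dim=-1).gather(
        1, harm_ids.unsqueeze(1)).mean()
    return refuse + lam * harm_lp        # adding log-lik => ascent, i.e. unlearning

# Toy usage with random logits standing in for model outputs.
loss = dual_objective(torch.randn(5, 100), torch.randint(100, (5,)),
                      torch.randn(7, 100), torch.randint(100, (7,)))
```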
- An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model checks whether a given jailbreak is likely to occur in the distribution of text. We adapt popular attacks to this threat model and, for the first time, benchmark these attacks on equal footing with it.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
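The threat model reduces to a cheap, interpretable test: score a candidate prompt's perplexity under an N-gram model and flag prompts that natural text would essentially never produce. A bigram version with add-one smoothing (the toy corpus and smoothing choice are illustrative):

```python
import math
from collections import Counter

def fit_bigram(corpus_tokens):
    """Count bigrams and unigrams from a reference corpus."""
    return Counter(zip(corpus_tokens, corpus_tokens[1:])), Counter(corpus_tokens)

def bigram_perplexity(tokens, bigrams, unigrams, vocab_size):
    """Perplexity under an add-one-smoothed bigram model; gibberish
    adversarial suffixes score far higher than natural text."""
    log_p = sum(math.log((bigrams[(p, c)] + 1) / (unigrams[p] + vocab_size))
                for p, c in zip(tokens, tokens[1:]))
    return math.exp(-log_p / max(len(tokens) - 1, 1))

corpus = "please write a short story about a helpful robot".split()
bi, uni = fit_bigram(corpus)
natural = "please write a short story".split()
garbled = "story ! robot describing .( plus".split()
assert bigram_perplexity(natural, bi, uni, 50) < bigram_perplexity(garbled, bi, uni, 50)
```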
- Boosting Jailbreak Transferability for Large Language Models [10.884050438726215]
We propose a scenario induction template, optimized suffix selection, and the integration of a re-suffix attack mechanism to reduce inconsistent outputs. Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability.
arXiv Detail & Related papers (2024-10-21T05:11:19Z)
- Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation [49.480978190805125]
Transfer attacks generate significant interest for black-box applications. Existing works essentially directly optimize a single-level objective w.r.t. the surrogate model. We propose a bilevel optimization paradigm that explicitly reformulates the nested relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker.
arXiv Detail & Related papers (2024-06-04T07:45:27Z)
- Attacking Large Language Models with Projected Gradient Descent [49.19426387912186]
Our Projected Gradient Descent (PGD) on adversarial prompts is up to one order of magnitude faster than state-of-the-art discrete optimization while achieving the same devastating attack results.
arXiv Detail & Related papers (2024-02-14T13:13:26Z)
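The speed claim rests on the inner loop being just a gradient step on a continuously relaxed prompt followed by a Euclidean projection of each token row back onto the simplex. A sketch using the classic sort-and-threshold projection (names and step size are illustrative; the paper's additional machinery is omitted):

```python
import torch

def project_simplex(v):
    """Row-wise Euclidean projection onto the probability simplex
    (the standard sort-and-threshold algorithm)."""
    u, _ = torch.sort(v, dim=1, descending=True)
    css = u.cumsum(dim=1) - 1.0
    k = torch.arange(1, v.size(1) + 1, device=v.device)
    cond = (u - css / k) > 0              # true exactly on a prefix of entries
    rho = cond.sum(dim=1) - 1             # index of the last valid entry
    tau = css.gather(1, rho.unsqueeze(1)) / (rho + 1).unsqueeze(1).float()
    return torch.clamp(v - tau, min=0.0)

def pgd_step(P, grad, lr=0.1):
    """One PGD step on the relaxed one-hot prompt: descend, then project."""
    return project_simplex(P - lr * grad)

# Toy usage: rows remain valid distributions after the step.
P = torch.softmax(torch.randn(8, 64), dim=1)
P = pgd_step(P, torch.randn(8, 64))
assert torch.allclose(P.sum(dim=1), torch.ones(8), atol=1e-5)
```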
- Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization [60.72410937614299]
We propose a new tractable bi-level optimization problem, and design and analyze a new set of algorithms termed Bi-level AT (FAST-BAT).
FAST-BAT is capable of defending against sign-based projected gradient descent (PGD) attacks without calling any gradient sign method or explicit robust regularization.
arXiv Detail & Related papers (2021-12-23T06:25:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.