Related papers: Boosting Jailbreak Attack with Momentum

URL: http://arxiv.org/abs/2405.01229v1
Date: Thu, 2 May 2024 12:18:14 GMT
Title: Boosting Jailbreak Attack with Momentum
Authors: Yihao Zhang, Zeming Wei
Abstract summary: Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks. We introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient.
Score: 5.047814998088682
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks, notably the well-documented jailbreak attack. Recently, the Greedy Coordinate Gradient (GCG) attack has demonstrated efficacy in exploiting this vulnerability by optimizing adversarial prompts through a combination of gradient heuristics and greedy search. However, the efficiency of this attack has become a bottleneck in the attacking process. To mitigate this limitation, in this paper we rethink the generation of adversarial prompts through an optimization lens, aiming to stabilize the optimization process and harness more heuristic insights from previous iterations. Specifically, we introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient heuristic. Experimental results showcase the notable enhancement achieved by MAC in gradient-based attacks on aligned language models. Our code is available at https://github.com/weizeming/momentum-attack-llm.
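
The abstract describes adding a momentum term to the gradient heuristic so that information from earlier iterations carries forward. The following is a minimal sketch of that general idea on toy numbers, not the authors' implementation: the exponential-moving-average form, the decay value `mu`, the function names `momentum_update` and `top_k_candidates`, and the gradient matrices are all assumptions made for illustration. Rows stand for positions in a sequence and columns for vocabulary entries.

```python
from typing import List, Optional

Matrix = List[List[float]]


def momentum_update(prev: Optional[Matrix], grad: Matrix, mu: float = 0.9) -> Matrix:
    """Blend the current gradient into a running momentum buffer.

    Assumed form: m_t = mu * m_{t-1} + (1 - mu) * g_t, with m_1 = g_1.
    """
    if prev is None:
        return [row[:] for row in grad]
    return [
        [mu * p + (1.0 - mu) * g for p, g in zip(prev_row, grad_row)]
        for prev_row, grad_row in zip(prev, grad)
    ]


def top_k_candidates(scores: Matrix, k: int) -> List[List[int]]:
    """For each position, return the k column indices with the lowest score."""
    return [sorted(range(len(row)), key=lambda j: row[j])[:k] for row in scores]


# Toy gradients for 2 positions over a 4-entry vocabulary (made-up numbers).
grad_step1 = [[0.5, -1.0, 0.2, 0.0], [0.1, 0.3, -0.4, -0.2]]
grad_step2 = [[-2.0, 1.0, 0.2, 0.0], [0.1, 0.3, -0.4, -0.2]]

buffer = momentum_update(None, grad_step1)
buffer = momentum_update(buffer, grad_step2, mu=0.9)

with_momentum = top_k_candidates(buffer, k=1)
without_momentum = top_k_candidates(grad_step2, k=1)
```

In this toy run the raw gradient at the second step flips its preferred candidate at the first position, while the momentum buffer still favors the candidate suggested by the earlier step. That smoothing is the kind of stabilization the abstract refers to, under the assumptions stated above.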

Related papers

Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses [6.736255552371404]
Alignment is one of the main approaches used to defend against attacks such as prompt injection and jailbreaks. Recent defenses report near-zero Attack Success Rates (ASR) even against Greedy Coordinate Gradient (GCG).
arXiv Detail & Related papers (2025-05-21T16:43:17Z)
Enhancing Adversarial Attacks through Chain of Thought [0.0]
Gradient-based adversarial attacks are particularly effective against aligned large language models (LLMs). This paper proposes enhancing the universality of adversarial attacks by integrating CoT prompts with the greedy coordinate gradient (GCG) technique.
arXiv Detail & Related papers (2024-10-29T06:54:00Z)
A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
Boosting Jailbreak Transferability for Large Language Models [10.884050438726215]
We propose a scenario induction template, optimized suffix selection, and the integration of a re-suffix attack mechanism to reduce inconsistent outputs. Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability.
arXiv Detail & Related papers (2024-10-21T05:11:19Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques. We propose three comprehensive, automated, and logical frameworks. We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation [49.480978190805125]
Transfer attacks generate significant interest for black-box applications. Existing works essentially optimize the single-level objective directly w.r.t. the surrogate model. We propose a bilevel optimization paradigm, which explicitly reforms the nested relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker.
arXiv Detail & Related papers (2024-06-04T07:45:27Z)
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models [78.32176751215073]
The success of the Greedy Coordinate Gradient (GCG) attack has led to growing interest in the study of optimization-based jailbreaking techniques. We present several improved (empirical) techniques for optimization-based jailbreaks like GCG. The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate.
arXiv Detail & Related papers (2024-05-31T17:07:15Z)
White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input. An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks. We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
Attacking Large Language Models with Projected Gradient Descent [12.130638442765857]
Projected Gradient Descent (PGD) for adversarial prompts on LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization at achieving the same devastating attack results.
arXiv Detail & Related papers (2024-02-14T13:13:26Z)
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks [17.22989422489567]
Large language models (LLMs) are vulnerable to adversarial attacks or jailbreaking. We propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm to create robust system-level defenses. Our results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench.
arXiv Detail & Related papers (2024-01-30T18:56:08Z)
Guidance Through Surrogate: Towards a Generic Diagnostic Attack [101.36906370355435]
We develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA). Our modified attack does not require random restarts, a large number of attack iterations, or a search for an optimal step size. More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
arXiv Detail & Related papers (2022-12-30T18:45:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.