Related papers: Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

URL: http://arxiv.org/abs/2405.21018v2
Date: Wed, 5 Jun 2024 16:35:49 GMT
Title: Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Authors: Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
Abstract summary: The success of the Greedy Coordinate Gradient (GCG) attack has led to growing interest in optimization-based jailbreaking techniques. We present several improved (empirical) techniques for optimization-based jailbreaks like GCG. The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve a nearly 100% attack success rate.
Score: 78.32176751215073
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs, where among these efforts, the Greedy Coordinate Gradient (GCG) attack's success has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspects, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialisation. Then, we combine these improved technologies to develop an efficient jailbreak method, dubbed I-GCG. In our experiments, we evaluate on a series of benchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate. The code is released at https://github.com/jiaxiaojunQAQ/I-GCG.

Related papers

Geneshift: Impact of different scenario shift on Jailbreaking LLM [55.26229741296822]
We propose a black-box jailbreak attack termed GeneShift, by using a genetic algorithm to optimize the scenario shifts. We show that GeneShift increases the jailbreak success rate from 0% to 60% when direct prompting alone would fail.
arXiv Detail & Related papers (2025-04-10T20:02:35Z)
Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models [16.83476701024932]
The Greedy Coordinate Gradient (GCG) method has demonstrated the ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs. We propose the Model Attack Gradient Index GCG (MAGIC), which addresses the Indirect Effect by exploiting the gradient information of the suffix tokens. Experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup while maintaining Attack Success Rates (ASR) on par or even higher.
arXiv Detail & Related papers (2024-12-11T18:37:56Z)
LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [98.20826635707341]
LIAR (Leveraging Inference time Alignment to jailbReak) is a fast and efficient best-of-N approach tailored for jailbreak attacks. Our results demonstrate that a best-of-N approach is a simple yet highly effective strategy for evaluating the robustness of aligned LLMs.
arXiv Detail & Related papers (2024-12-06T18:02:59Z)
AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts [10.536276489213497]
A generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query. We introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. We jailbreak the newer GPT-4o series of models at rates similar to GPT-4, and uncover vulnerabilities in the recently proposed circuit breakers defense.
arXiv Detail & Related papers (2024-10-29T15:40:07Z)
Boosting Jailbreak Transferability for Large Language Models [10.884050438726215]
We propose a scenario induction template, optimized suffix selection, and the integration of re-suffix attack mechanism to reduce inconsistent outputs. Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability.
arXiv Detail & Related papers (2024-10-21T05:11:19Z)
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models [16.938267820586024]
We propose Faster-GCG, an efficient adversarial jailbreak method by delving deep into the design of GCG. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost.
arXiv Detail & Related papers (2024-10-20T11:27:41Z)
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation [42.797865918373326]
We study the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks. We introduce an enhanced method that manipulates models' attention scores to facilitate jailbreaking. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs.
arXiv Detail & Related papers (2024-10-11T17:55:09Z)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques. We propose three comprehensive, automated, and logical frameworks. We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
Automatic Jailbreaking of the Text-to-Image Generative AI Systems [76.9697122883554]
We study the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, with respect to copyright infringement under naive prompts. We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our framework successfully jailbreaks ChatGPT with an 11.0% block rate, making it generate copyrighted content 76% of the time.
arXiv Detail & Related papers (2024-05-26T13:32:24Z)
Boosting Jailbreak Attack with Momentum [5.047814998088682]
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks. We introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient.
arXiv Detail & Related papers (2024-05-02T12:18:14Z)
PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated capabilities to generate harmful content when manipulated. We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting. Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z)
Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.