Related papers: Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

URL: http://arxiv.org/abs/2410.15362v1
Date: Sun, 20 Oct 2024 11:27:41 GMT
Title: Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Authors: Xiao Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, Xiaolin Hu,
Abstract summary: We propose Faster-GCG, an efficient adversarial jailbreak method by delving deep into the design of GCG. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost.
Score: 16.938267820586024
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, where adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that seeks to find a suffix capable of jailbreaking aligned LLMs. Despite the success of GCG, we find it suboptimal, requiring significantly large computational costs, and the achieved jailbreaking performance is limited. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method by delving deep into the design of GCG. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, We demonstrate that Faster-GCG exhibits improved attack transferability when testing on closed-sourced LLMs such as ChatGPT.

Related papers

Geneshift: Impact of different scenario shift on Jailbreaking LLM [55.26229741296822]
We propose a black-box jailbreak attack termed GeneShift, by using a genetic algorithm to optimize the scenario shifts. We show that GeneShift increases the jailbreak success rate from 0% to 60% when direct prompting alone would fail.
arXiv Detail & Related papers (2025-04-10T20:02:35Z)
Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models [16.83476701024932]
Greedy Coordinate Gradient (GCG) method has demonstrated ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs. We propose the Model Attack Gradient Index GCG (MAGIC) that addresses the Indirect Effect by exploiting the gradient information of the suffix tokens. Experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates (ASR) on par or even higher.
arXiv Detail & Related papers (2024-12-11T18:37:56Z)
LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [98.20826635707341]
LIAR (Leveraging Inference time Alignment to jailbReak) is a fast and efficient best-of-N approach tailored for jailbreak attacks. Our results demonstrate that a best-of-N approach is a simple yet highly effective strategy for evaluating the robustness of aligned LLMs.
arXiv Detail & Related papers (2024-12-06T18:02:59Z)
AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts [10.536276489213497]
A generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query. We introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. We jailbreak the newer GPT-4o series of models at similar rates to GPT-4, and, uncovers vulnerabilities against the recently proposed circuit breakers defense.
arXiv Detail & Related papers (2024-10-29T15:40:07Z)
Boosting Jailbreak Transferability for Large Language Models [10.884050438726215]
We propose a scenario induction template, optimized suffix selection, and the integration of re-suffix attack mechanism to reduce inconsistent outputs. Our approach has shown superior performance in extensive experiments across various benchmarks, achieving nearly 100% success rates in both attack execution and transferability.
arXiv Detail & Related papers (2024-10-21T05:11:19Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation [42.797865918373326]
We study the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks. We introduce an enhanced method that manipulates models' attention scores to facilitate jailbreaking. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs.
arXiv Detail & Related papers (2024-10-11T17:55:09Z)
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models [78.32176751215073]
Greedy Coordinate Gradient (GCG) attack's success has led to a growing interest in the study of optimization-based jailbreaking techniques. We present several improved (empirical) techniques for optimization-based jailbreaks like GCG. The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate.
arXiv Detail & Related papers (2024-05-31T17:07:15Z)
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs [11.094625711201648]
GCGcitepzou2023universal proposes a discrete token optimization algorithm and selects the single suffix with the lowest loss to successfully jailbreak aligned LLMs. We utilize successful suffixes as training data to learn a generative model, named AmpleGCG, which captures the distribution of adversarial suffixes given a harmful query. A AmpleGCG model can generate 200 adversarial suffixes for one harmful query in only 4 seconds, rendering it more challenging to defend.
arXiv Detail & Related papers (2024-04-11T17:05:50Z)
PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated capabilities to generate harmful content when manipulated. We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting. Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.