GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
- URL: http://arxiv.org/abs/2411.14133v1
- Date: Thu, 21 Nov 2024 14:00:01 GMT
- Title: GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
- Authors: Advik Raj Basani, Xiao Zhang
- Abstract summary: We introduce Generative Adversarial Suffix Prompter (GASP) to improve adversarial suffix creation in a fully black-box setting.
Our experiments show that GASP can generate natural jailbreak prompts, significantly improving attack success rates, reducing training times, and accelerating inference speed.
- Score: 3.096869664709865
- Abstract: Large Language Models (LLMs) have shown impressive proficiency across a range of natural language processing tasks yet remain vulnerable to adversarial prompts, known as jailbreak attacks, carefully designed to elicit harmful responses from LLMs. Traditional methods rely on manual heuristics, which suffer from limited generalizability. While automatic, optimization-based attacks often produce unnatural jailbreak prompts that are easily detected by safety filters or require high computational overhead due to discrete token optimization. To address the limitations of existing jailbreak methods, we introduce Generative Adversarial Suffix Prompter (GASP), a novel framework that combines human-readable prompt generation with Latent Bayesian Optimization (LBO) to improve adversarial suffix creation in a fully black-box setting. GASP leverages LBO to craft adversarial suffixes by efficiently exploring continuous embedding spaces, gradually optimizing the model to improve attack efficacy while balancing prompt coherence through a targeted iterative refinement procedure. Our experiments show that GASP can generate natural jailbreak prompts, significantly improving attack success rates, reducing training times, and accelerating inference speed, thus making it an efficient and scalable solution for red-teaming LLMs.
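The sketch below illustrates how a Latent Bayesian Optimization loop of this kind can be structured: fit a Gaussian-process surrogate over a continuous latent space, choose the next latent by maximizing an acquisition function, and spend one black-box query per iteration. It is a minimal sketch under stated assumptions, not the authors' implementation; the latent dimension `DIM` and the `decode_suffix` and `jailbreak_score` functions are hypothetical stand-ins (in GASP, the latent would be decoded into a readable suffix by a trained prompter and scored by querying the target LLM).

```python
# Minimal sketch (assumptions noted above) of a Latent Bayesian Optimization
# loop for black-box adversarial-suffix search. Not the authors' code:
# DIM, decode_suffix, and jailbreak_score are hypothetical stand-ins.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
DIM = 8  # latent dimensionality (assumption)


def decode_suffix(z):
    """Hypothetical: map a latent vector to a natural-language suffix string."""
    return f"suffix<{np.round(z, 2).tolist()}>"


def jailbreak_score(z):
    """Hypothetical black-box objective in [0, 1]: a smooth toy surrogate here;
    in the real attack, a target-LLM query plus a judge's score."""
    return float(np.exp(-np.sum((z - 0.3) ** 2)))


def expected_improvement(gp, candidates, best_y, xi=0.01):
    """Standard expected-improvement acquisition over a pool of candidate latents."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    imp = mu - best_y - xi
    u = imp / sigma
    return imp * norm.cdf(u) + sigma * norm.pdf(u)


# Initialize the surrogate with a few random black-box evaluations.
Z = rng.uniform(-1.0, 1.0, size=(5, DIM))
y = np.array([jailbreak_score(z) for z in Z])

# Each iteration spends exactly one black-box query on the most promising latent.
for step in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(Z, y)
    candidates = rng.uniform(-1.0, 1.0, size=(256, DIM))
    z_next = candidates[int(np.argmax(expected_improvement(gp, candidates, y.max())))]
    Z = np.vstack([Z, z_next])
    y = np.append(y, jailbreak_score(z_next))

best = Z[int(np.argmax(y))]
print("best score:", round(float(y.max()), 3), "| suffix (toy):", decode_suffix(best))
```

The appeal of a surrogate-based search in this setting is query efficiency: most of the exploration happens in the cheap continuous latent space, and only a small number of evaluations reach the black-box target model.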
Related papers
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.
We propose a novel black-box jailbreak method leveraging reinforcement learning (RL).
We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models [59.29840790102413]
Existing jailbreak attacks are primarily based on opaque optimization techniques and gradient search methods.
We propose LLM-Virus, a jailbreak attack method based on evolutionary algorithm, termed evolutionary jailbreak.
Our results show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
arXiv Detail & Related papers (2024-12-28T07:48:57Z) - JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs [11.924542310342282]
We present JailPO, a novel black-box jailbreak framework to examine Large Language Models (LLMs) alignment.
For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts.
We also introduce a preference optimization-based attack method to enhance the jailbreak effectiveness.
arXiv Detail & Related papers (2024-12-20T07:29:10Z) - LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [98.20826635707341]
LIAR (Leveraging Inference time Alignment to jailbReak) is a fast and efficient best-of-N approach tailored for jailbreak attacks.
Our results demonstrate that a best-of-N approach is a simple yet highly effective strategy for evaluating the robustness of aligned LLMs; a minimal best-of-N sketch appears after this list.
arXiv Detail & Related papers (2024-12-06T18:02:59Z) - Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer [33.67942887761857]
We present ECLIPSE, a novel and efficient black-box jailbreaking method utilizing optimizable suffixes.
We employ task prompts to translate jailbreaking goals into natural language instructions, which guide the LLM to generate adversarial suffixes for malicious queries.
ECLIPSE achieves an average attack success rate (ASR) of 0.92 across three open-source LLMs and GPT-3.5-Turbo, surpassing GCG by a factor of 2.4.
arXiv Detail & Related papers (2024-08-21T03:35:24Z) - Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.
We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects the LLM's ethical decision boundary.
Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z) - Distract Large Language Models for Automatic Jailbreak Attack [8.364590541640482]
We propose a novel black-box jailbreak framework for automated red teaming of large language models.
We combine malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs.
arXiv Detail & Related papers (2024-03-13T11:16:43Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
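As referenced in the LIAR entry above, a best-of-N attack is conceptually simple: sample N candidate attack prompts from an attacker model, query the target once per candidate, and keep the attempt a judge scores highest. The sketch below is a minimal illustration of that loop, not the LIAR implementation; `propose_suffix`, `query_target`, and `judge` are hypothetical stand-ins for the attacker model, the aligned target LLM, and the response scorer.

```python
# Minimal best-of-N sketch (assumptions noted above); not the LIAR code.
import random


def propose_suffix(prompt):
    """Hypothetical attacker model: returns one candidate attack prompt."""
    return f"{prompt} [candidate {random.randint(0, 9999)}]"


def query_target(full_prompt):
    """Hypothetical single black-box call to the aligned target LLM."""
    return f"response to: {full_prompt}"


def judge(response):
    """Hypothetical judge returning a jailbreak score in [0, 1]."""
    return random.random()


def best_of_n(prompt, n=16):
    """Keep the highest-scoring of n independent attack attempts."""
    best_attempt, best_score = None, -1.0
    for _ in range(n):
        attempt = propose_suffix(prompt)
        score = judge(query_target(attempt))
        if score > best_score:
            best_attempt, best_score = attempt, score
    return best_attempt, best_score


print(best_of_n("harmless placeholder prompt"))
```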
This list is automatically generated from the titles and abstracts of the papers in this site.