Universal and Transferable Adversarial Attacks on Aligned Language
Models
- URL: http://arxiv.org/abs/2307.15043v2
- Date: Wed, 20 Dec 2023 20:48:57 GMT
- Title: Universal and Transferable Adversarial Attacks on Aligned Language
Models
- Authors: Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter,
Matt Fredrikson
- Abstract summary: We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
- Score: 118.41733208825278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Because "out-of-the-box" large language models are capable of generating a
great deal of objectionable content, recent work has focused on aligning these
models in an attempt to prevent undesirable generation. While there has been
some success at circumventing these measures -- so-called "jailbreaks" against
LLMs -- these attacks have required significant human ingenuity and are brittle
in practice. In this paper, we propose a simple and effective attack method
that causes aligned language models to generate objectionable behaviors.
Specifically, our approach finds a suffix that, when attached to a wide range
of queries for an LLM to produce objectionable content, aims to maximize the
probability that the model produces an affirmative response (rather than
refusing to answer). However, instead of relying on manual engineering, our
approach automatically produces these adversarial suffixes by a combination of
greedy and gradient-based search techniques, and also improves over past
automatic prompt generation methods.
Surprisingly, we find that the adversarial prompts generated by our approach
are quite transferable, including to black-box, publicly released LLMs.
Specifically, we train an adversarial attack suffix on multiple prompts (i.e.,
queries asking for many different types of objectionable content), as well as
multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting
attack suffix is able to induce objectionable content in the public interfaces
to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat,
Pythia, Falcon, and others. In total, this work significantly advances the
state-of-the-art in adversarial attacks against aligned language models,
raising important questions about how such systems can be prevented from
producing objectionable information. Code is available at
github.com/llm-attacks/llm-attacks.
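The search the abstract describes (use gradients to rank candidate token substitutions for the suffix, then greedily keep the substitution that best elicits the affirmative target) can be made concrete with a short sketch. The following is a minimal, illustrative sketch only, assuming a PyTorch/Hugging Face causal LM; the model name, function name, and hyperparameters are assumptions chosen for exposition, not the authors' released implementation (see github.com/llm-attacks/llm-attacks for that).

```python
# Illustrative sketch of one greedy, gradient-guided suffix-update step in the
# spirit of the attack described above. Model name, function name, and the
# hyperparameters are assumptions for exposition, not the released implementation.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"   # assumption: any open causal LM with gradient access
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32).eval()
model.requires_grad_(False)           # we only need gradients w.r.t. the suffix tokens
device = next(model.parameters()).device


def suffix_step(user_ids, suffix_ids, target_ids, top_k=256, n_candidates=128):
    """One update: gradients rank per-position token swaps, then the single swap
    that most increases the likelihood of the affirmative target is kept."""
    user_ids, suffix_ids, target_ids = (t.to(device) for t in (user_ids, suffix_ids, target_ids))
    embed_w = model.get_input_embeddings().weight                # (vocab, dim)

    # One-hot encode the suffix so its token choices are differentiable.
    one_hot = F.one_hot(suffix_ids, embed_w.shape[0]).to(embed_w.dtype)
    one_hot.requires_grad_(True)
    inputs = torch.cat([embed_w[user_ids], one_hot @ embed_w, embed_w[target_ids]])[None]

    logits = model(inputs_embeds=inputs).logits[0]
    tgt_start = user_ids.shape[0] + suffix_ids.shape[0]
    # Loss: negative log-likelihood of the affirmative target continuation.
    loss = F.cross_entropy(logits[tgt_start - 1:-1], target_ids)
    loss.backward()

    # Gradient w.r.t. the one-hot rows ranks promising substitutions per position.
    top_subs = (-one_hot.grad).topk(top_k, dim=1).indices        # (suffix_len, top_k)

    # Greedy part: evaluate random single-token swaps and keep the best one.
    best_ids, best_loss = suffix_ids, loss.item()
    for _ in range(n_candidates):
        cand = suffix_ids.clone()
        pos = torch.randint(len(suffix_ids), (1,)).item()
        cand[pos] = top_subs[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            cand_logits = model(torch.cat([user_ids, cand, target_ids])[None]).logits[0]
            cand_loss = F.cross_entropy(cand_logits[tgt_start - 1:-1], target_ids).item()
        if cand_loss < best_loss:
            best_ids, best_loss = cand, cand_loss
    return best_ids, best_loss
```

A full attack in the spirit of the abstract would repeat this step many times and would sum the loss over several objectionable-content prompts and over multiple models (the paper uses Vicuna-7B and 13B), so that a single shared suffix is optimized; that shared optimization is what the abstract credits for universality and transferability.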
Related papers
- DROJ: A Prompt-Driven Attack against Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks.
Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks.
We introduce a novel approach, Directed Representation Optimization Jailbreak (DROJ).
arXiv Detail & Related papers (2024-11-14T01:48:08Z)
- Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? [3.258629327038072]
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks.
Yet, the potential for generating harmful content through these models seems to persist.
This paper explores the concept of jailbreaking LLMs, i.e., reversing their alignment through adversarial triggers.
arXiv Detail & Related papers (2024-08-05T17:27:29Z)
- Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs [9.254047358707014]
We introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack.
Our experiments with several models (Google's Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS) show that this attack vector can be used by adversaries to generate harmful responses.
arXiv Detail & Related papers (2024-04-09T18:29:42Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting (AdaShield), which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time for adversarial suffixes and achieves a much higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
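To make the perturb-and-aggregate idea in the SmoothLLM entry above concrete, here is a minimal sketch based only on the one-sentence summary, not on that paper's exact algorithm; query_llm and is_refusal are hypothetical placeholders for a chat-model call and a refusal heuristic, and the perturbation is plain random character replacement.

```python
# Illustrative sketch of the perturb-and-aggregate idea summarized in the
# SmoothLLM entry above. query_llm and is_refusal are hypothetical placeholders
# (a chat-model call and a refusal heuristic); this is not the paper's exact algorithm.
import random
import string
from typing import Callable, List


def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace a fraction q of the characters with printable characters."""
    chars = list(prompt)
    if not chars:
        return prompt
    n_swap = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)


def looks_adversarial(prompt: str,
                      query_llm: Callable[[str], str],    # placeholder: any chat-model call
                      is_refusal: Callable[[str], bool],  # placeholder: e.g. detects "I cannot"
                      n_copies: int = 8, q: float = 0.1) -> bool:
    """Flag the prompt (and answer with a refusal) if a majority of randomly
    perturbed copies are refused by the model."""
    responses: List[str] = [query_llm(perturb(prompt, q)) for _ in range(n_copies)]
    refusal_votes = sum(is_refusal(r) for r in responses)
    return refusal_votes > n_copies // 2
```

The design leans on the brittleness noted in the entry: a gradient-searched suffix typically stops working once a few of its characters are randomized, so perturbed copies of an adversarial prompt tend to be refused.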
This list is automatically generated from the titles and abstracts of the papers on this site.