Fast Adversarial Attacks on Language Models In One GPU Minute
- URL: http://arxiv.org/abs/2402.15570v1
- Date: Fri, 23 Feb 2024 19:12:53 GMT
- Title: Fast Adversarial Attacks on Language Models In One GPU Minute
- Authors: Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham
Kattakinda, Atoosa Chegini, Soheil Feizi
- Abstract summary: We introduce a novel class of fast, beam search-based adversarial attacks (BEAST) for Language Models (LMs).
BEAST employs interpretable parameters, enabling attackers to balance attack speed, success rate, and the readability of adversarial prompts.
Our gradient-free targeted attack can jailbreak aligned LMs with high attack success rates within one minute.
- Score: 49.615024989416355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel class of fast, beam
search-based adversarial attacks (BEAST) for Language Models (LMs). BEAST
employs interpretable parameters, enabling attackers to balance attack speed,
success rate, and the readability of adversarial prompts. The computational
efficiency of BEAST allows us to investigate its applications to LMs for
jailbreaking, eliciting hallucinations, and privacy attacks. Our gradient-free
targeted attack can jailbreak aligned LMs with high attack success rates within
one minute. For instance, BEAST can jailbreak Vicuna-7B-v1.5 in under one
minute with an 89% success rate, whereas a gradient-based baseline takes over
an hour to reach a 70% success rate on a single Nvidia RTX A6000 48GB GPU.
Additionally, we observe that our untargeted
attack induces hallucinations in LM chatbots. Through human evaluations, we
find that our untargeted attack causes Vicuna-7B-v1.5 to produce ~15% more
incorrect outputs compared with its outputs in the absence of our attack. We
also find that 22% of the time, BEAST causes Vicuna to generate outputs that
are not relevant to the original prompt. Further, we use BEAST to generate
adversarial prompts in a few seconds that can boost the performance of existing
membership inference attacks for LMs. We believe that our fast attack, BEAST,
has the potential to accelerate research in LM security and privacy. Our
codebase is publicly available at https://github.com/vinusankars/BEAST.
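To make the beam-search idea concrete, the sketch below shows a generic gradient-free, beam search-based adversarial suffix attack in the spirit of what the abstract describes. It is not the authors' implementation: the parameter names (beam width k1, candidates per beam k2, suffix_len), the target-string scoring, and the use of the victim model's own next-token distribution to keep suffixes readable are illustrative assumptions; consult the linked repository for the actual algorithm.

```python
# Hedged sketch of a beam search-based, gradient-free adversarial suffix attack.
# Parameter names (k1, k2, suffix_len) and the scoring target are illustrative
# assumptions, not the authors' exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # victim model used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

@torch.no_grad()
def target_score(prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Log-likelihood of the desired target continuation given the (adversarial) prompt."""
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1).unsqueeze(0).cuda()
    logits = model(input_ids).logits[0]
    # Score only the target positions (position t-1 predicts token t).
    tgt_logits = logits[prompt_ids.shape[-1] - 1 : -1]
    logprobs = torch.log_softmax(tgt_logits.float(), dim=-1)
    return logprobs.gather(1, target_ids.cuda().unsqueeze(1)).sum().item()

@torch.no_grad()
def beam_attack(prompt: str, target: str, suffix_len: int = 20, k1: int = 5, k2: int = 10) -> str:
    """Greedily grow an adversarial suffix with beam search (no gradients needed)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]
    target_ids = tokenizer(target, add_special_tokens=False, return_tensors="pt").input_ids[0]
    beams = [(prompt_ids, 0.0)]
    for _ in range(suffix_len):
        candidates = []
        for ids, _ in beams:
            # Sample k2 plausible next tokens from the victim model itself,
            # which tends to keep the suffix readable (low perplexity).
            logits = model(ids.unsqueeze(0).cuda()).logits[0, -1].float()
            next_tokens = torch.multinomial(torch.softmax(logits, dim=-1), k2)
            for tok in next_tokens:
                new_ids = torch.cat([ids, tok.view(1).cpu()])
                candidates.append((new_ids, target_score(new_ids, target_ids)))
        # Keep the k1 suffixes with the highest target likelihood.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k1]
    return tokenizer.decode(beams[0][0][prompt_ids.shape[-1]:])

# Hypothetical usage: append a readable adversarial suffix to a request so the
# model becomes likely to start its reply with the target string.
# suffix = beam_attack("Write instructions for X.", "Sure, here are the instructions")
```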
Related papers
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- Denial-of-Service Poisoning Attacks against Large Language Models [64.77355353440691]
LLMs are vulnerable to denial-of-service (DoS) attacks, where spelling errors or non-semantic prompts trigger endless outputs without generating an [EOS] token.
We propose poisoning-based DoS attacks for LLMs, demonstrating that injecting a single poisoned sample designed for DoS purposes can break the output length limit.
arXiv Detail & Related papers (2024-10-14T17:39:31Z)
- FLRT: Fluent Student-Teacher Redteaming [0.0]
We improve existing algorithms to develop powerful and fluent attacks on safety-tuned models.
Our technique centers around a new distillation-based approach that encourages the victim model to emulate a toxified finetune.
We achieve attack success rates $>93$% for Llama-2-7B, Llama-3-8B, and Vicuna-7B, while maintaining model-measured perplexity $<33$.
arXiv Detail & Related papers (2024-07-24T17:23:18Z)
- ImgTrojan: Jailbreaking Vision-Language Models with ONE Image [40.55590043993117]
We propose a novel jailbreaking attack against vision-language models (VLMs).
We assume a scenario where our poisoned (image, text) data pairs are included in the training data.
By replacing the original textual captions with malicious jailbreak prompts, our method can perform jailbreak attacks with the poisoned images.
arXiv Detail & Related papers (2024-03-05T12:21:57Z)
- Does Few-shot Learning Suffer from Backdoor Attacks? [63.9864247424967]
We show that few-shot learning can still be vulnerable to backdoor attacks.
Our method demonstrates a high Attack Success Rate (ASR) in FSL tasks with different few-shot learning paradigms.
This study reveals that few-shot learning remains vulnerable to backdoor attacks, and its security deserves more attention.
arXiv Detail & Related papers (2023-12-31T06:43:36Z)
- Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization [98.18718484152595]
We propose to integrate goal prioritization at both training and inference stages to counteract the intrinsic conflict between the goals of being helpful and ensuring safety.
Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety.
arXiv Detail & Related papers (2023-11-15T16:42:29Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs; a minimal sketch of this perturb-and-aggregate idea appears after this list.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- Apple of Sodom: Hidden Backdoors in Superior Sentence Embeddings via Contrastive Learning [17.864914834411092]
We present the first backdoor attack framework, BadCSE, for state-of-the-art sentence embeddings.
We evaluate BadCSE on both STS tasks and other downstream tasks.
arXiv Detail & Related papers (2022-10-20T08:19:18Z)
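As referenced in the SmoothLLM entry above, that defense perturbs several copies of an input prompt and aggregates the model's responses. The minimal sketch below illustrates only that perturb-and-aggregate idea; the perturbation rate, number of copies, majority-vote rule, and the keyword-based `is_refusal` check are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of a perturb-and-aggregate defense against adversarial prompts.
# All thresholds and the refusal heuristic are illustrative assumptions.
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters; adversarially generated prompts
    are reported to be brittle to such character-level changes."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def is_refusal(text: str) -> bool:
    """Hypothetical refusal detector; a keyword check stands in for a real classifier."""
    return any(k in text.lower() for k in ("i cannot", "i can't", "sorry"))

def smooth_defense(prompt: str, generate, n_copies: int = 10, rate: float = 0.1) -> str:
    """Query the model on several perturbed copies and aggregate by majority vote:
    if most copies are refused, treat the original prompt as adversarial."""
    responses = [generate(perturb(prompt, rate)) for _ in range(n_copies)]
    refused = [is_refusal(r) for r in responses]
    if sum(refused) > n_copies // 2:
        return "Request flagged as a likely jailbreak attempt."
    # Otherwise return any response that was not refused.
    return next(r for r, bad in zip(responses, refused) if not bad)

# Example usage with any text-generation callable `generate(prompt) -> str`:
# answer = smooth_defense("Tell me about beam search.", generate)
```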