PRP: Propagating Universal Perturbations to Attack Large Language Model
Guard-Rails
- URL: http://arxiv.org/abs/2402.15911v1
- Date: Sat, 24 Feb 2024 21:27:13 GMT
- Title: PRP: Propagating Universal Perturbations to Attack Large Language Model
Guard-Rails
- Authors: Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran,
Kassem Fawaz, Somesh Jha, Atul Prakash
- Abstract summary: Large language models (LLMs) are typically aligned to be harmless to humans.
Recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content.
Our key contribution is to show a novel attack strategy, PRP, that is successful against several open-source (e.g., Llama 2) and closed-source (e.g., GPT 3.5) implementations of Guard Models.
- Score: 26.090757124460552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are typically aligned to be harmless to humans.
Unfortunately, recent work has shown that such models are susceptible to
automated jailbreak attacks that induce them to generate harmful content. More
recent LLMs often incorporate an additional layer of defense, a Guard Model,
which is a second LLM that is designed to check and moderate the output
response of the primary LLM. Our key contribution is to show a novel attack
strategy, PRP, that is successful against several open-source (e.g., Llama 2)
and closed-source (e.g., GPT 3.5) implementations of Guard Models. PRP
leverages a two step prefix-based attack that operates by (a) constructing a
universal adversarial prefix for the Guard Model, and (b) propagating this
prefix to the response. We find that this procedure is effective across
multiple threat models, including ones in which the adversary has no access to
the Guard Model at all. Our work suggests that further advances are required on
defenses and Guard Models before they can be considered effective.
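To make the abstract's two-step structure concrete, here is a minimal Python sketch of how such a prefix-based attack could be organized. It is an illustration under explicit assumptions rather than the paper's actual method: `guard` is a hypothetical callable returning the Guard Model's harmfulness score, the random token-swap loop is a simplified stand-in for the optimization an attacker would actually run, and the propagation step simply instructs the primary LLM to begin its reply with the prefix.

```python
import random
from typing import Callable, List

# Hypothetical interface: `guard` maps a candidate response string to the Guard
# Model's probability of flagging it as harmful. The name and the simple random
# token-swap search below are illustrative stand-ins, not the paper's procedure.
GuardFn = Callable[[str], float]


def find_universal_prefix(guard: GuardFn,
                          harmful_responses: List[str],
                          vocab: List[str],
                          prefix_len: int = 10,
                          iters: int = 200,
                          seed: int = 0) -> str:
    """Step (a): search for one prefix that lowers the guard's average
    harmfulness score across a set of known harmful responses, making it
    'universal' rather than tied to a single response.
    Assumes `harmful_responses` is non-empty."""
    rng = random.Random(seed)
    prefix = [rng.choice(vocab) for _ in range(prefix_len)]

    def avg_score(tokens: List[str]) -> float:
        joined = " ".join(tokens)
        return sum(guard(joined + " " + r) for r in harmful_responses) / len(harmful_responses)

    best = avg_score(prefix)
    for _ in range(iters):
        i = rng.randrange(prefix_len)
        candidate = list(prefix)
        candidate[i] = rng.choice(vocab)
        score = avg_score(candidate)
        if score < best:  # lower score = guard is less likely to flag the response
            prefix, best = candidate, score
    return " ".join(prefix)


def propagate_prefix(jailbreak_prompt: str, universal_prefix: str) -> str:
    """Step (b): the adversary only controls the input to the primary LLM, so
    the prompt asks the model to begin its reply with the prefix, causing the
    prefix to appear in the response that the Guard Model later inspects."""
    return (jailbreak_prompt
            + '\n\nBegin your reply with the exact text: "' + universal_prefix + '"')
```

In the threat models where the adversary has no access to the Guard Model at all, the prefix would have to be computed against a surrogate guard and relied on to transfer.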
Related papers
- Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models [48.36985844329255]
Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks.
Due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks.
We propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs.
arXiv Detail & Related papers (2025-05-29T15:37:23Z)
- Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.
Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.
We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
- Jailbreaking? One Step Is Enough! [6.142918017301964]
Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs.
We propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the "defense" intention.
To enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples.
arXiv Detail & Related papers (2024-12-17T07:33:41Z)
- FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks [7.31505609352525]
Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content.
We propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness.
Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested.
arXiv Detail & Related papers (2024-12-10T17:02:28Z)
- DROJ: A Prompt-Driven Attack against Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks.
Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks.
We introduce a novel approach, Directed Representation Optimization Jailbreak (DROJ).
arXiv Detail & Related papers (2024-11-14T01:48:08Z)
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model includes a constraint on perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks [27.11523234556414]
We propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG).
PG guides the model to identify harmful prompts by directly setting the first few tokens of the model's output.
We demonstrate the effectiveness of PG across three models and five attack methods.
arXiv Detail & Related papers (2024-08-15T14:51:32Z)
- SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend.
We empirically validate SelfDefend using the commonly used GPT-3.5/4 models across all major jailbreak attacks.
Tuned open-source defense models outperform six state-of-the-art defenses and match the performance of GPT-4-based SelfDefend.
arXiv Detail & Related papers (2024-06-08T15:45:31Z)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs).
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
- TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models [69.37990698561299]
TrojFM is a novel backdoor attack tailored for very large foundation models.
Our approach injects backdoors by fine-tuning only a very small proportion of model parameters.
We demonstrate that TrojFM can launch effective backdoor attacks against widely used large GPT-style models.
arXiv Detail & Related papers (2024-05-27T03:10:57Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs (see the illustrative sketch after this list).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
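The sketch below, referenced from the SmoothLLM entry above, illustrates that defense's perturb-and-aggregate idea in minimal form. The `is_refused` callable, the character swap rate, and the number of copies are illustrative assumptions, not values taken from the SmoothLLM paper.

```python
import random
import string
from typing import Callable

# Hypothetical interface: `is_refused` stands in for whatever check decides
# whether the target LLM refused a given prompt (e.g., a refusal-keyword match).
IsRefusedFn = Callable[[str], bool]


def perturb(prompt: str, swap_rate: float, rng: random.Random) -> str:
    """Randomly swap a fraction of characters; adversarial suffixes tend to be
    brittle to this kind of character-level noise."""
    chars = list(prompt)
    for i in range(len(chars)):
        if rng.random() < swap_rate:
            chars[i] = rng.choice(string.printable)
    return "".join(chars)


def flags_as_adversarial(prompt: str,
                         is_refused: IsRefusedFn,
                         n_copies: int = 10,
                         swap_rate: float = 0.1,
                         seed: int = 0) -> bool:
    """Query the model on several perturbed copies and majority-vote: if most
    perturbed copies are refused, treat the original prompt as a jailbreak."""
    rng = random.Random(seed)
    votes = sum(is_refused(perturb(prompt, swap_rate, rng)) for _ in range(n_copies))
    return votes > n_copies / 2
```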