Soft Prompt Threats: Attacking Safety Alignment and Unlearning in
Open-Source LLMs through the Embedding Space
- URL: http://arxiv.org/abs/2402.09063v1
- Date: Wed, 14 Feb 2024 10:20:03 GMT
- Title: Soft Prompt Threats: Attacking Safety Alignment and Unlearning in
Open-Source LLMs through the Embedding Space
- Authors: Leo Schwinn and David Dobre and Sophie Xhonneux and Gauthier Gidel and
Stephan Gunnemann
- Abstract summary: We propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens.
We show that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning.
Our findings highlight embedding space attacks as an important threat model in open-source LLMs.
- Score: 19.426618259383126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current research in adversarial robustness of LLMs focuses on discrete input
manipulations in the natural language space, which can be directly transferred
to closed-source models. However, this approach neglects the steady progression
of open-source models. As open-source models advance in capability, ensuring
their safety also becomes increasingly imperative. Yet, attacks tailored to
open-source LLMs that exploit full model access remain largely unexplored. We
address this research gap and propose the embedding space attack, which
directly attacks the continuous embedding representation of input tokens. We
find that embedding space attacks circumvent model alignments and trigger
harmful behaviors more efficiently than discrete attacks or model fine-tuning.
Furthermore, we present a novel threat model in the context of unlearning and
show that embedding space attacks can extract supposedly deleted information
from unlearned LLMs across multiple datasets and models. Our findings highlight
embedding space attacks as an important threat model in open-source LLMs.
Trigger Warning: the appendix contains LLM-generated text with violence and
harassment.
Related papers
- Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models [8.024771725860127]
Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms.
We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources.
arXiv Detail & Related papers (2024-10-05T15:10:01Z) - Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models [27.397408870544453]
Large Language Models (LLMs) have become a focal point in the rapidly evolving field of artificial intelligence.
A critical concern is the presence of toxic content within the pre-training corpus of these models, which can lead to the generation of inappropriate outputs.
This paper proposes a target-driven attack paradigm that focuses on directly eliciting the target response instead of optimizing the prompts.
arXiv Detail & Related papers (2024-08-27T08:12:08Z) - A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z) - DALD: Improving Logits-based Detector without Logits from Black-box LLMs [56.234109491884126]
Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing.
We present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection.
DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations.
arXiv Detail & Related papers (2024-06-07T19:38:05Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker.
We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting.
We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - On the Safety of Open-Sourced Large Language Models: Does Alignment
Really Prevent Them From Being Misused? [49.99955642001019]
We show that open-sourced, aligned large language models could be easily misguided to generate undesired content.
Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide it to generate undesired content.
arXiv Detail & Related papers (2023-10-02T19:22:01Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.