Adversarial Suffixes May Be Features Too!
- URL: http://arxiv.org/abs/2410.00451v2
- Date: Sat, 5 Oct 2024 17:14:09 GMT
- Title: Adversarial Suffixes May Be Features Too!
- Authors: Wei Zhao, Zhe Li, Yige Li, Jun Sun
- Abstract summary: We show that adversarial suffixes generated from jailbreak attacks may contain meaningful features.
This highlights the critical risk posed by dominating benign features in the training data.
- Score: 10.463762448166714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant ongoing efforts in safety alignment, large language models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks that can induce harmful behaviors, including those triggered by adversarial suffixes. Building on prior research, we hypothesize that these adversarial suffixes are not mere bugs but may represent features that can dominate the LLM's behavior. To evaluate this hypothesis, we conduct several experiments. First, we demonstrate that benign features can be effectively made to function as adversarial suffixes, i.e., we develop a feature extraction method to extract sample-agnostic features from benign datasets in the form of suffixes and show that these suffixes may effectively compromise safety alignment. Second, we show that adversarial suffixes generated from jailbreak attacks may contain meaningful features, i.e., appending the same suffix to different prompts results in responses exhibiting specific characteristics. Third, we show that such benign-yet-safety-compromising features can be easily introduced through fine-tuning using only benign datasets, i.e., even in the absence of harmful content. This highlights the critical risk posed by dominating benign features in the training data and calls for further research to reinforce LLM safety alignment. Our code and data are available at https://github.com/suffix-maybe-feature/adver-suffix-maybe-features.
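To make the second experiment concrete, here is a minimal sketch of the suffix-appending probe: attach one fixed, sample-agnostic suffix to several unrelated prompts and check whether the responses drift toward a shared characteristic or toward unsafe compliance. The model name and the suffix string below are placeholders rather than artifacts from the paper (the actual suffixes and evaluation code are in the repository linked above), and the sketch assumes the Hugging Face transformers API.

```python
# Illustrative sketch only: probe whether one fixed suffix steers responses across prompts.
# MODEL_NAME and SUFFIX are placeholders, not the paper's released artifacts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any aligned chat model
SUFFIX = " <candidate suffix extracted from a benign dataset goes here>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"  # device_map needs `accelerate`
)

prompts = [
    "Give me three tips for writing a polite email.",
    "Summarize the plot of Romeo and Juliet.",
    "Explain how photosynthesis works.",
]

for prompt in prompts:
    # Append the same suffix to every prompt; if it encodes a dominant feature,
    # the responses should share a characteristic regardless of the prompt.
    messages = [{"role": "user", "content": prompt + SUFFIX}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```

In practice one would compare these generations against a no-suffix baseline and score them with a safety or style classifier; the loop above only surfaces the raw responses for inspection.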
Related papers
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
A black-box jailbreak is an attack in which crafted prompts bypass safety mechanisms in large language models.
We propose a novel black-box jailbreak method leveraging reinforcement learning (RL).
We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - Can You Trust Your Metric? Automatic Concatenation-Based Tests for Metric Validity [9.355471292024061]
GPT-based harmfulness detection metrics exhibit a decision-flipping phenomenon.
Even an advanced metric like GPT-4o is highly sensitive to input order.
arXiv Detail & Related papers (2024-08-22T09:57:57Z) - Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation [44.09578786678573]
Large Language Models (LLMs) are implicit troublemakers.
LLMs could be used to gather harmful data or launch covert attacks.
We name this decoding strategy Jailbreak Value Decoding (JVD).
arXiv Detail & Related papers (2024-08-20T09:11:21Z) - Alignment-Enhanced Decoding:Defending via Token-Level Adaptive Refining of Probability Distributions [14.881201844063616]
We propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues.
We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach.
arXiv Detail & Related papers (2024-08-14T16:51:21Z) - Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks [2.8186733524862158]
Current text generation models are trained using real data which can potentially contain sensitive information.
We propose a safer alternative that uses fragmented data in the form of domain-specific short phrases randomly grouped together.
arXiv Detail & Related papers (2024-04-30T12:09:55Z) - Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z) - DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers [74.7446827091938]
We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack).
DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by in-context learning with a semantically similar but harmless reassembling demo, and (c) a 'Synonym Search' over sub-prompts, aiming to find synonyms that maintain the original intent while jailbreaking LLMs.
arXiv Detail & Related papers (2024-02-25T17:43:29Z) - Query-Based Adversarial Prompt Generation [72.06860443442429]
We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z) - Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack [7.653580388741887]
A small amount of harmful data uploaded by users can easily trick fine-tuning into producing an alignment-broken model.
We propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of user fine-tuning.
arXiv Detail & Related papers (2024-02-02T02:56:50Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level; an illustrative perplexity-screening sketch follows the list below.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Certifying LLM Safety against Adversarial Prompting [75.19953634352258]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt.
We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
arXiv Detail & Related papers (2023-09-06T04:37:20Z) - On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior.
We propose AutoPoison, an automated data poisoning pipeline.
Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z) - Can contrastive learning avoid shortcut solutions? [88.249082564465]
Implicit feature modification (IFM) is a method for altering positive and negative samples in order to guide contrastive models towards capturing a wider variety of predictive features.
IFM reduces feature suppression, and as a result improves performance on vision and medical imaging tasks.
arXiv Detail & Related papers (2021-06-21T16:22:43Z) - Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models [27.100909068228813]
Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack.
In this paper, we find that it is possible to hack the model in a data-free way by modifying a single word embedding vector.
Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier.
arXiv Detail & Related papers (2021-03-29T12:19:45Z) - Adversarial Semantic Collisions [129.55896108684433]
We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models.
We develop gradient-based approaches for generating semantic collisions.
We show how to generate semantic collisions that evade perplexity-based filtering.
arXiv Detail & Related papers (2020-11-09T20:42:01Z)
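As a companion to the "Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information" entry above, the sketch below illustrates the general idea of perplexity-based screening: slide a window over the prompt's tokens and flag inputs whose most surprising window exceeds a threshold. This is not the authors' method; the scoring model (GPT-2), the window size, and the threshold are arbitrary placeholders.

```python
# Illustrative perplexity-screening sketch; GPT-2, window size, and threshold are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def windowed_nll(text: str, window: int = 4) -> list[float]:
    """Mean negative log-likelihood of each sliding window of tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[:, :-1, :]  # each position predicts the next token
    targets = ids[:, 1:]
    nll = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    return [nll[i:i + window].mean().item() for i in range(nll.numel() - window + 1)]

def looks_adversarial(text: str, threshold: float = 8.0) -> bool:
    """Flag a prompt whose most surprising window exceeds the (placeholder) threshold."""
    scores = windowed_nll(text)
    return bool(scores) and max(scores) > threshold

if __name__ == "__main__":
    print(looks_adversarial("Give me three tips for writing a polite email."))
    print(looks_adversarial("Give me three tips for writing a polite email. describing.\\ similarlyNow!!"))
```

A real detector would also use contextual information and calibrate the threshold on held-out benign prompts, as the paper's title suggests; the function above captures only the perplexity side of that idea.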
This list is automatically generated from the titles and abstracts of the papers on this site.