UOR: Universal Backdoor Attacks on Pre-trained Language Models
- URL: http://arxiv.org/abs/2305.09574v1
- Date: Tue, 16 May 2023 16:11:48 GMT
- Title: UOR: Universal Backdoor Attacks on Pre-trained Language Models
- Authors: Wei Du, Peixuan Li, Boqun Li, Haodong Zhao, Gongshen Liu
- Abstract summary: Most existing backdoor attacks against pre-trained language models (PLMs) are untargeted and task-specific.
We first summarize the requirements that a more threatening backdoor attack against PLMs should satisfy, and then propose a new backdoor attack method called UOR.
Specifically, we define poisoned supervised contrastive learning, which automatically learns more uniform and universal output representations of triggers for various PLMs.
- Score: 9.968755838867178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backdoors implanted in pre-trained language models (PLMs) can be transferred
to various downstream tasks, which exposes a severe security threat. However,
most existing backdoor attacks against PLMs are untargeted and task-specific.
The few targeted and task-agnostic methods rely on manually pre-defined triggers
and output representations, which keeps the attacks from being more effective
and general. In this paper, we first summarize the requirements that a more
threatening backdoor attack against PLMs should satisfy, and then propose a new
backdoor attack method called UOR, which breaks the bottleneck of previous
approaches by turning manual selection into automatic optimization. Specifically,
we define poisoned supervised contrastive learning, which automatically learns
more uniform and universal output representations of triggers for various PLMs.
Moreover, we use gradient search to select trigger words that adapt to different
PLMs and vocabularies. Experiments show that our method achieves better attack
performance on various text classification tasks than manual methods. Further,
we tested our method on PLMs with different architectures, different usage
paradigms, and more difficult tasks, demonstrating the universality of our method.
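The abstract names two components: a poisoned supervised contrastive objective that shapes the output representations of triggered inputs, and a gradient search over the vocabulary for trigger words. The paper's code is not reproduced on this page, so the sketch below is only a minimal illustration of the contrastive component, written in PyTorch under stated assumptions: batches in which every sample carries an integer trigger id (0 for clean text, 1..K for the K implanted triggers), and an attacker who poisons the encoder's [CLS] representation. All names are illustrative, not the authors' implementation.

```python
# Minimal sketch of a "poisoned supervised contrastive" objective as described
# in the abstract, NOT the authors' released code. Assumption: each batch item
# is labeled with an integer trigger id (0 = clean, 1..K = implanted triggers),
# and the [CLS] vector is the output representation being poisoned.
import torch
import torch.nn.functional as F


def poisoned_supcon_loss(cls_embeddings: torch.Tensor,
                         trigger_ids: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Pull together [CLS] vectors of samples carrying the same trigger and
    push apart clean samples and samples carrying different triggers."""
    z = F.normalize(cls_embeddings, dim=-1)            # (batch, hidden)
    sim = z @ z.t() / temperature                       # pairwise similarities
    batch = z.size(0)

    eye = torch.eye(batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e4)                     # exclude self-pairs

    # Positive pairs share the same trigger id (diagonal excluded).
    positives = (trigger_ids.unsqueeze(0) == trigger_ids.unsqueeze(1)) & ~eye

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = positives.sum(dim=1).clamp(min=1)
    # Mean log-probability of the positives for each anchor; anchors without
    # any positive in the batch contribute zero.
    loss = -(log_prob * positives).sum(dim=1) / pos_count
    return loss.mean()


# Hypothetical training step (encoder and batch names are placeholders):
#   cls = encoder(**batch).last_hidden_state[:, 0]      # [CLS] vectors
#   loss = poisoned_supcon_loss(cls, batch_trigger_ids)
#   loss.backward()
```

The second component, gradient search for trigger words, would plausibly resemble HotFlip-style vocabulary search, scoring candidate tokens by the gradient of a loss like the one above with respect to their input embeddings; the abstract gives no details, so no sketch is attempted for it here.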
Related papers
- Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.
As LLMs continue to evolve, new vulnerabilities arise, especially prompt injection attacks.
Recent attack methods leverage LLMs' instruction-following abilities and their inability to distinguish instructions injected into the data content.
(arXiv, 2024-11-01)
- Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues [88.96201324719205]
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions.
We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory.
(arXiv, 2024-10-14)
- Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via situation-driven contextual rewriting.
We combine an independent, meaningful adversarial insertion with situations derived from movies to check whether this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
(arXiv, 2024-07-19)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
(arXiv, 2024-05-28)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
(arXiv, 2024-02-25)
- VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
(arXiv, 2024-02-21)
- Modeling Adversarial Attack on Pre-trained Language Models as Sequential Decision Making [10.425483543802846]
Research on the adversarial attack task has found that pre-trained language models (PLMs) are vulnerable to small perturbations.
In this paper, we model the adversarial attack task on PLMs as a sequential decision-making problem.
We propose SDM-Attack, which uses reinforcement learning to find an appropriate sequential attack path for generating adversaries.
(arXiv, 2023-05-27)
- Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
(arXiv, 2021-10-30)