UOR: Universal Backdoor Attacks on Pre-trained Language Models
- URL: http://arxiv.org/abs/2305.09574v1
- Date: Tue, 16 May 2023 16:11:48 GMT
- Title: UOR: Universal Backdoor Attacks on Pre-trained Language Models
- Authors: Wei Du, Peixuan Li, Boqun Li, Haodong Zhao, Gongshen Liu
- Abstract summary: Most existing backdoor attacks against pre-trained language models (PLMs) are untargeted and task-specific.
We first summarize the requirements that a more threatening backdoor attack against PLMs should satisfy, and then propose a new backdoor attack method called UOR.
Specifically, we define poisoned supervised contrastive learning, which automatically learns more uniform and universal output representations of triggers for various PLMs.
- Score: 9.968755838867178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backdoors implanted in pre-trained language models (PLMs) can be transferred
to various downstream tasks, which exposes a severe security threat. However,
most existing backdoor attacks against PLMs are untargeted and task-specific.
The few targeted and task-agnostic methods rely on manually pre-defined triggers
and output representations, which keeps the attacks from being more effective
and general. In this paper, we first summarize the requirements that a more
threatening backdoor attack against PLMs should satisfy, and then propose a new
backdoor attack method called UOR, which breaks the bottleneck of previous
approaches by turning manual selection into automatic optimization. Specifically,
we define poisoned supervised contrastive learning, which automatically learns
more uniform and universal output representations of triggers for various PLMs.
Moreover, we use gradient search to select trigger words that adapt to different
PLMs and vocabularies. Experiments show that our method achieves better attack
performance on various text classification tasks than manual methods. Further,
we tested our method on PLMs with different architectures, different usage
paradigms, and more difficult tasks, demonstrating the universality of our method.
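The abstract names two components: a poisoned supervised contrastive objective that shapes the output representations of triggered inputs, and a gradient search over the vocabulary for trigger words. The paper's code is not reproduced on this page, so the sketch below is only a minimal illustration of the contrastive component, written in PyTorch under stated assumptions: batches in which every sample carries an integer trigger id (0 for clean text, 1..K for the K implanted triggers), and an attacker who poisons the encoder's [CLS] representation. All names are illustrative, not the authors' implementation.

```python
# Minimal sketch of a "poisoned supervised contrastive" objective as described
# in the abstract, NOT the authors' released code. Assumption: each batch item
# is labeled with an integer trigger id (0 = clean, 1..K = implanted triggers),
# and the [CLS] vector is the output representation being poisoned.
import torch
import torch.nn.functional as F


def poisoned_supcon_loss(cls_embeddings: torch.Tensor,
                         trigger_ids: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Pull together [CLS] vectors of samples carrying the same trigger and
    push apart clean samples and samples carrying different triggers."""
    z = F.normalize(cls_embeddings, dim=-1)            # (batch, hidden)
    sim = z @ z.t() / temperature                       # pairwise similarities
    batch = z.size(0)

    eye = torch.eye(batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e4)                     # exclude self-pairs

    # Positive pairs share the same trigger id (diagonal excluded).
    positives = (trigger_ids.unsqueeze(0) == trigger_ids.unsqueeze(1)) & ~eye

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = positives.sum(dim=1).clamp(min=1)
    # Mean log-probability of the positives for each anchor; anchors without
    # any positive in the batch contribute zero.
    loss = -(log_prob * positives).sum(dim=1) / pos_count
    return loss.mean()


# Hypothetical training step (encoder and batch names are placeholders):
#   cls = encoder(**batch).last_hidden_state[:, 0]      # [CLS] vectors
#   loss = poisoned_supcon_loss(cls, batch_trigger_ids)
#   loss.backward()
```

The second component, gradient search for trigger words, would plausibly resemble HotFlip-style vocabulary search, scoring candidate tokens by the gradient of a loss like the one above with respect to their input embeddings; the abstract gives no details, so no sketch is attempted for it here.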
Related papers
- Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.
As LLMs continue to evolve, new vulnerabilities arise, especially prompt injection attacks.
Recent attack methods leverage LLMs' instruction-following abilities and their inability to distinguish instructions injected into the data content.
(arXiv, 2024-11-01)
- Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues [88.96201324719205]
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions.
We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory.
(arXiv, 2024-10-14)
- Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via situation-driven contextual rewriting.
We combine an independent, meaningful adversarial insertion with situations derived from movies to check whether this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
(arXiv, 2024-07-19)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
(arXiv, 2024-05-28)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
(arXiv, 2024-02-25)
- VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
(arXiv, 2024-02-21)
- Modeling Adversarial Attack on Pre-trained Language Models as Sequential Decision Making [10.425483543802846]
Research on the adversarial attack task has found that pre-trained language models (PLMs) are vulnerable to small perturbations.
In this paper, we model the adversarial attack task on PLMs as a sequential decision-making problem.
We propose SDM-Attack, which uses reinforcement learning to find an appropriate sequential attack path for generating adversaries.
(arXiv, 2023-05-27)
- Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
(arXiv, 2021-10-30)