Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm
- URL: http://arxiv.org/abs/2409.14119v3
- Date: Sun, 6 Oct 2024 09:43:25 GMT
- Title: Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm
- Authors: Jaehan Kim, Minkyoo Song, Seung Ho Na, Seungwon Shin
- Abstract summary: We introduce Obliviate, a PEFT-integrable backdoor defense.
We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens.
Our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks.
- Score: 8.905741632785183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, there is no practical defense solution available that effectively counters task-agnostic backdoors within the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of the state-of-the-art task-agnostic backdoors (83.6%$\downarrow$). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. The source code is available at https://github.com/obliviateARR/Obliviate.
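The abstract describes two mechanisms: amplifying benign neurons inside the PEFT layers and penalizing the influence of trigger tokens. As a rough illustration of how such terms could be attached to a PEFT training loop, the PyTorch-style sketch below adds an amplification reward on the PEFT (e.g., LoRA) parameter norms and a gradient-based token-influence penalty; the function name `defense_loss`, the hyperparameters `lambda_amp`/`lambda_tok`, and the specific formulations are assumptions for illustration, not the paper's actual objective.

```python
import torch

def defense_loss(task_loss, peft_params, token_embeds,
                 lambda_amp=1e-3, lambda_tok=1e-2):
    """Illustrative sketch (not the paper's exact objective): augment the
    PEFT task loss with
      (1) an amplification term rewarding larger benign PEFT (e.g. LoRA)
          weight norms, and
      (2) a penalty on per-token influence, estimated by the gradient norm
          of the task loss w.r.t. each token embedding, so that no single
          (potentially trigger) token dominates the prediction."""
    # (1) Amplify benign PEFT weights: negative norm acts as a reward.
    amp_term = -sum(p.norm() for p in peft_params)

    # (2) Per-token influence via input-embedding gradients.
    grads = torch.autograd.grad(task_loss, token_embeds,
                                retain_graph=True, create_graph=True)[0]
    token_influence = grads.norm(dim=-1)                  # (batch, seq_len)
    tok_term = token_influence.max(dim=-1).values.mean()  # dominant-token penalty

    return task_loss + lambda_amp * amp_term + lambda_tok * tok_term

# Usage sketch inside a PEFT training step (the model is called with
# inputs_embeds so the token embeddings stay in the autograd graph):
# embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
# task_loss = model(inputs_embeds=embeds, attention_mask=mask, labels=labels).loss
# loss = defense_loss(task_loss, lora_parameters, embeds)
# loss.backward(); optimizer.step()
```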
Related papers
- Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models [9.995807326278959]
We propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage.
Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters.
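Finding (1) suggests a straightforward screening step: compare each row of the backdoored model's word-embedding matrix against a clean reference and flag outliers. The sketch below illustrates that idea; the z-score threshold and the resetting of flagged rows are assumptions for illustration rather than BTU's exact procedure.

```python
import torch

def flag_and_reset_backdoor_tokens(tuned_emb, ref_emb, z_thresh=4.0):
    """Illustrative sketch of finding (1): tokens whose embedding parameters
    drifted abnormally far from a clean reference are treated as backdoor
    token candidates and reset. The z-score threshold is an assumption,
    not BTU's exact procedure.

    tuned_emb, ref_emb: (vocab_size, hidden_dim) embedding weight matrices.
    """
    drift = (tuned_emb - ref_emb).norm(dim=-1)          # per-token drift
    z = (drift - drift.mean()) / (drift.std() + 1e-8)   # standardize drifts
    suspicious = torch.nonzero(z > z_thresh).flatten()  # outlier token ids

    cleaned = tuned_emb.clone()
    cleaned[suspicious] = ref_emb[suspicious]            # "unlearn" those rows
    return cleaned, suspicious
```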
arXiv Detail & Related papers (2025-01-05T03:22:13Z) - PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning [8.61459170031022]
This paper introduces a novel security threat to FedPEFT, termed PEFT-as-an-Attack (PaaA).
Our evaluation of PaaA reveals that with less than 1% of the model's parameters set as trainable, and a small subset of clients acting maliciously, the attack achieves an approximate 80% attack success rate using representative PEFT methods such as LoRA.
Our results underscore the urgent need for more effective defense mechanisms that simultaneously ensure security and maintain the performance of the FedPEFT paradigm.
arXiv Detail & Related papers (2024-11-28T19:05:01Z) - Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation [10.888542040021962]
W2SDefense is a weak-to-strong unlearning algorithm to defend against backdoor attacks.
We conduct experiments on text classification tasks involving three state-of-the-art language models and three different backdoor attack algorithms.
arXiv Detail & Related papers (2024-10-18T12:39:32Z) - Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z) - Weak-to-Strong Backdoor Attack for Large Language Models [15.055037707091435]
We propose W2SAttack, a novel weak-to-strong backdoor attack algorithm based on feature alignment-enhanced knowledge distillation.
We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models.
arXiv Detail & Related papers (2024-09-26T15:20:37Z) - TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models [69.37990698561299]
TrojFM is a novel backdoor attack tailored for very large foundation models.
Our approach injects backdoors by fine-tuning only a very small proportion of model parameters.
We demonstrate that TrojFM can launch effective backdoor attacks against widely used large GPT-style models.
arXiv Detail & Related papers (2024-05-27T03:10:57Z) - Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models.
In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned.
We propose a novel defense approach called PDB (Proactive Defensive Backdoor).
arXiv Detail & Related papers (2024-05-25T07:52:26Z) - Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning [57.50274256088251]
We show that parameter-efficient fine-tuning (PEFT) is more susceptible to weight-poisoning backdoor attacks than full-parameter fine-tuning.
We develop a Poisoned Sample Identification Module (PSIM) leveraging PEFT, which identifies poisoned samples through prediction confidence.
We conduct experiments on text classification tasks, five fine-tuning strategies, and three weight-poisoning backdoor attack methods.
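Since PSIM flags poisoned samples via prediction confidence, a generic sketch of confidence-based filtering is given below; the fixed threshold `tau` and the HuggingFace-style `.logits` output are assumptions for illustration, not PSIM's actual module.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_by_confidence(model, dataloader, tau=0.99, device="cpu"):
    """Illustrative confidence-based filtering: samples on which the screening
    model is abnormally confident are flagged as likely poisoned. The fixed
    threshold tau is an assumption for illustration, not PSIM's design."""
    clean_idx, poisoned_idx = [], []
    model.eval()
    for batch_idx, (input_ids, attention_mask, labels) in enumerate(dataloader):
        logits = model(input_ids=input_ids.to(device),
                       attention_mask=attention_mask.to(device)).logits
        conf = F.softmax(logits, dim=-1).max(dim=-1).values  # (batch,)
        for i, c in enumerate(conf.tolist()):
            (poisoned_idx if c >= tau else clean_idx).append((batch_idx, i))
    return clean_idx, poisoned_idx
```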
arXiv Detail & Related papers (2024-02-19T14:22:54Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals that, in this practical scenario, backdoor attacks can remain effective even after defenses are applied.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
Backdoor attacks are an emerging yet serious training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.