Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining
- URL: http://arxiv.org/abs/2412.02454v1
- Date: Tue, 03 Dec 2024 13:43:36 GMT
- Title: Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining
- Authors: Zongru Wu, Pengzhou Cheng, Lingyong Fang, Zhuosheng Zhang, Gongshen Liu
- Abstract summary: Backdoor attacks are significant security threats to generative large language models (LLMs).
GraCeFul uses sample-wise gradients in the frequency space to identify backdoor samples without retraining the LLM.
GraCeFul exhibits remarkable computational efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor samples.
- Score: 16.76094864625033
- Abstract: Backdoor attacks remain significant security threats to generative large language models (LLMs). Since generative LLMs output sequences of high-dimensional token logits instead of low-dimensional classification logits, most existing backdoor defense methods designed for discriminative models like BERT are ineffective for generative LLMs. Inspired by the observed differences in learning behavior between backdoor and clean mappings in the frequency space, we transform the gradients of each training sample, which directly influence parameter updates, into the frequency space. Our findings reveal a distinct separation between the gradients of backdoor and clean samples in the frequency space. Based on this phenomenon, we propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients in the frequency space to effectively identify backdoor samples without requiring retraining of LLMs. Experimental results show that GraCeFul outperforms baselines significantly. Notably, GraCeFul exhibits remarkable computational efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor samples and reducing the average success rate of various backdoor attacks to 0% with negligible drops in clean accuracy across multiple free-style question answering datasets. Additionally, GraCeFul generalizes to Llama-2 and Vicuna. The code is publicly available at https://github.com/ZrW00/GraceFul.
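A minimal sketch of the filtering idea follows, assuming a Hugging Face-style causal LM whose per-sample loss gradients are taken from a single target parameter, transformed with a DCT, and split by hierarchical clustering; the layer choice, number of retained coefficients, and clustering settings are illustrative assumptions rather than the paper's exact configuration (see the linked repository for the official implementation).

```python
# Hedged sketch of a GraCeFul-style filter: per-sample gradients -> DCT ->
# hierarchical clustering. Layer choice, coefficient count, and clustering
# settings are illustrative assumptions, not the official implementation.
import numpy as np
from scipy.fft import dct
from scipy.cluster.hierarchy import linkage, fcluster

def gradient_frequency_features(model, samples, target_param, n_coeffs=4096):
    """Collect low-frequency DCT coefficients of each training sample's gradient."""
    feats = []
    for batch in samples:                       # one sample per batch
        model.zero_grad()
        loss = model(**batch).loss              # assumes a HF-style causal LM loss
        loss.backward()
        grad = target_param.grad.detach().flatten().cpu().numpy()
        feats.append(dct(grad, norm="ortho")[:n_coeffs])
    return np.stack(feats)

def flag_backdoor_samples(features):
    """Cluster frequency features and treat the minority cluster as backdoored."""
    Z = linkage(features, method="average", metric="cosine")
    labels = fcluster(Z, t=2, criterion="maxclust")
    minority = 1 if (labels == 1).sum() <= (labels == 2).sum() else 2
    return np.where(labels == minority)[0]      # indices of suspected samples
```

Treating the smaller cluster as the backdoored one is a simplifying heuristic in this sketch; the actual method's cluster-assignment rule may differ.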
Related papers
- CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization [7.282200564983221]
Large Language Models (LLMs) are susceptible to backdoor attacks.
We introduce Internal Consistency Regularization (CROW) to address layer-wise inconsistencies caused by backdoor triggers.
CROW consistently achieves significant reductions in attack success rates across diverse backdoor strategies and tasks.
arXiv Detail & Related papers (2024-11-18T07:52:12Z) - Backdoor Defense through Self-Supervised and Generative Learning [0.0]
Training on poisoned data injects a backdoor that causes malicious inference on selected test samples.
This paper explores an approach based on generative modelling of per-class distributions in a self-supervised representation space.
In both cases, we find that per-class generative models make it possible to detect poisoned data and cleanse the dataset.
arXiv Detail & Related papers (2024-09-02T11:40:01Z) - MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - UFID: A Unified Framework for Input-level Backdoor Detection on Diffusion Models [19.46962670935554]
Diffusion models are vulnerable to backdoor attacks.
We propose a black-box input-level backdoor detection framework for diffusion models, called UFID.
Our method achieves superb performance on detection effectiveness and run-time efficiency.
arXiv Detail & Related papers (2024-04-01T13:21:05Z) - Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space [17.98191594223406]
We investigate the learning mechanisms of backdoor LMs in the frequency space by Fourier analysis.
We propose Multi-Scale Low-Rank Adaptation (MuScleLoRA), which deploys multiple radial scalings in the frequency space with low-rank adaptation to the target model.
MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15% across multiple datasets (a hedged sketch of the frequency-space downscaling idea appears after this list).
arXiv Detail & Related papers (2024-02-19T10:34:48Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Backdoor Learning on Sequence to Sequence Models [94.23904400441957]
In this paper, we study whether sequence-to-sequence (seq2seq) models are vulnerable to backdoor attacks.
Specifically, we find that by injecting only 0.2% of the dataset's samples, we can cause the seq2seq model to generate the designated keyword and even the whole sentence.
Extensive experiments on machine translation and text summarization show that the proposed methods achieve over a 90% attack success rate on multiple datasets and models.
arXiv Detail & Related papers (2023-05-03T20:31:13Z) - CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z)
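As referenced in the MuScleLoRA entry above, the combination of frequency-space downscaling with low-rank adaptation can be illustrated with a short, hedged sketch; the Gaussian radial profile, the scale values, and the choice to apply the scaling to the low-rank gradients are assumptions for illustration, not the published configuration.

```python
# Hedged sketch of frequency-space downscaling paired with a LoRA-style
# branch (MuScleLoRA-like). The Gaussian radial profile, scale values, and
# gradient-level application are illustrative assumptions.
import torch

def radial_downscale(vec: torch.Tensor, scales=(1.0, 2.0, 4.0)) -> torch.Tensor:
    """Attenuate high-frequency components of a 1-D tensor at several radial scales."""
    freq = torch.fft.rfft(vec)
    radius = torch.linspace(0.0, 1.0, freq.numel(), device=vec.device)
    filtered = sum(freq * torch.exp(-(radius * s) ** 2) for s in scales) / len(scales)
    return torch.fft.irfft(filtered, n=vec.numel())

class LoRALinear(torch.nn.Module):
    """Frozen base linear layer plus a trainable low-rank branch."""
    def __init__(self, base: torch.nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = torch.nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

def downscale_lora_grads(layer: LoRALinear):
    """After loss.backward(), low-pass the low-rank gradients before the optimizer step."""
    for p in (layer.A, layer.B):
        if p.grad is not None:
            p.grad.copy_(radial_downscale(p.grad.flatten()).reshape(p.grad.shape))
```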
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.