WeDef: Weakly Supervised Backdoor Defense for Text Classification
- URL: http://arxiv.org/abs/2205.11803v1
- Date: Tue, 24 May 2022 05:53:11 GMT
- Title: WeDef: Weakly Supervised Backdoor Defense for Text Classification
- Authors: Lesheng Jin, Zihan Wang, Jingbo Shang
- Abstract summary: Existing backdoor defense methods are only effective for limited trigger types.
We propose a novel weakly supervised backdoor defense framework WeDef.
We show that WeDef is effective against popular trigger-based attacks.
- Score: 48.19967241668793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing backdoor defense methods are only effective for limited trigger
types. To defend against different trigger types at once, we start from the
class-irrelevant nature of the poisoning process and propose a novel weakly
supervised backdoor defense framework WeDef. Recent advances in weak
supervision make it possible to train a reasonably accurate text classifier
using only a small number of user-provided, class-indicative seed words. Such
seed words shall be considered independent of the triggers. Therefore, a weakly
supervised text classifier trained by only the poisoned documents without their
labels will likely have no backdoor. Inspired by this observation, in WeDef, we
define the reliability of samples based on whether the predictions of the weak
classifier agree with their labels in the poisoned training set. We further
improve the results through a two-phase sanitization: (1) iteratively refine
the weak classifier based on the reliable samples and (2) train a binary poison
classifier by distinguishing the most unreliable samples from the most reliable
samples. Finally, we train the sanitized model on the samples that the poison
classifier predicts as benign. Extensive experiments show that WeDef is
effective against popular trigger-based attacks (e.g., words, sentences, and
paraphrases), outperforming existing defense methods.
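To make the pipeline above concrete, the following is a minimal, illustrative sketch of the reliability-based filtering rather than the paper's implementation: a simple seed-word voter stands in for the weakly supervised classifier, a TF-IDF + logistic regression model stands in for the poison classifier, and the iterative refinement of the weak classifier (phase 1) is omitted. All names and parameter choices below are assumptions made for illustration.

```python
# Illustrative sketch of WeDef-style reliability filtering (not the paper's code).
# A seed-word voter approximates the weakly supervised classifier; phase 1
# (iterative refinement of the weak classifier) is omitted for brevity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def weak_predict(texts, seed_words):
    """Predict the class whose seed words occur most often; -1 if none occur."""
    preds = []
    for text in texts:
        counts = [sum(text.lower().count(w) for w in words) for words in seed_words]
        preds.append(int(np.argmax(counts)) if max(counts) > 0 else -1)
    return np.array(preds)

def sanitize(texts, labels, seed_words):
    labels = np.asarray(labels)
    weak = weak_predict(texts, seed_words)
    reliable = weak == labels                      # weak prediction agrees with the given label
    unreliable = (weak != labels) & (weak != -1)   # confident disagreement

    # Phase 2: binary poison classifier separating the most unreliable samples
    # from the most reliable ones, then applied to the whole training set.
    features = TfidfVectorizer().fit_transform(texts)
    train_idx = np.where(reliable | unreliable)[0]
    poison_clf = LogisticRegression(max_iter=1000)
    poison_clf.fit(features[train_idx], unreliable[train_idx].astype(int))

    benign = poison_clf.predict(features) == 0     # keep samples predicted benign
    return [t for t, keep in zip(texts, benign) if keep], labels[benign]
```

Here `seed_words` is a list of per-class keyword lists (e.g., [["good", "great"], ["bad", "awful"]] for binary sentiment); because the user chooses these words independently of any trigger, the weak predictions they induce are unlikely to carry the backdoor.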
Related papers
- FCert: Certifiably Robust Few-Shot Classification in the Era of Foundation Models [38.019489232264796]
We propose FCert, the first certified defense against data poisoning attacks on few-shot classification.
Our experimental results show that FCert 1) maintains classification accuracy when there is no attack, 2) outperforms existing certified defenses against data poisoning attacks, and 3) is efficient and general.
arXiv Detail & Related papers (2024-04-12T17:50:40Z)
- Diffusion Denoising as a Certified Defense against Clean-label Poisoning
We show how an off-the-shelf diffusion model can sanitize the tampered training data.
We extensively test our defense against seven clean-label poisoning attacks and reduce their attack success to 0-16% with only a negligible drop in test-time accuracy.
arXiv Detail & Related papers (2024-03-18T17:17:07Z)
- ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z)
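As a minimal sketch of the paraphrase-consistency idea behind ParaFuzz (not the paper's implementation), the check below flags an input as suspicious when the victim model's prediction changes after paraphrasing, since a fluent paraphrase tends to drop surface-level triggers while preserving the semantics. The `classify` and `paraphrase` callables are hypothetical placeholders; the paper prompts ChatGPT for the paraphrasing step.

```python
# Sketch of test-time detection via paraphrase consistency (illustrative only).
from typing import Callable, List

def detect_poisoned(texts: List[str],
                    classify: Callable[[str], int],
                    paraphrase: Callable[[str], str]) -> List[bool]:
    """Flag inputs whose predicted label changes after paraphrasing."""
    flags = []
    for text in texts:
        original_label = classify(text)
        rewritten_label = classify(paraphrase(text))
        flags.append(original_label != rewritten_label)
    return flags
```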
- Towards Fair Classification against Poisoning Attacks [52.57443558122475]
We study the poisoning scenario where the attacker can insert a small fraction of samples into training data.
We propose a general and theoretically guaranteed framework which accommodates traditional defense methods to fair classification against poisoning attacks.
arXiv Detail & Related papers (2022-10-18T00:49:58Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
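CUBE is described above only as a clustering-based baseline; the sketch below shows one generic instantiation of that idea, where the choice of KMeans, the number of clusters, and the size threshold are all assumptions rather than the paper's recipe: within each observed label, cluster sentence embeddings and drop samples that fall into small minority clusters, which often collect the poisoned examples.

```python
# Illustrative clustering-based filtering in the spirit of a clustering defense
# baseline (all specifics are assumptions, not taken from the paper).
import numpy as np
from sklearn.cluster import KMeans

def filter_by_clustering(embeddings, labels, n_clusters=2, min_keep_ratio=0.5):
    """Return a boolean mask marking samples to keep for training."""
    embeddings, labels = np.asarray(embeddings), np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    for y in np.unique(labels):
        idx = np.where(labels == y)[0]
        if len(idx) < n_clusters:
            continue
        assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings[idx])
        sizes = np.bincount(assign, minlength=n_clusters)
        # keep only clusters whose size is comparable to the largest cluster
        good = np.where(sizes >= min_keep_ratio * sizes.max())[0]
        keep[idx[~np.isin(assign, good)]] = False
    return keep
```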
- BFClass: A Backdoor-free Text Classification Framework [21.762274809679692]
We propose BFClass, a novel efficient backdoor-free training framework for text classification.
The backbone of BFClass is a pre-trained discriminator that predicts whether each token in the corrupted input was replaced by a masked language model.
Extensive experiments demonstrate that BFClass can identify all the triggers, remove 95% of the poisoned training samples with very limited false alarms, and achieve almost the same performance as models trained on the benign training data.
arXiv Detail & Related papers (2021-09-22T17:28:21Z)
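The discriminator described above corresponds to the replaced-token-detection objective popularized by ELECTRA, so the token-scoring step might look roughly like the sketch below. The specific checkpoint and threshold are assumptions, and BFClass's subsequent trigger distillation and poisoned-sample removal steps are not shown.

```python
# Rough sketch of replaced-token scoring with an ELECTRA-style discriminator
# (checkpoint and threshold are assumptions; BFClass's trigger distillation
# and sample-removal steps are omitted).
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

def suspicious_tokens(text, threshold=0.5):
    """Return (token, score) pairs the discriminator rates as likely replaced."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        scores = torch.sigmoid(model(**inputs).logits[0])  # per-token replaced probability
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return [(tok, float(s)) for tok, s in zip(tokens, scores) if s > threshold]
```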
- Poisoned classifiers are not only backdoored, they are fundamentally broken [84.67778403778442]
Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data.
It is often assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger.
In this paper, we show empirically that this view of backdoored classifiers is incorrect.
arXiv Detail & Related papers (2020-10-18T19:42:44Z)
- Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder [78.01180944665089]
This paper demonstrates a fatal vulnerability in natural language inference (NLI) and text classification systems.
We present a 'backdoor poisoning' attack on NLP models.
arXiv Detail & Related papers (2020-10-06T13:03:49Z)