TextGuard: Provable Defense against Backdoor Attacks on Text Classification
- URL: http://arxiv.org/abs/2311.11225v2
- Date: Sat, 25 Nov 2023 02:59:46 GMT
- Title: TextGuard: Provable Defense against Backdoor Attacks on Text Classification
- Authors: Hengzhi Pei, Jinyuan Jia, Wenbo Guo, Bo Li, Dawn Song
- Abstract summary: We propose TextGuard, the first provable defense against backdoor attacks on text classification.
In particular, TextGuard divides the (backdoored) training data into sub-training sets, achieved by splitting each training sentence into sub-sentences.
In our evaluation, we demonstrate the effectiveness of TextGuard on three benchmark text classification tasks.
- Score: 83.94014844485291
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Backdoor attacks have become a major security threat for deploying machine
learning models in security-critical applications. Existing research endeavors
have proposed many defenses against backdoor attacks. Despite demonstrating
certain empirical defense efficacy, none of these techniques could provide a
formal and provable security guarantee against arbitrary attacks. As a result,
they can be easily broken by strong adaptive attacks, as shown in our
evaluation. In this work, we propose TextGuard, the first provable defense
against backdoor attacks on text classification. In particular, TextGuard first
divides the (backdoored) training data into sub-training sets, achieved by
splitting each training sentence into sub-sentences. This partitioning ensures
that a majority of the sub-training sets do not contain the backdoor trigger.
Subsequently, a base classifier is trained from each sub-training set, and
their ensemble provides the final prediction. We theoretically prove that when
the length of the backdoor trigger falls within a certain threshold, TextGuard
guarantees that its prediction will remain unaffected by the presence of the
triggers in training and testing inputs. In our evaluation, we demonstrate the
effectiveness of TextGuard on three benchmark text classification tasks,
surpassing the certification accuracy of existing certified defenses against
backdoor attacks. Furthermore, we propose additional strategies to enhance the
empirical performance of TextGuard. Comparisons with state-of-the-art empirical
defenses validate the superiority of TextGuard in countering multiple backdoor
attacks. Our code and data are available at
https://github.com/AI-secure/TextGuard.
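The partition-and-ensemble procedure described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' implementation; it assumes one plausible way to split sentences, namely assigning each word to one of m groups by a deterministic hash, so that every training sentence contributes one sub-sentence to each of the m sub-training sets, and the final label is the majority vote of the m base classifiers. The `train_classifier` name in the usage comment is a placeholder for any text-classification training routine.

```python
import hashlib
from collections import Counter

def word_group(word: str, m: int) -> int:
    """Assign a word to one of m groups via a deterministic hash."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % m

def split_sentence(sentence: str, m: int) -> list:
    """Split a sentence into m sub-sentences, one per word group."""
    groups = [[] for _ in range(m)]
    for word in sentence.split():
        groups[word_group(word, m)].append(word)
    return [" ".join(g) for g in groups]

def build_sub_training_sets(dataset, m: int) -> list:
    """dataset: iterable of (sentence, label) pairs; returns m sub-training sets."""
    sub_sets = [[] for _ in range(m)]
    for sentence, label in dataset:
        for i, sub in enumerate(split_sentence(sentence, m)):
            sub_sets[i].append((sub, label))
    return sub_sets

def ensemble_predict(classifiers, sentence: str, m: int):
    """Majority vote of the m base classifiers, each seeing only its own sub-sentence."""
    votes = Counter(
        clf(sub) for clf, sub in zip(classifiers, split_sentence(sentence, m))
    )
    return votes.most_common(1)[0][0]

# Usage sketch (train_classifier is a placeholder for any text classifier):
#   sub_sets = build_sub_training_sets(train_data, m=9)
#   classifiers = [train_classifier(s) for s in sub_sets]
#   label = ensemble_predict(classifiers, "a test sentence", m=9)
```

Under such a hash-based split, each trigger word falls into exactly one group, so a trigger of length t can influence at most t of the m sub-training sets and at most t of the m test-time votes; this is the intuition behind certifying that the prediction is unaffected when the trigger length stays below a threshold.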
Related papers
- Data Free Backdoor Attacks [83.10379074100453]
DFBA is a retraining-free and data-free backdoor attack that does not change the model architecture.
We verify that our injected backdoor is provably undetectable and unremovable by various state-of-the-art defenses.
Our evaluation on multiple datasets demonstrates that our injected backdoor: 1) incurs negligible classification loss, 2) achieves 100% attack success rates, and 3) bypasses six existing state-of-the-art defenses.
arXiv Detail & Related papers (2024-12-09T05:30:25Z)
- Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models.
In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned.
We propose a novel defense approach called PDB (Proactive Defensive Backdoor).
arXiv Detail & Related papers (2024-05-25T07:52:26Z)
- Rethinking Backdoor Attacks [122.1008188058615]
In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation.
Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them.
We show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data.
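For readers unfamiliar with the defense style this summary refers to, the following is a minimal sketch of outlier-based filtering, assuming per-example feature representations from some feature extractor and a simple median-absolute-deviation rule; it illustrates the general idea only, not this paper's contribution (the paper argues such filtering can fail without structural information about the data distribution).

```python
import numpy as np

def filter_outliers(features, labels, threshold=3.0):
    """Flag training examples whose feature vector is unusually far from its
    class centroid; a simple stand-in for robust-statistics filtering.

    features: (n, d) array of per-example representations
    labels:   (n,) array of class labels
    Returns a boolean mask of examples to KEEP.
    """
    keep = np.ones(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = features[idx].mean(axis=0)
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        # Robust z-score of distances using the median absolute deviation.
        mad = np.median(np.abs(dists - np.median(dists))) + 1e-12
        z = 0.6745 * (dists - np.median(dists)) / mad
        keep[idx[z > threshold]] = False
    return keep
```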
arXiv Detail & Related papers (2023-07-19T17:44:54Z)
- Detecting Backdoors in Deep Text Classifiers [43.36440869257781]
We present the first robust defense mechanism that generalizes to several backdoor attacks against text classification models.
Our technique is highly accurate at defending against state-of-the-art backdoor attacks, including data poisoning and weight poisoning.
arXiv Detail & Related papers (2022-10-11T07:48:03Z)
- MM-BD: Post-Training Detection of Backdoor Attacks with Arbitrary Backdoor Pattern Types Using a Maximum Margin Statistic [27.62279831135902]
We propose a post-training defense that detects backdoor attacks with arbitrary types of backdoor embeddings.
Our detector does not need any legitimate clean samples, and can efficiently detect backdoor attacks with arbitrary numbers of source classes.
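As a rough illustration of the idea, the sketch below estimates a per-class maximum margin by gradient ascent over an unconstrained continuous input and flags classes whose margin is anomalously large. It assumes a differentiable PyTorch classifier and substitutes a simple median-absolute-deviation rule for the paper's detection-inference procedure, so it should be read as a sketch rather than the MM-BD implementation.

```python
import torch

def max_margin_statistic(model, input_shape, target_class, steps=300, lr=0.05):
    """Estimate the largest achievable logit margin for one class by gradient
    ascent over an unconstrained continuous input (e.g., an embedding)."""
    x = torch.randn(1, *input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        logits = model(x)[0]
        others = torch.cat([logits[:target_class], logits[target_class + 1:]])
        margin = logits[target_class] - others.max()
        opt.zero_grad()
        (-margin).backward()  # ascend the margin
        opt.step()
    with torch.no_grad():
        logits = model(x)[0]
        others = torch.cat([logits[:target_class], logits[target_class + 1:]])
        return (logits[target_class] - others.max()).item()

def detect_backdoor_classes(model, input_shape, num_classes, z_threshold=3.5):
    """Flag classes whose maximum margin is an outlier under a MAD rule."""
    stats = torch.tensor([
        max_margin_statistic(model, input_shape, c) for c in range(num_classes)
    ])
    med = stats.median()
    mad = (stats - med).abs().median() + 1e-12
    z = 0.6745 * (stats - med) / mad
    return (z > z_threshold).nonzero(as_tuple=True)[0].tolist()
```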
arXiv Detail & Related papers (2022-05-13T21:32:24Z)
- Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger [48.59965356276387]
We propose to use syntactic structure as the trigger in textual backdoor attacks.
We conduct extensive experiments to demonstrate that the syntactic trigger-based attack method can achieve comparable attack performance.
These results also reveal the significant insidiousness and harmfulness of textual backdoor attacks.
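A hedged sketch of the data-poisoning side of such an attack is shown below. The syntactically controlled paraphraser is treated as a black-box callable (`rewrite_fn`, a hypothetical stand-in) because the actual paraphrase model is outside the scope of this snippet.

```python
import random

def poison_with_syntactic_trigger(dataset, rewrite_fn, target_label,
                                  poison_rate=0.1, seed=0):
    """Build a poisoned training set in which a fraction of sentences are
    rewritten into a fixed syntactic template (the trigger) and relabeled.

    dataset:    list of (sentence, label) pairs
    rewrite_fn: callable mapping a sentence to its paraphrase under the chosen
                syntactic template (e.g., a controlled paraphrase model)
    """
    rng = random.Random(seed)
    poisoned = []
    for sentence, label in dataset:
        if rng.random() < poison_rate:
            poisoned.append((rewrite_fn(sentence), target_label))
        else:
            poisoned.append((sentence, label))
    return poisoned
```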
arXiv Detail & Related papers (2021-05-26T08:54:19Z)
- Backdoor Attacks and Countermeasures on Deep Learning: A Comprehensive Review [40.36824357892676]
This work provides the community with a timely comprehensive review of backdoor attacks and countermeasures on deep learning.
Attack surfaces are recognized to be wide, categorized by the attacker's capability and the affected stage of the machine learning pipeline.
Countermeasures are categorized into four general classes: blind backdoor removal, offline backdoor inspection, online backdoor inspection, and post backdoor removal.
arXiv Detail & Related papers (2020-07-21T12:49:12Z)
- On Certifying Robustness against Backdoor Attacks via Randomized Smoothing [74.79764677396773]
We study the feasibility and effectiveness of certifying robustness against backdoor attacks using a recent technique called randomized smoothing.
Our results show the theoretical feasibility of using randomized smoothing to certify robustness against backdoor attacks.
However, existing randomized smoothing methods have limited effectiveness at defending against backdoor attacks.
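To make the setting concrete, here is a conceptual sketch of how randomized smoothing can be lifted from test inputs to the (possibly backdoored) training set: both the training data and the test input are randomized many times, a base classifier is trained on each randomized copy, and the majority vote is returned. It assumes binary feature vectors and uses scikit-learn's logistic regression as a stand-in base learner; it illustrates the idea rather than the paper's exact construction.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def smoothed_predict(X_train, y_train, x_test, flip_prob=0.1,
                     n_samples=100, seed=0):
    """Majority vote over classifiers trained on independently randomized
    copies of the (possibly backdoored) training data and test input.
    Assumes binary (0/1) feature vectors."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_samples):
        # Flip each bit of the training data and the test input w.p. flip_prob.
        noisy_X = np.abs(X_train - (rng.random(X_train.shape) < flip_prob))
        noisy_x = np.abs(x_test - (rng.random(x_test.shape) < flip_prob))
        clf = LogisticRegression(max_iter=200).fit(noisy_X, y_train)
        votes.append(int(clf.predict(noisy_x.reshape(1, -1))[0]))
    return Counter(votes).most_common(1)[0][0]
```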
arXiv Detail & Related papers (2020-02-26T19:15:46Z)