Certified Robustness to Text Adversarial Attacks by Randomized [MASK]
- URL: http://arxiv.org/abs/2105.03743v1
- Date: Sat, 8 May 2021 16:59:10 GMT
- Title: Certified Robustness to Text Adversarial Attacks by Randomized [MASK]
- Authors: Jiehang Zeng, Xiaoqing Zheng, Jianhan Xu, Linyang Li, Liping Yuan and Xuanjing Huang
- Abstract summary: We propose a certifiably robust defense method by randomly masking a certain proportion of the words in an input text.
The proposed method can defend against not only word substitution-based attacks, but also character-level perturbations.
We can certify the classifications of over 50% of texts to be robust to any perturbation of 5 words on the AGNEWS dataset, and of 2 words on the SST2 dataset.
- Score: 39.07743913719665
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, a few certified defense methods have been developed to provably
guarantee the robustness of a text classifier to adversarial synonym
substitutions. However, all existing certified defense methods assume that the
defenders are informed of how the adversaries generate synonyms, which is not a
realistic scenario. In this paper, we propose a certifiably robust defense
method by randomly masking a certain proportion of the words in an input text,
in which the above unrealistic assumption is no longer necessary. The proposed
method can defend against not only word substitution-based attacks, but also
character-level perturbations. We can certify the classifications of over 50%
of texts to be robust to any perturbation of 5 words on the AGNEWS dataset, and
of 2 words on the SST2 dataset. The experimental results show that our randomized smoothing
method significantly outperforms recently proposed defense methods across
multiple datasets.
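The masking-and-voting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and it does not compute a robustness certificate: it only shows random word masking followed by a majority vote over masked copies. The function names, the `mask_rate` and `n_samples` defaults, and the keyword-based `toy_classifier` are all hypothetical choices made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List

MASK = "[MASK]"


def mask_words(words: List[str], mask_rate: float, rng: random.Random) -> List[str]:
    """Replace a fixed proportion of randomly chosen words with the mask token."""
    n_mask = min(len(words), round(mask_rate * len(words)))
    positions = set(rng.sample(range(len(words)), n_mask))
    return [MASK if i in positions else w for i, w in enumerate(words)]


def smoothed_predict(
    text: str,
    base_classifier: Callable[[List[str]], str],
    mask_rate: float = 0.3,   # assumed value, not taken from the paper
    n_samples: int = 100,     # assumed value, not taken from the paper
    seed: int = 0,
) -> str:
    """Classify many randomly masked copies of the text and return the majority label."""
    rng = random.Random(seed)
    words = text.split()
    votes = Counter(
        base_classifier(mask_words(words, mask_rate, rng)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


def toy_classifier(words: List[str]) -> str:
    """Hypothetical stand-in for a trained model: a simple keyword lookup."""
    positive = {"good", "great", "excellent"}
    return "positive" if any(w.lower() in positive for w in words) else "negative"
```

In practice the base classifier would be a model trained on masked inputs; the intuition is that a perturbed word only influences the copies in which it is left unmasked, which is what makes a vote over masked copies harder to flip.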
Related papers
- MaskPure: Improving Defense Against Text Adversaries with Stochastic Purification [7.136205674624813]
In computer vision settings, the noising and de-noising process has proven useful for purifying input images.
Some initial work has explored the use of random noising and de-noising to mitigate adversarial attacks in an NLP setting.
We extend methods of input text purification that are inspired by diffusion processes.
Our novel method, MaskPure, matches or exceeds the robustness of other contemporary defenses.
arXiv Detail & Related papers (2024-06-18T21:27:13Z)
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
- Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks [39.51297217854375]
We propose Text-CRS, a certified robustness framework for natural language processing (NLP) based on randomized smoothing.
We show that Text-CRS can address all four different word-level adversarial operations and achieve a significant accuracy improvement.
We also provide the first benchmark on certified accuracy and radius of four word-level operations, besides outperforming the state-of-the-art certification against synonym substitution attacks.
arXiv Detail & Related papers (2023-07-31T13:08:16Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification [6.781100829062443]
Adversarial attacks pose a major challenge for neural network models in NLP, precluding their deployment in safety-critical applications.
Previous detection methods are incapable of giving correct predictions on adversarial sentences.
We propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not.
arXiv Detail & Related papers (2023-02-03T22:58:07Z)
- WeDef: Weakly Supervised Backdoor Defense for Text Classification [48.19967241668793]
Existing backdoor defense methods are only effective for limited trigger types.
We propose a novel weakly supervised backdoor defense framework WeDef.
We show that WeDef is effective against popular trigger-based attacks.
arXiv Detail & Related papers (2022-05-24T05:53:11Z)
- Don't sweat the small stuff, classify the rest: Sample Shielding to protect text classifiers against adversarial attacks [2.512827436728378]
Deep learning (DL) is being used extensively for text classification.
Attackers modify the text in a way that misleads the classifier while keeping the original meaning nearly intact.
We propose a novel and intuitive defense strategy called Sample Shielding.
arXiv Detail & Related papers (2022-05-03T18:24:20Z)
- Randomized Substitution and Vote for Textual Adversarial Example Detection [6.664295299367366]
A line of work has shown that natural text processing models are vulnerable to adversarial examples.
We propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V).
Empirical evaluations on three benchmark datasets demonstrate that RS&V could detect the textual adversarial examples more successfully than the existing detection methods.
arXiv Detail & Related papers (2021-09-13T04:17:58Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success rates and semantics rates by changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defend against substitution-based attacks.
DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and it augments the training data with them.
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
arXiv Detail & Related papers (2020-06-20T18:01:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.