FireBERT: Hardening BERT-based classifiers against adversarial attack
- URL: http://arxiv.org/abs/2008.04203v1
- Date: Mon, 10 Aug 2020 15:43:28 GMT
- Title: FireBERT: Hardening BERT-based classifiers against adversarial attack
- Authors: Gunnar Mein, Kevin Hartman, Andrew Morris
- Abstract summary: FireBERT is a set of three proof-of-concept NLP classifiers hardened against TextFooler-style word-perturbation.
We present co-tuning with a synthetic data generator as a highly effective method to protect against 95% of pre-manufactured adversarial samples.
We show that it is possible to improve the accuracy of BERT-based models in the face of adversarial attacks without significantly reducing the accuracy for regular benchmark samples.
- Score: 0.5156484100374058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present FireBERT, a set of three proof-of-concept NLP classifiers hardened
against TextFooler-style word-perturbation by producing diverse alternatives to
original samples. In one approach, we co-tune BERT against the training data
and synthetic adversarial samples. In a second approach, we generate the
synthetic samples at evaluation time through substitution of words and
perturbation of embedding vectors. The diversified evaluation results are then
combined by voting. A third approach replaces evaluation-time word substitution
with perturbation of embedding vectors. We evaluate FireBERT on the MNLI and
IMDB Movie Review datasets, both on the original data and on adversarial examples generated by
TextFooler. We also test whether TextFooler is less successful in creating new
adversarial samples when manipulating FireBERT, compared to working on
unhardened classifiers. We show that it is possible to improve the accuracy of
BERT-based models in the face of adversarial attacks without significantly
reducing the accuracy for regular benchmark samples. We present co-tuning with
a synthetic data generator as a highly effective method to protect against 95%
of pre-manufactured adversarial samples while maintaining 98% of original
benchmark performance. We also demonstrate evaluation-time perturbation as a
promising direction for further research, restoring accuracy to up to 75% of
benchmark performance on pre-made adversarial samples, and to up to 65% (from a
baseline of 75% on original samples / 12% under attack) when under active attack by TextFooler.
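The embedding-perturbation variant of the evaluation-time approach lends itself to a short illustration. The sketch below is a minimal, hypothetical rendering of the idea rather than FireBERT's actual implementation: it perturbs a sample's input embeddings with Gaussian noise several times, classifies each noisy copy with an off-the-shelf BERT sequence classifier, and combines the diversified predictions by majority vote. The model name, noise scale, and number of copies are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of evaluation-time embedding perturbation with voting,
# assuming a generic HuggingFace BERT sequence classifier (not the paper's code).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "textattack/bert-base-uncased-imdb"  # assumed stand-in classifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def vote_predict(text: str, n_copies: int = 8, noise_std: float = 0.02) -> int:
    """Classify `text` by majority vote over noise-perturbed embedding copies."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    votes = []
    with torch.no_grad():
        # Look up the token embeddings once, then perturb the vectors directly.
        embeddings = model.get_input_embeddings()(inputs["input_ids"])
        for _ in range(n_copies):
            noisy = embeddings + noise_std * torch.randn_like(embeddings)
            logits = model(inputs_embeds=noisy,
                           attention_mask=inputs["attention_mask"]).logits
            votes.append(int(logits.argmax(dim=-1)))
    # Majority vote over the diversified predictions.
    return max(set(votes), key=votes.count)

print(vote_predict("A surprisingly warm and well-acted film."))
```

Co-tuning, by contrast, applies the same kind of diversification at training time: synthetic adversarial samples are mixed into the fine-tuning data so the classifier sees perturbed variants alongside the original training samples.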
Related papers
- Decoupled Prototype Learning for Reliable Test-Time Adaptation [50.779896759106784]
Test-time adaptation (TTA) is a task that continually adapts a pre-trained source model to the target domain during inference.
One popular approach involves fine-tuning model with cross-entropy loss according to estimated pseudo-labels.
This study reveals that minimizing the classification error of each sample causes the cross-entropy loss's vulnerability to label noise.
We propose a novel Decoupled Prototype Learning (DPL) method that features prototype-centric loss computation.
arXiv Detail & Related papers (2024-01-15T03:33:39Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Robust Textual Embedding against Word-level Adversarial Attacks [15.235449552083043]
We propose a novel robust training method, termed Fast Triplet Metric Learning (FTML).
We show that FTML can significantly improve model robustness against various advanced adversarial attacks.
Our work shows the great potential of improving textual robustness through robust word embeddings.
arXiv Detail & Related papers (2022-02-28T14:25:00Z)
- Improving Gradient-based Adversarial Training for Text Classification by Contrastive Learning and Auto-Encoder [18.375585982984845]
We focus on enhancing the model's ability to defend against gradient-based adversarial attacks during the training process.
We propose two novel adversarial training approaches: CARL and RAR.
Experiments show that the proposed two approaches outperform strong baselines on various text classification datasets.
arXiv Detail & Related papers (2021-09-14T09:08:58Z)
- Using BERT Encoding to Tackle the Mad-lib Attack in SMS Spam Detection [0.0]
We investigate whether language models sensitive to the semantics and context of words, such as Google's BERT, may be useful to overcome this adversarial attack.
Using a dataset of 5572 SMS spam messages, we first established a baseline of detection performance.
Then, we built a thesaurus of the vocabulary contained in these messages, and set up a Mad-lib attack experiment.
We found that the classic models achieved a 94% Balanced Accuracy (BA) in the original dataset, whereas the BERT model obtained 96%.
arXiv Detail & Related papers (2021-07-13T21:17:57Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
- Exploiting Sample Uncertainty for Domain Adaptive Person Re-Identification [137.9939571408506]
We estimate and exploit the credibility of the assigned pseudo-label of each sample to alleviate the influence of noisy labels.
Our uncertainty-guided optimization brings significant improvement and achieves the state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2020-12-16T04:09:04Z)
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
- Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples [16.460051008283887]
We show that adversarial attacks against CNN, LSTM, and Transformer-based classification models rely on word substitutions.
We propose frequency-guided word substitutions (FGWS) for the detection of adversarial examples.
FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets.
arXiv Detail & Related papers (2020-04-13T12:11:36Z)
- Self-Adversarial Learning with Comparative Discrimination for Text Generation [111.18614166615968]
We propose a novel self-adversarial learning (SAL) paradigm for improving GANs' performance in text generation.
During training, SAL rewards the generator when its currently generated sentence is found to be better than its previously generated samples.
Experiments on text generation benchmark datasets show that our proposed approach substantially improves both the quality and the diversity of the generated text.
arXiv Detail & Related papers (2020-01-31T07:50:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.