Differentiable Language Model Adversarial Attacks on Categorical
Sequence Classifiers
- URL: http://arxiv.org/abs/2006.11078v1
- Date: Fri, 19 Jun 2020 11:25:36 GMT
- Title: Differentiable Language Model Adversarial Attacks on Categorical
Sequence Classifiers
- Authors: I. Fursov, A. Zaytsev, N. Kluchnikov, A. Kravchenko, E. Burnaev
- Abstract summary: An adversarial attack paradigm explores various scenarios for the vulnerability of deep learning models.
We fine-tune a language model to serve as a generator of adversarial examples.
Our model works for diverse datasets on bank transactions, electronic health records, and NLP datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An adversarial attack paradigm explores various scenarios for the
vulnerability of deep learning models: minor changes of the input can force a
model failure. Most state-of-the-art frameworks focus on adversarial
attacks for images and other structured model inputs, but not for categorical
sequence models.
Successful attacks on classifiers of categorical sequences are challenging
because the model input consists of tokens from finite sets, so a classifier
score is non-differentiable with respect to inputs, and gradient-based attacks
are not applicable. Common approaches deal with this problem by working at the
token level, but the resulting discrete optimization problem requires a lot of
resources to solve.
We instead fine-tune a language model to serve as a generator of adversarial
examples. To optimize the model, we define a differentiable loss function that
depends on a surrogate classifier score and on a deep learning model that
evaluates approximate edit distance. In this way, we control both how
adversarial a generated sequence is and its similarity to the initial sequence.
As a result, we obtain semantically better samples. Moreover, they are
resistant to adversarial training and adversarial detectors. Our model works
for diverse datasets on bank transactions, electronic health records, and NLP
datasets.
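The trade-off described in the abstract, between fooling the classifier and staying close to the original sequence, can be illustrated with a small sketch. It is not the authors' implementation: in the described approach both terms come from neural networks (a surrogate classifier and a learned approximation of edit distance) so that gradients can flow to the generator, whereas the code below uses plain numbers and the exact Levenshtein distance only to show how such an objective behaves. The function names, the squared-distance penalty, the `beta` weight, and the `target_distance` parameter are assumptions made for illustration.

```python
import math


def edit_distance(a, b):
    """Exact Levenshtein distance between two token sequences.

    This is the quantity a learned model would approximate; it is
    computed here with the standard dynamic-programming recurrence.
    """
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def attack_loss(p_true, distance, beta=1.0, target_distance=1.0, eps=1e-12):
    """Hypothetical two-term objective (an assumed form, for illustration).

    p_true   -- surrogate classifier's probability for the true label
                on the generated sequence (lower means more adversarial)
    distance -- edit distance between generated and original sequence
    """
    # Penalise generated sequences whose distance from the original
    # differs from the desired number of edits.
    similarity_term = beta * (distance - target_distance) ** 2
    # Reward sequences on which the classifier loses confidence
    # in the true label.
    adversarial_term = -math.log(max(1.0 - p_true, eps))
    return similarity_term + adversarial_term


original = ["coffee", "grocery", "fuel", "grocery"]
candidate = ["coffee", "grocery", "pharmacy", "grocery"]
d = edit_distance(original, candidate)
loss = attack_loss(p_true=0.2, distance=d)
```

With this assumed form, the loss falls as the classifier's probability for the true label drops, and it rises as the number of edits moves away from `target_distance`. The `beta` weight sets how strongly similarity is enforced relative to the adversarial term.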
Related papers
- MOREL: Enhancing Adversarial Robustness through Multi-Objective Representation Learning [1.534667887016089]
Deep neural networks (DNNs) are vulnerable to slight adversarial perturbations.
We show that strong feature representation learning during training can significantly enhance the original model's robustness.
We propose MOREL, a multi-objective feature representation learning approach, encouraging classification models to produce similar features for inputs within the same class, despite perturbations.
arXiv Detail & Related papers (2024-10-02T16:05:03Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs which circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
- On Evaluating the Adversarial Robustness of Semantic Segmentation Models [0.0]
A number of adversarial training approaches have been proposed as a defense against adversarial perturbation.
We show for the first time that a number of models in previous work that are claimed to be robust are in fact not robust at all.
We then evaluate simple adversarial training algorithms that produce reasonably robust models even under our set of strong attacks.
arXiv Detail & Related papers (2023-06-25T11:45:08Z)
- NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries? [61.58261351116679]
We introduce a two-stage adversarial example generation framework (NaturalAdversaries) for natural language understanding tasks.
It is adaptable to both black-box and white-box adversarial attacks based on the level of access to the model parameters.
Our results indicate these adversaries generalize across domains, and offer insights for future research on improving robustness of neural text classification models.
arXiv Detail & Related papers (2022-11-08T16:37:34Z)
- A Differentiable Language Model Adversarial Attack on Text Classifiers [10.658675415759697]
We propose a new black-box sentence-level attack for natural language processing.
Our method fine-tunes a pre-trained language model to generate adversarial examples.
We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation.
arXiv Detail & Related papers (2021-07-23T14:43:13Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- ATRO: Adversarial Training with a Rejection Option [10.36668157679368]
This paper proposes a classification framework with a rejection option to mitigate the performance deterioration caused by adversarial examples.
By applying the adversarial training objective to both a classifier and a rejection function simultaneously, we can choose to abstain from classification when the classifier has insufficient confidence to classify a test data point.
arXiv Detail & Related papers (2020-10-24T14:05:03Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
- Gradient-based adversarial attacks on categorical sequence models via traversing an embedded world [11.711134497239332]
We consider adversarial attacks on deep learning models with categorical sequences.
We handle these challenges using two black-box adversarial attacks.
Results for money transactions, medical fraud, and NLP datasets suggest that proposed methods generate reasonable adversarial sequences.
arXiv Detail & Related papers (2020-03-09T14:31:36Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.