Elephant in the Room: An Evaluation Framework for Assessing Adversarial
Examples in NLP
- URL: http://arxiv.org/abs/2001.07820v3
- Date: Mon, 1 Jun 2020 04:16:43 GMT
- Title: Elephant in the Room: An Evaluation Framework for Assessing Adversarial
Examples in NLP
- Authors: Ying Xu, Xu Zhong, Antonio Jose Jimeno Yepes, Jey Han Lau
- Abstract summary: An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify.
We propose an evaluation framework consisting of automatic evaluation metrics and human evaluation guidelines.
- Score: 24.661335236627053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An adversarial example is an input transformed by small perturbations that
machine learning models consistently misclassify. While there are a number of
methods proposed to generate adversarial examples for text data, it is not
trivial to assess the quality of these adversarial examples, as minor
perturbations (such as changing a word in a sentence) can lead to a significant
shift in their meaning, readability and classification label. In this paper, we
propose an evaluation framework consisting of a set of automatic evaluation
metrics and human evaluation guidelines, to rigorously assess the quality of
adversarial examples based on the aforementioned properties. We experiment with
six benchmark attacking methods and find that some methods generate
adversarial examples with poor readability and content preservation. We also
find that multiple factors can influence attack performance, such as the
length of the text inputs and the architecture of the classifiers.
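The abstract names the properties to assess (meaning, readability, classification label) but the listing does not spell out the concrete metrics. As an illustration only, the sketch below shows how such automatic checks are commonly operationalised: an attack-success check, a content-preservation score, and a readability proxy. The `classify`, `sentence_embed`, and `readability_score` callables are hypothetical placeholders, not the paper's actual metrics.

```python
# Hypothetical sketch of automatic quality checks for adversarial examples.
# `classify`, `sentence_embed`, and `readability_score` are placeholder
# callables supplied by the user; they are NOT taken from the paper.
from dataclasses import dataclass
from typing import Callable, Sequence
import math


@dataclass
class AdversarialEvaluation:
    label_flipped: bool        # did the perturbation change the predicted label?
    content_similarity: float  # cosine similarity of sentence embeddings (content preservation)
    readability_delta: float   # change in a readability proxy for the perturbed text


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def evaluate_example(
    original: str,
    adversarial: str,
    classify: Callable[[str], int],
    sentence_embed: Callable[[str], Sequence[float]],
    readability_score: Callable[[str], float],
) -> AdversarialEvaluation:
    """Score one adversarial example on label flip, meaning preservation, and readability."""
    return AdversarialEvaluation(
        label_flipped=classify(original) != classify(adversarial),
        content_similarity=cosine(sentence_embed(original), sentence_embed(adversarial)),
        readability_delta=readability_score(adversarial) - readability_score(original),
    )
```

Under this framing, a high-quality attack flips the label while keeping content similarity high and readability largely unchanged; the paper's human evaluation guidelines cover the aspects such proxies miss.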
Related papers
- The Susceptibility of Example-Based Explainability Methods to Class Outliers [3.748789746936121]
This study explores the impact of class outliers on the effectiveness of example-based explainability methods for black-box machine learning models.
We reformulate existing explainability evaluation metrics, such as correctness and relevance, specifically for example-based methods, and introduce a new metric, distinguishability.
Using these metrics, we highlight the shortcomings of current example-based explainability methods, including those that attempt to suppress class outliers.
arXiv Detail & Related papers (2024-07-30T09:20:15Z)
- A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers [10.063169009242682]
We train an encoder-decoder paraphrase model to generate adversarial examples.
We adopt a reinforcement learning algorithm and propose a constraint-enforcing reward.
We show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
arXiv Detail & Related papers (2024-05-20T09:33:43Z)
- On Adversarial Examples for Text Classification by Perturbing Latent Representations [0.0]
We show that deep learning models for text classification are vulnerable to adversarial examples, indicating that they are not very robust.
We create a framework that measures the robustness of a text classifier by using the gradients of the classifier (a minimal sketch of this kind of gradient-based perturbation appears after the related papers list).
arXiv Detail & Related papers (2024-05-06T18:45:18Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes in the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Resolving label uncertainty with implicit posterior models [71.62113762278963]
We propose a method for jointly inferring labels across a collection of data samples.
By implicitly assuming the existence of a generative model for which a differentiable predictor is the posterior, we derive a training objective that allows learning under weak beliefs.
arXiv Detail & Related papers (2022-02-28T18:09:44Z)
- A Review of Adversarial Attack and Defense for Classification Methods [78.50824774203495]
This paper focuses on the generation of adversarial examples and on defenses against them.
The authors hope that this paper will encourage more statisticians to work in this important and exciting field of generating and defending against adversarial examples.
arXiv Detail & Related papers (2021-11-18T22:13:43Z)
- Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification [12.750016480098262]
We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text.
We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms.
arXiv Detail & Related papers (2021-09-09T16:16:04Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
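The entry above on perturbing latent representations mentions measuring a text classifier's robustness through its gradients. As a rough illustration only (not the cited paper's implementation), the sketch below applies an FGSM-style step to the classifier's embedding output, assuming a PyTorch model split into an embedding module and a classification head.

```python
# Minimal FGSM-style sketch: perturb a text classifier's latent (embedding)
# representations along the loss gradient. Assumes a PyTorch model split into
# an embedding module and a classification head; illustrative only.
import torch
import torch.nn.functional as F


def perturb_latent(embedding_layer, head, token_ids, label, epsilon=0.05):
    """Return perturbed embeddings and whether the predicted label changed."""
    embeds = embedding_layer(token_ids)            # (batch, seq_len, dim)
    embeds = embeds.detach().requires_grad_(True)  # treat embeddings as the input

    logits = head(embeds)                          # original prediction
    orig_pred = logits.argmax(dim=-1)

    loss = F.cross_entropy(logits, label)          # loss w.r.t. the true label
    loss.backward()                                # populates embeds.grad

    # Step in the direction that increases the loss (sign of the gradient).
    adv_embeds = embeds + epsilon * embeds.grad.sign()

    with torch.no_grad():
        adv_pred = head(adv_embeds).argmax(dim=-1)

    return adv_embeds, (adv_pred != orig_pred)
```

Whether the perturbed embeddings change the prediction, and how small an epsilon suffices to do so, gives a simple gradient-based robustness signal for the classifier.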
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.