Elephant in the Room: An Evaluation Framework for Assessing Adversarial
Examples in NLP
- URL: http://arxiv.org/abs/2001.07820v3
- Date: Mon, 1 Jun 2020 04:16:43 GMT
- Title: Elephant in the Room: An Evaluation Framework for Assessing Adversarial
Examples in NLP
- Authors: Ying Xu, Xu Zhong, Antonio Jose Jimeno Yepes, Jey Han Lau
- Abstract summary: An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify.
We propose an evaluation framework consisting of automatic evaluation metrics and human evaluation guidelines.
- Score: 24.661335236627053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An adversarial example is an input transformed by small perturbations that
machine learning models consistently misclassify. While there are a number of
methods proposed to generate adversarial examples for text data, it is not
trivial to assess the quality of these adversarial examples, as minor
perturbations (such as changing a word in a sentence) can lead to a significant
shift in their meaning, readability and classification label. In this paper, we
propose an evaluation framework consisting of a set of automatic evaluation
metrics and human evaluation guidelines, to rigorously assess the quality of
adversarial examples based on the aforementioned properties. We experiment with
six benchmark attacking methods and find that some methods generate
adversarial examples with poor readability and content preservation. We also
find that multiple factors can influence attack performance, such as the
length of the text inputs and the architecture of the classifiers.
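The abstract names the properties to assess (meaning, readability, classification label) but the listing does not spell out the concrete metrics. As an illustration only, the sketch below shows how such automatic checks are commonly operationalised: an attack-success check, a content-preservation score, and a readability proxy. The `classify`, `sentence_embed`, and `readability_score` callables are hypothetical placeholders, not the paper's actual metrics.

```python
# Hypothetical sketch of automatic quality checks for adversarial examples.
# `classify`, `sentence_embed`, and `readability_score` are placeholder
# callables supplied by the user; they are NOT taken from the paper.
from dataclasses import dataclass
from typing import Callable, Sequence
import math


@dataclass
class AdversarialEvaluation:
    label_flipped: bool        # did the perturbation change the predicted label?
    content_similarity: float  # cosine similarity of sentence embeddings (content preservation)
    readability_delta: float   # change in a readability proxy for the perturbed text


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def evaluate_example(
    original: str,
    adversarial: str,
    classify: Callable[[str], int],
    sentence_embed: Callable[[str], Sequence[float]],
    readability_score: Callable[[str], float],
) -> AdversarialEvaluation:
    """Score one adversarial example on label flip, meaning preservation, and readability."""
    return AdversarialEvaluation(
        label_flipped=classify(original) != classify(adversarial),
        content_similarity=cosine(sentence_embed(original), sentence_embed(adversarial)),
        readability_delta=readability_score(adversarial) - readability_score(original),
    )
```

Under this framing, a high-quality attack flips the label while keeping content similarity high and readability largely unchanged; the paper's human evaluation guidelines cover the aspects such proxies miss.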
Related papers
- The Susceptibility of Example-Based Explainability Methods to Class Outliers [3.748789746936121]
This study explores the impact of class outliers on the effectiveness of example-based explainability methods for black-box machine learning models.
We reformulate existing explainability evaluation metrics, such as correctness and relevance, specifically for example-based methods, and introduce a new metric, distinguishability.
Using these metrics, we highlight the shortcomings of current example-based explainability methods, including those that attempt to suppress class outliers.
arXiv Detail & Related papers (2024-07-30T09:20:15Z)
- A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers [10.063169009242682]
We train an encoder-decoder paraphrase model to generate adversarial examples.
We adopt a reinforcement learning algorithm and propose a constraint-enforcing reward.
We show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
arXiv Detail & Related papers (2024-05-20T09:33:43Z)
- On Adversarial Examples for Text Classification by Perturbing Latent Representations [0.0]
We show that deep learning models for text classification are vulnerable to adversarial examples, indicating that they are not very robust.
We create a framework that measures the robustness of a text classifier by using the gradients of the classifier (a minimal sketch of this kind of gradient-based perturbation appears after the related papers list).
arXiv Detail & Related papers (2024-05-06T18:45:18Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes in the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Resolving label uncertainty with implicit posterior models [71.62113762278963]
We propose a method for jointly inferring labels across a collection of data samples.
By implicitly assuming the existence of a generative model for which a differentiable predictor is the posterior, we derive a training objective that allows learning under weak beliefs.
arXiv Detail & Related papers (2022-02-28T18:09:44Z)
- A Review of Adversarial Attack and Defense for Classification Methods [78.50824774203495]
This paper focuses on the generation of adversarial examples and on defenses against them.
The authors hope that this paper will encourage more statisticians to work in this important and exciting field of generating and defending against adversarial examples.
arXiv Detail & Related papers (2021-11-18T22:13:43Z)
- Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification [12.750016480098262]
We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text.
We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms.
arXiv Detail & Related papers (2021-09-09T16:16:04Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
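The entry above on perturbing latent representations mentions measuring a text classifier's robustness through its gradients. As a rough illustration only (not the cited paper's implementation), the sketch below applies an FGSM-style step to the classifier's embedding output, assuming a PyTorch model split into an embedding module and a classification head.

```python
# Minimal FGSM-style sketch: perturb a text classifier's latent (embedding)
# representations along the loss gradient. Assumes a PyTorch model split into
# an embedding module and a classification head; illustrative only.
import torch
import torch.nn.functional as F


def perturb_latent(embedding_layer, head, token_ids, label, epsilon=0.05):
    """Return perturbed embeddings and whether the predicted label changed."""
    embeds = embedding_layer(token_ids)            # (batch, seq_len, dim)
    embeds = embeds.detach().requires_grad_(True)  # treat embeddings as the input

    logits = head(embeds)                          # original prediction
    orig_pred = logits.argmax(dim=-1)

    loss = F.cross_entropy(logits, label)          # loss w.r.t. the true label
    loss.backward()                                # populates embeds.grad

    # Step in the direction that increases the loss (sign of the gradient).
    adv_embeds = embeds + epsilon * embeds.grad.sign()

    with torch.no_grad():
        adv_pred = head(adv_embeds).argmax(dim=-1)

    return adv_embeds, (adv_pred != orig_pred)
```

Whether the perturbed embeddings change the prediction, and how small an epsilon suffices to do so, gives a simple gradient-based robustness signal for the classifier.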
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.