SEPP: Similarity Estimation of Predicted Probabilities for Defending and
Detecting Adversarial Text
- URL: http://arxiv.org/abs/2110.05748v2
- Date: Wed, 13 Oct 2021 02:17:45 GMT
- Title: SEPP: Similarity Estimation of Predicted Probabilities for Defending and
Detecting Adversarial Text
- Authors: Hoang-Quoc Nguyen-Son, Seira Hidano, Kazuhide Fukushima, Shinsaku
Kiyomoto
- Abstract summary: We propose an ensemble model based on similarity estimation of predicted probabilities (SEPP) that exploits the large gaps in predicted probabilities on misclassified texts.
We demonstrate the resilience of SEPP in defending against and detecting adversarial texts across different types of victim classifiers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A classifier processes input text in one of two ways: it misclassifies the text or classifies it correctly. Misclassified texts comprise both ordinary texts the classifier predicts incorrectly and adversarial texts generated specifically to fool it; the targeted classifier is called a victim. Both kinds of misclassified text are misunderstood by the victim but can still be recognized by other classifiers, which induces large gaps in predicted probabilities between the victim and those classifiers. In contrast, text correctly classified by the victim is usually also predicted correctly by the others, inducing only small gaps. In this paper, we propose an ensemble model based on similarity estimation of predicted probabilities (SEPP) that exploits the large gaps on misclassified texts, in contrast to the small gaps on correctly classified texts, and then corrects the incorrect predictions of the misclassified texts. We demonstrate the resilience of SEPP in defending against and detecting adversarial texts across different types of victim classifiers, classification tasks, and adversarial attacks.
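As a rough, minimal sketch of the gap-based idea in the abstract (not the authors' implementation), the Python snippet below compares a victim classifier's predicted probabilities with those of other classifiers and falls back to the ensemble when the gap is large. The predict_proba interface, the fixed gap_threshold, and the simple probability averaging are illustrative assumptions; SEPP itself estimates the similarity of predicted probabilities with a learned ensemble model rather than a hand-set cutoff.

    import numpy as np

    def predicted_prob_gap(victim_probs, other_probs_list):
        # Mean L1 distance between the victim's class distribution and the
        # distributions of the other ensemble members, for one input text.
        gaps = [np.abs(victim_probs - p).sum() for p in other_probs_list]
        return float(np.mean(gaps))

    def sepp_style_predict(text, victim, others, gap_threshold=0.5):
        # `victim` and each member of `others` are assumed to expose a
        # predict_proba(text) -> np.ndarray over the same label set.
        # `gap_threshold` is a hypothetical fixed value used only for
        # illustration; a real system would learn the decision boundary
        # on held-out clean and adversarial texts.
        v = victim.predict_proba(text)
        o = [clf.predict_proba(text) for clf in others]
        gap = predicted_prob_gap(v, o)

        suspicious = gap > gap_threshold  # large gap: likely misclassified or adversarial
        if suspicious:
            # Correct the prediction with the ensemble's averaged probabilities.
            corrected = np.mean(o, axis=0)
            return int(np.argmax(corrected)), suspicious, gap
        return int(np.argmax(v)), suspicious, gap

In this sketch, detection is simply thresholding the gap and defense is replacing a suspicious victim prediction with the ensemble average; the gap metric and threshold would need calibration in practice.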
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier either by concealing their writing style or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z) - Understanding and Mitigating Spurious Correlations in Text
Classification with Neighborhood Analysis [69.07674653828565]
Machine learning models have a tendency to leverage spurious correlations that exist in the training set but may not hold true in general circumstances.
In this paper, we examine the implications of spurious correlations through a novel perspective called neighborhood analysis.
We propose a family of regularization methods, NFL (doN't Forget your Language), to mitigate spurious correlations in text classification.
arXiv Detail & Related papers (2023-05-23T03:55:50Z) - Towards Fair Classification against Poisoning Attacks [52.57443558122475]
We study the poisoning scenario where the attacker can insert a small fraction of samples into training data.
We propose a general and theoretically guaranteed framework which accommodates traditional defense methods to fair classification against poisoning attacks.
arXiv Detail & Related papers (2022-10-18T00:49:58Z) - On the reversibility of adversarial attacks [41.94594666541757]
Adversarial attacks modify images with perturbations that change the prediction of classifiers.
We investigate the predictability of the mapping between the classes predicted for original images and for their corresponding adversarial examples.
We quantify reversibility as the accuracy in retrieving the original class or the true class of an adversarial example.
arXiv Detail & Related papers (2022-06-01T21:18:11Z) - Necessity and Sufficiency for Explaining Text Classifiers: A Case Study
in Hate Speech Detection [7.022948483613112]
We present a novel feature attribution method for explaining text classifiers, and analyze it in the context of hate speech detection.
We provide two complementary and theoretically-grounded scores -- necessity and sufficiency -- resulting in more informative explanations.
We employ our method to explain the predictions of different hate speech detection models on the same set of curated examples from a test suite, and show that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors.
arXiv Detail & Related papers (2022-05-06T15:34:48Z) - Learning to Separate Clusters of Adversarial Representations for Robust
Adversarial Detection [50.03939695025513]
We propose a new probabilistic adversarial detector motivated by recently introduced non-robust features.
In this paper, we treat non-robust features as a common property of adversarial examples, and we deduce that it is possible to find a cluster in representation space corresponding to this property.
This idea leads us to estimate the probability distribution of adversarial representations in a separate cluster and to leverage that distribution for a likelihood-based adversarial detector.
arXiv Detail & Related papers (2020-12-07T07:21:18Z) - ATRO: Adversarial Training with a Rejection Option [10.36668157679368]
This paper proposes a classification framework with a rejection option to mitigate the performance deterioration caused by adversarial examples.
By applying the adversarial training objective to a classifier and a rejection function simultaneously, the model can abstain from classification when it lacks sufficient confidence on a test data point.
arXiv Detail & Related papers (2020-10-24T14:05:03Z) - Identifying Spurious Correlations for Robust Text Classification [9.457737910527829]
We propose a method to distinguish spurious and genuine correlations in text classification.
We use features derived from treatment effect estimators to distinguish spurious correlations from "genuine" ones.
Experiments on four datasets suggest that using this approach to inform feature selection also leads to more robust classification.
arXiv Detail & Related papers (2020-10-06T03:49:22Z) - Classifier-independent Lower-Bounds for Adversarial Robustness [13.247278149124757]
We theoretically analyse the limits of robustness to test-time adversarial and noisy examples in classification.
We use optimal transport theory to derive variational formulae for the Bayes-optimal error a classifier can make on a given classification problem.
We derive explicit lower-bounds on the Bayes-optimal error in the case of the popular distance-based attacks.
arXiv Detail & Related papers (2020-06-17T16:46:39Z) - Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z) - Certified Robustness to Label-Flipping Attacks via Randomized Smoothing [105.91827623768724]
Machine learning algorithms are susceptible to data poisoning attacks.
We present a unifying view of randomized smoothing over arbitrary functions.
We propose a new strategy for building classifiers that are pointwise-certifiably robust to general data poisoning attacks.
arXiv Detail & Related papers (2020-02-07T21:28:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.