Attacking interpretable NLP systems
- URL: http://arxiv.org/abs/2507.16164v1
- Date: Tue, 22 Jul 2025 02:20:00 GMT
- Title: Attacking interpretable NLP systems
- Authors: Eldor Abdukhamidov, Tamer Abuhmed, Joanna C. S. Santos, Mohammed Abuhamad,
- Abstract summary: This paper introduces AdvChar, a black-box attack on Interpretable Natural Language Processing Systems. We show that AdvChar can significantly reduce the prediction accuracy of current deep learning models by altering just two characters on average in input samples.
- Score: 1.5074441766044933
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Studies have shown that machine learning systems are vulnerable to adversarial examples in both theory and practice. While previous attacks have focused mainly on visual models, which exploit the gap between human and machine perception, text-based models have also fallen victim to these attacks. However, such attacks often fail to preserve the semantic meaning and similarity of the text. This paper introduces AdvChar, a black-box attack on Interpretable Natural Language Processing Systems that is designed to mislead the classifier while keeping the interpretation similar to that of benign inputs, thereby exploiting trust in system transparency. AdvChar achieves this by making subtle modifications to the text input, forcing the deep learning classifier to make incorrect predictions while preserving the original interpretation. We use an interpretation-focused scoring approach to determine the most critical tokens that, when changed, cause the classifier to misclassify the input. We apply simple character-level modifications to measure token importance, minimizing the difference between the original and modified text while generating adversarial interpretations similar to benign ones. We thoroughly evaluated AdvChar by testing it against seven NLP models and three interpretation models on benchmark classification datasets. Our experiments show that AdvChar can significantly reduce the prediction accuracy of current deep learning models by altering just two characters on average per input sample.
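The abstract describes two ingredients: a black-box scoring step that ranks tokens by how much they matter to the model's output, and a character-level edit step that perturbs only the highest-ranked tokens. The Python sketch below illustrates that general recipe under stated assumptions; the helper names (`score_tokens`, `perturb_char`, `advchar_like_attack`) and the greedy loop are hypothetical, the importance proxy here is a simple prediction-drop probe rather than the paper's interpretation-focused scoring, and the constraint that the adversarial interpretation stay close to the benign one is omitted for brevity.

```python
# Illustrative sketch of a character-level black-box attack guided by token
# importance, in the spirit of AdvChar. This is NOT the authors' reference
# implementation: the scoring proxy, edit operations, and greedy loop are
# assumptions made for illustration only.
import random
from typing import Callable, List

def score_tokens(text: str, predict: Callable[[str], float]) -> List[int]:
    """Rank tokens by how much deleting each one lowers the model's score
    for the originally predicted class (a simple black-box importance proxy)."""
    tokens = text.split()
    base = predict(text)
    drops = []
    for i in range(len(tokens)):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        drops.append(base - predict(reduced))
    # Highest drop first: these tokens matter most to the prediction.
    return sorted(range(len(tokens)), key=lambda i: drops[i], reverse=True)

def perturb_char(token: str, rng: random.Random) -> str:
    """Apply one subtle character-level edit (swap, delete, or insert)."""
    if len(token) < 2:
        return token + rng.choice("abcdefghijklmnopqrstuvwxyz")
    i = rng.randrange(len(token) - 1)
    op = rng.choice(["swap", "delete", "insert"])
    if op == "swap":
        return token[:i] + token[i + 1] + token[i] + token[i + 2:]
    if op == "delete":
        return token[:i] + token[i + 1:]
    return token[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + token[i:]

def advchar_like_attack(text: str,
                        predict: Callable[[str], float],
                        label_of: Callable[[str], int],
                        max_edits: int = 2,
                        seed: int = 0) -> str:
    """Greedily edit the most important tokens, one character at a time,
    until the predicted label flips or the edit budget is spent."""
    rng = random.Random(seed)
    tokens = text.split()
    original_label = label_of(text)
    for idx in score_tokens(text, predict)[:max_edits]:
        tokens[idx] = perturb_char(tokens[idx], rng)
        candidate = " ".join(tokens)
        if label_of(candidate) != original_label:
            return candidate  # misclassification achieved with few edits
    return " ".join(tokens)
```

A caller would supply its own `predict` (score of the originally predicted class) and `label_of` (argmax label) wrappers around the target model; both are assumed interfaces, not part of the paper.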
Related papers
- On Adversarial Examples for Text Classification by Perturbing Latent Representations [0.0]
We show that deep learning models for text classification are vulnerable to adversarial examples.
This weakness indicates that these models are not robust.
We create a framework that measures the robustness of a text classifier by using the gradients of the classifier.
arXiv Detail & Related papers (2024-05-06T18:45:18Z)
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style or imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Identifying Adversarial Attacks on Text Classifiers [32.958568467774704]
In this paper, we analyze adversarial text to determine which methods were used to create it.
Our first contribution is an extensive dataset for attack detection and labeling.
As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification.
arXiv Detail & Related papers (2022-01-21T06:16:04Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
LHLS significantly outperforms existing hard-label attacks in terms of both attack performance and the quality of the adversarial examples.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks.
We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering.
Our evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- A Differentiable Language Model Adversarial Attack on Text Classifiers [10.658675415759697]
We propose a new black-box sentence-level attack for natural language processing.
Our method fine-tunes a pre-trained language model to generate adversarial examples.
We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation.
arXiv Detail & Related papers (2021-07-23T14:43:13Z)
- BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks [10.290050493635343]
Adversarial attacks expose important blind spots of deep learning systems.
Character-level attacks typically insert typos into the input stream.
We show that an untrained iterative approach can perform on par with human crowd-workers supervised via 3-shot learning.
arXiv Detail & Related papers (2021-06-02T20:21:03Z)
- Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers [21.594361495948316]
A new line of work on improving model interpretability has just started, and many existing methods require either prior information or human annotations as additional inputs in training.
We propose the variational word mask (VMASK) method to automatically learn task-specific important words and reduce the influence of irrelevant information on classification, ultimately improving the interpretability of model predictions.
arXiv Detail & Related papers (2020-10-01T20:02:43Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)