Detecting Textual Adversarial Examples Based on Distributional
Characteristics of Data Representations
- URL: http://arxiv.org/abs/2204.13853v1
- Date: Fri, 29 Apr 2022 02:32:02 GMT
- Title: Detecting Textual Adversarial Examples Based on Distributional
Characteristics of Data Representations
- Authors: Na Liu, Mark Dras, Wei Emma Zhang
- Abstract summary: adversarial examples are constructed by adding small non-random perturbations to correctly classified inputs.
Approaches to adversarial attacks in natural language tasks have boomed in the last five years using character-level, word-level, or phrase-level perturbations.
We propose two new reactive methods for NLP to fill this gap.
Adapted LID and MDRE obtain state-of-the-art results on character-level, word-level, and phrase-level attacks on the IMDB dataset.
- Score: 11.93653349589025
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Although deep neural networks have achieved state-of-the-art performance in
various machine learning tasks, adversarial examples, constructed by adding
small non-random perturbations to correctly classified inputs, successfully
fool highly expressive deep classifiers into incorrect predictions. Approaches
to adversarial attacks in natural language tasks have boomed in the last five
years using character-level, word-level, phrase-level, or sentence-level
textual perturbations. While there is some work in NLP on defending against
such attacks through proactive methods, like adversarial training, there is to
our knowledge no effective general reactive approach to defence via detection
of textual adversarial examples of the kind found in the image processing
literature. In this paper, we propose two new reactive methods for NLP to fill
this gap, which, unlike the few limited-application baselines from NLP, are based
entirely on distributional characteristics of learned representations: we adapt
one from the image processing literature (Local Intrinsic Dimensionality
(LID)), and propose a novel one (MultiDistance Representation Ensemble Method
(MDRE)). Adapted LID and MDRE obtain state-of-the-art results on
character-level, word-level, and phrase-level attacks on the IMDB dataset as
well as on the latter two with respect to the MultiNLI dataset. For future
research, we publish our code.
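As a rough illustration of the representation-based detection idea, the sketch below shows the standard maximum-likelihood estimator of Local Intrinsic Dimensionality computed from nearest-neighbour distances against a batch of reference representations. It is a minimal sketch, not the paper's released code; the neighbourhood size, the choice of layer, and the downstream detector are assumptions.

```python
import numpy as np

def lid_mle(queries, reference, k=20):
    """Maximum-likelihood estimate of Local Intrinsic Dimensionality (LID)
    for each query representation, measured against a batch of reference
    representations (e.g. hidden states of clean training sentences).

    queries:   (n, d) array of representations to score
    reference: (m, d) array of representations from normal data
    k:         number of nearest neighbours (hypothetical default)
    """
    # pairwise Euclidean distances between queries and the reference batch
    dists = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=-1)
    # distances to the k nearest reference points (assumes queries are not
    # themselves members of the reference batch)
    knn = np.sort(dists, axis=1)[:, :k]
    r_k = knn[:, -1:]  # distance to the k-th neighbour
    # LID_hat(x) = -( (1/k) * sum_i log(r_i / r_k) )^(-1)
    return -1.0 / np.mean(np.log(knn / r_k + 1e-12), axis=1)
```

In a detector of this kind, LID scores computed per layer for clean and adversarial examples would typically serve as features for a simple classifier (e.g. logistic regression) that flags adversarial inputs; MDRE, described in the abstract as an ensemble over representation distances, is not sketched here.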
Related papers
- SemRoDe: Macro Adversarial Training to Learn Representations That are Robust to Word-Level Attacks [29.942001958562567]
We propose a novel approach called Semantic Robust Defence (SemRoDe) to enhance the robustness of language models.
Our method learns a robust representation that bridges these two domains.
The results demonstrate promising state-of-the-art robustness.
arXiv Detail & Related papers (2024-03-27T10:24:25Z)
- TextDefense: Adversarial Text Detection based on Word Importance Entropy [38.632552667871295]
We propose TextDefense, a new adversarial example detection framework for NLP models.
Our experiments show that TextDefense can be applied to different architectures, datasets, and attack methods.
We provide our insights into the adversarial attacks in NLP and the principles of our defense method.
arXiv Detail & Related papers (2023-02-12T11:12:44Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the predicted label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Phrase-level Adversarial Example Generation for Neural Machine Translation [75.01476479100569]
We propose a phrase-level adversarial example generation (PAEG) method to enhance the robustness of neural machine translation models.
We verify our method on three benchmarks, including LDC Chinese-English, IWSLT14 German-English, and WMT14 English-German tasks.
arXiv Detail & Related papers (2022-01-06T11:00:49Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial Attack Framework [17.17479625646699]
We propose a unified framework to craft textual adversarial samples.
In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD).
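The name suggests projected gradient descent carried out in the continuous embedding space. The minimal sketch below shows one generic PGD step on token embeddings under an L-infinity budget, purely as an assumption about the flavour of the method; the step size and budget are hypothetical, and the decoding of perturbed embeddings back to discrete tokens, which a textual attack also needs, is not shown.

```python
import torch

def pgd_step(embeddings, grad, orig_embeddings, step_size=0.01, epsilon=0.1):
    """One projected-gradient-descent step on continuous token embeddings.

    embeddings:      (seq_len, dim) current perturbed embeddings
    grad:            gradient of the adversarial loss w.r.t. the embeddings
    orig_embeddings: embeddings of the original (unperturbed) tokens
    step_size, epsilon: hypothetical step size and L-infinity budget
    """
    # ascend the loss along the sign of the gradient
    perturbed = embeddings + step_size * grad.sign()
    # project back into the epsilon-ball around the original embeddings
    delta = torch.clamp(perturbed - orig_embeddings, -epsilon, epsilon)
    return orig_embeddings + delta
```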
arXiv Detail & Related papers (2021-10-28T17:31:51Z)
- Searching for an Effective Defender: Benchmarking Defense against Adversarial Word Substitution [83.84968082791444]
Deep neural networks are vulnerable to intentionally crafted adversarial examples.
Various methods have been proposed to defend against adversarial word-substitution attacks for neural NLP models.
arXiv Detail & Related papers (2021-08-29T08:11:36Z)
- Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defend against substitution-based attacks.
DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and augments the training data with these virtual sentences.
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
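A minimal sketch of the convex-hull sampling step described above, assuming Dirichlet-distributed mixture weights over a word and its synonyms; the concentration parameter and the surrounding training loop are assumptions, not the paper's released implementation.

```python
import numpy as np

def sample_virtual_embedding(word_vec, synonym_vecs, alpha=1.0, rng=None):
    """Sample one point from the convex hull spanned by a word's embedding
    and its synonyms' embeddings, using Dirichlet-distributed weights.

    word_vec:     (d,) embedding of the original word
    synonym_vecs: (s, d) embeddings of the word's synonyms
    alpha:        Dirichlet concentration (hypothetical default)
    """
    rng = rng or np.random.default_rng()
    vertices = np.vstack([word_vec[None, :], synonym_vecs])  # (s + 1, d)
    weights = rng.dirichlet(alpha * np.ones(len(vertices)))  # sums to 1
    return weights @ vertices  # convex combination = point inside the hull
```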
arXiv Detail & Related papers (2020-06-20T18:01:16Z)
- Defense of Word-level Adversarial Attacks via Random Substitution Encoding [0.5964792400314836]
Adversarial attacks against deep neural networks on computer vision tasks have spawned many new technologies that help protect models from making false predictions.
Recently, word-level adversarial attacks on deep models of Natural Language Processing (NLP) tasks have also demonstrated strong power, e.g., fooling a sentiment classification neural network to make wrong decisions.
We propose a novel framework called Random Substitution Encoding (RSE), which introduces a random substitution into the training process of original neural networks.
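As a loose sketch of substitution-based training augmentation of this kind: the synonym source, the substitution probability, and how RSE actually encodes the substitutions are assumptions rather than the paper's method.

```python
import random

def random_substitution(tokens, synonyms, sub_prob=0.25, rng=random):
    """Randomly replace tokens with one of their synonyms, a generic form of
    substitution-based augmentation applied to training sentences.

    tokens:   list of words in a training sentence
    synonyms: dict mapping a word to a list of candidate substitutes
    sub_prob: per-token substitution probability (hypothetical value)
    """
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok, [])
        if candidates and rng.random() < sub_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out
```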
arXiv Detail & Related papers (2020-05-01T15:28:43Z)
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than those on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
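The summary implies that BERT's masked-language-model head proposes context-aware substitutes for important tokens; the sketch below shows only that candidate-generation step, assuming the Hugging Face transformers API, and leaves out sub-word handling and the word-importance ranking the full attack would need.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mlm_candidates(words, position, top_k=8):
    """Propose context-aware substitutes for the word at `position` by
    masking it and reading the masked-LM's top predictions."""
    masked = list(words)
    masked[position] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits  # (1, seq_len, vocab_size)
    # locate the [MASK] token in the encoded sequence
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    top_ids = logits[0, mask_idx].topk(top_k).indices
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())
```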
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.