Generating Valid and Natural Adversarial Examples with Large Language
Models
- URL: http://arxiv.org/abs/2311.11861v1
- Date: Mon, 20 Nov 2023 15:57:04 GMT
- Title: Generating Valid and Natural Adversarial Examples with Large Language
Models
- Authors: Zimu Wang, Wei Wang, Qi Chen, Qiufeng Wang, Anh Nguyen
- Abstract summary: The adversarial examples generated by many mainstream word-level adversarial attack models are neither valid nor natural, compromising semantic preservation, grammaticality, and human imperceptibility.
We propose LLM-Attack, which aims at generating both valid and natural adversarial examples with large language models.
Experimental results on the Movie Review (MR), IMDB, and Yelp Review Polarity datasets against the baseline adversarial attack models illustrate the effectiveness of LLM-Attack.
- Score: 18.944937459278197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based natural language processing (NLP) models, particularly
pre-trained language models (PLMs), have been revealed to be vulnerable to
adversarial attacks. However, the adversarial examples generated by many
mainstream word-level adversarial attack models are neither valid nor natural,
leading to the loss of semantic maintenance, grammaticality, and human
imperceptibility. Based on the exceptional capacity of language understanding
and generation of large language models (LLMs), we propose LLM-Attack, which
aims at generating both valid and natural adversarial examples with LLMs. The
method consists of two stages: word importance ranking (which searches for the
most vulnerable words) and word synonym replacement (which substitutes them
with their synonyms obtained from LLMs). Experimental results on the Movie
Review (MR), IMDB, and Yelp Review Polarity datasets against the baseline
adversarial attack models illustrate the effectiveness of LLM-Attack, and it
outperforms the baselines in human and GPT-4 evaluation by a significant
margin. The model can generate adversarial examples that are typically valid
and natural, with the preservation of semantic meaning, grammaticality, and
human imperceptibility.
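The two-stage procedure described in the abstract (word importance ranking followed by LLM-guided synonym replacement) can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: the victim classifier `predict_proba` and the synonym source `llm_synonyms` are assumed to be callables supplied by the caller, and the leave-one-out scoring and greedy substitution budget are common simplifications rather than the paper's exact algorithm.

```python
# Minimal sketch of a two-stage word-level attack in the spirit of LLM-Attack.
# `predict_proba(text) -> list[float]` (victim model) and
# `llm_synonyms(word) -> list[str]` (LLM-provided synonyms) are assumed callables.

def rank_word_importance(words, label, predict_proba):
    """Stage 1: score each word by the drop in the true-label probability
    when that word is removed (a simple leave-one-out heuristic)."""
    base = predict_proba(" ".join(words))[label]
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((base - predict_proba(reduced)[label], i))
    return [i for _, i in sorted(scores, reverse=True)]

def llm_attack(text, label, predict_proba, llm_synonyms, max_words=5):
    """Stage 2: greedily replace the most important words with
    LLM-proposed synonyms until the victim model's prediction flips."""
    words = text.split()
    for rank, idx in enumerate(rank_word_importance(words, label, predict_proba)):
        if rank >= max_words:          # budget on the number of words considered
            break
        current_prob = predict_proba(" ".join(words))[label]
        best = None
        for candidate in llm_synonyms(words[idx]):
            trial = words.copy()
            trial[idx] = candidate
            probs = predict_proba(" ".join(trial))
            if max(range(len(probs)), key=probs.__getitem__) != label:
                return " ".join(trial)  # prediction flipped: adversarial example found
            if probs[label] < current_prob:
                current_prob, best = probs[label], trial
        if best is not None:
            words = best                # keep the substitution that hurt the model most
    return None                         # attack failed within the budget
```

In practice, a validity and naturalness filter (e.g. rejecting candidates that change the sentence meaning or break grammar) would sit between synonym generation and substitution; the paper's human and GPT-4 evaluations target exactly those properties.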
Related papers
- Enhancing adversarial robustness in Natural Language Inference using explanations [41.46494686136601]
We cast the spotlight on the underexplored task of Natural Language Inference (NLI).
We validate the usage of natural language explanation as a model-agnostic defence strategy through extensive experimentation.
We research the correlation of widely used language generation metrics with human perception, in order for them to serve as a proxy towards robust NLI models.
arXiv Detail & Related papers (2024-09-11T17:09:49Z) - Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models [47.545382591646565]
Large Language Models (LLMs) have excelled at language understanding and generating human-level text.
LLMs are susceptible to adversarial attacks where malicious users prompt the model to generate undesirable text.
In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs.
arXiv Detail & Related papers (2024-08-07T17:11:34Z) - Unraveling the Dominance of Large Language Models Over Transformer Models for Bangla Natural Language Inference: A Comprehensive Study [0.0]
Natural Language Inference (NLI) is a cornerstone of Natural Language Processing (NLP).
This study addresses the underexplored area of evaluating Large Language Models (LLMs) in low-resourced languages like Bengali.
arXiv Detail & Related papers (2024-05-05T13:57:05Z) - Improving Language Models Meaning Understanding and Consistency by
Learning Conceptual Roles from Dictionary [65.268245109828]
Non-human-like behaviour of contemporary pre-trained language models (PLMs) is a major factor undermining their trustworthiness.
A striking phenomenon is the generation of inconsistent predictions, which produces contradictory results.
We propose a practical approach that alleviates the inconsistent behaviour issue by improving PLM awareness.
arXiv Detail & Related papers (2023-10-24T06:15:15Z) - Context-aware Adversarial Attack on Named Entity Recognition [15.049160192547909]
We study context-aware adversarial attack methods to examine the model's robustness.
Specifically, we propose perturbing the most informative words for recognizing entities to create adversarial examples.
Experiments and analyses show that our methods are more effective in deceiving the model into making wrong predictions than strong baselines.
arXiv Detail & Related papers (2023-09-16T14:04:23Z) - Language models are not naysayers: An analysis of language models on
negation benchmarks [58.32362243122714]
We evaluate the ability of current-generation auto-regressive language models to handle negation.
We show that LLMs have several limitations including insensitivity to the presence of negation, an inability to capture the lexical semantics of negation, and a failure to reason under negation.
arXiv Detail & Related papers (2023-06-14T01:16:37Z) - How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial
Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms the state-of-the-arts on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - Contextualized Perturbation for Textual Adversarial Attack [56.370304308573274]
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models.
This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs.
arXiv Detail & Related papers (2020-09-16T06:53:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.