Defending Pre-trained Language Models from Adversarial Word
Substitutions Without Performance Sacrifice
- URL: http://arxiv.org/abs/2105.14553v1
- Date: Sun, 30 May 2021 14:24:53 GMT
- Title: Defending Pre-trained Language Models from Adversarial Word
Substitutions Without Performance Sacrifice
- Authors: Rongzhou Bao, Jiayi Wang, Hai Zhao
- Abstract summary: Adversarial word substitution is one of the most challenging textual adversarial attack methods.
This paper presents a compact, performance-preserving framework, Anomaly Detection with Frequency-Aware Randomization (ADFAR).
We show that ADFAR significantly outperforms recently proposed defense methods across various tasks with much higher inference speed.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained contextualized language models (PrLMs) have led to strong
performance gains in downstream natural language understanding tasks. However,
PrLMs can still be easily fooled by adversarial word substitution, which is one
of the most challenging textual adversarial attack methods. Existing defense
approaches suffer from notable performance loss and added complexity. This
paper therefore presents a compact, performance-preserving framework, Anomaly
Detection with Frequency-Aware Randomization (ADFAR). In detail, we design an
auxiliary anomaly detection classifier and adopt a multi-task learning
procedure, by which PrLMs are able to distinguish adversarial input samples.
Then, in order to defend against adversarial word substitution, a
frequency-aware randomization process is applied to the recognized adversarial
input samples. Empirical results show that ADFAR significantly outperforms
recently proposed defense methods across various tasks with much higher
inference speed. Remarkably, ADFAR
does not impair the overall performance of PrLMs. The code is available at
https://github.com/LilyNLP/ADFAR
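To make the defense flow concrete, below is a minimal sketch of ADFAR-style inference as described in the abstract, assuming the PrLM carries both a task head and the auxiliary anomaly head trained via multi-task learning. The names `prlm`, `get_synonyms`, and `word_freq` are illustrative placeholders rather than identifiers from the released code, and replacing rare words with more frequent synonyms is an assumption about how the frequency-aware randomization operates.

```python
# Illustrative sketch of the ADFAR-style inference flow described in the abstract.
# Assumptions (not taken from the released code): the PrLM exposes a task head and
# an auxiliary binary anomaly head; `get_synonyms` and `word_freq` are hypothetical lookups.
import random

def frequency_aware_randomize(tokens, word_freq, get_synonyms, swap_prob=0.3):
    """Randomly replace low-frequency words with more frequent synonyms."""
    randomized = []
    for tok in tokens:
        syns = get_synonyms(tok)
        # Keep only synonyms that are more frequent than the original word,
        # since adversarial substitutions tend to introduce rarer words.
        more_frequent = [s for s in syns if word_freq.get(s, 0) > word_freq.get(tok, 0)]
        if more_frequent and random.random() < swap_prob:
            randomized.append(random.choice(more_frequent))
        else:
            randomized.append(tok)
    return randomized

def adfar_style_predict(tokens, prlm, word_freq, get_synonyms):
    """Two-pass inference: detect anomalies, then re-predict on the randomized input."""
    task_logits, anomaly_logits = prlm(tokens)   # multi-task PrLM forward pass
    if anomaly_logits.argmax() == 1:             # sample flagged as adversarial
        tokens = frequency_aware_randomize(tokens, word_freq, get_synonyms)
        task_logits, _ = prlm(tokens)            # predict on the defended input
    return task_logits.argmax()
```

The two-pass design also suggests why overall performance can be preserved: inputs that the anomaly head judges to be clean never pass through the randomization step.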
Related papers
- CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models [12.386141652094999]
Existing certified robustness based on random smoothing has shown considerable promise in certifying input-specific text perturbations.
A naive method is to simply increase the masking ratio and the likelihood of masking attack tokens, but it leads to a significant reduction in both certified accuracy and the certified radius.
We introduce a novel approach, designed to identify a superior prompt that maintains higher certified accuracy under extensive masking.
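The masking-based smoothing discussed in this entry can be illustrated with a generic sketch: many randomly masked copies of the input are classified and a majority vote is taken, and raising the masking ratio trades vote margin for a higher chance of masking attack tokens. The `classify` callable and `MASK` token below are placeholders, not the CR-UTP implementation, and the certified-radius derivation is omitted.

```python
# Generic sketch of randomized masking smoothing for text (not the CR-UTP code):
# classify many randomly masked copies of the input and take a majority vote.
import random
from collections import Counter

MASK = "[MASK]"

def smoothed_predict(tokens, classify, mask_ratio=0.3, num_samples=100, rng=None):
    rng = rng or random.Random()
    votes = Counter()
    for _ in range(num_samples):
        masked = [MASK if rng.random() < mask_ratio else tok for tok in tokens]
        votes[classify(masked)] += 1
    # The majority label is the smoothed prediction; its vote share is the quantity
    # that certification bounds, which is why a larger mask_ratio can hurt the
    # certified accuracy and radius, as noted above.
    label, count = votes.most_common(1)[0]
    return label, count / num_samples
```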
arXiv Detail & Related papers (2024-06-04T01:02:22Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
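As a rough illustration of the token-level, perplexity-based detection this entry describes (the cited paper additionally uses contextual information), the following sketch flags tokens whose negative log-likelihood under a causal LM is unusually high. The model choice (GPT-2 via HuggingFace transformers) and the z-score threshold are assumptions, not details from the paper.

```python
# Hedged sketch: flag tokens whose negative log-likelihood under a causal LM is
# unusually high relative to the rest of the sequence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def suspicious_tokens(text, z_threshold=2.0):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits                     # (1, seq_len, vocab)
    # Negative log-likelihood of each token given its left context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target_ids = enc.input_ids[:, 1:]
    nll = -log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)[0]
    # Flag tokens whose NLL is far above the sequence mean (simple z-score rule).
    z = (nll - nll.mean()) / (nll.std() + 1e-8)
    tokens = tokenizer.convert_ids_to_tokens(target_ids[0].tolist())
    return [(tok, float(s)) for tok, s in zip(tokens, z) if s > z_threshold]
```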
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a Simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z) - ADEPT: A DEbiasing PrompT Framework [49.582497203415855]
Finetuning is an applicable approach for debiasing contextualized word embeddings.
Discrete prompts with semantic meanings have been shown to be effective in debiasing tasks.
We propose ADEPT, a method to debias PLMs using prompt tuning while maintaining the delicate balance between removing biases and ensuring representation ability.
arXiv Detail & Related papers (2022-11-10T08:41:40Z) - Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial
Attack Framework [17.17479625646699]
We propose a unified framework to craft textual adversarial samples.
In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD).
arXiv Detail & Related papers (2021-10-28T17:31:51Z) - Learning to Ask Conversational Questions by Optimizing Levenshtein
Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-06-30T08:44:19Z) - Disentangled Contrastive Learning for Learning Robust Textual
Representations [13.880693856907037]
We introduce the concept of momentum representation consistency to align features and leverage power normalization while preserving uniformity.
Experimental results on NLP benchmarks demonstrate that our approach obtains better results than the baselines.
arXiv Detail & Related papers (2021-04-11T03:32:49Z) - Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood
Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defend against substitution-based attacks.
DNE forms virtual sentences by sampling an embedding vector for each word in an input sentence from the convex hull spanned by the word and its synonyms, and uses them to augment the training data.
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
arXiv Detail & Related papers (2020-06-20T18:01:16Z)