Defending Pre-trained Language Models from Adversarial Word
Substitutions Without Performance Sacrifice
- URL: http://arxiv.org/abs/2105.14553v1
- Date: Sun, 30 May 2021 14:24:53 GMT
- Title: Defending Pre-trained Language Models from Adversarial Word
Substitutions Without Performance Sacrifice
- Authors: Rongzhou Bao, Jiayi Wang, Hai Zhao
- Abstract summary: Adversarial word substitution is one of the most challenging textual adversarial attack methods.
This paper presents a compact, performance-preserving framework, Anomaly Detection with Frequency-Aware Randomization (ADFAR).
We show that ADFAR significantly outperforms recently proposed defense methods across various tasks with much higher inference speed.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained contextualized language models (PrLMs) have led to strong
performance gains in downstream natural language understanding tasks. However,
PrLMs can still be easily fooled by adversarial word substitution, which is one
of the most challenging textual adversarial attack methods. Existing defense
approaches suffer from notable performance loss and added complexity. This
paper therefore presents a compact, performance-preserving framework, Anomaly
Detection with Frequency-Aware Randomization (ADFAR). In detail, we design an
auxiliary anomaly detection classifier and adopt a multi-task learning
procedure, by which PrLMs are able to distinguish adversarial input samples.
Then, in order to defend against adversarial word substitution, a
frequency-aware randomization process is applied to the recognized adversarial
input samples. Empirical results show that ADFAR significantly outperforms
recently proposed defense methods across various tasks with much higher
inference speed. Remarkably, ADFAR
does not impair the overall performance of PrLMs. The code is available at
https://github.com/LilyNLP/ADFAR
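To make the defense flow concrete, below is a minimal sketch of ADFAR-style inference as described in the abstract, assuming the PrLM carries both a task head and the auxiliary anomaly head trained via multi-task learning. The names `prlm`, `get_synonyms`, and `word_freq` are illustrative placeholders rather than identifiers from the released code, and replacing rare words with more frequent synonyms is an assumption about how the frequency-aware randomization operates.

```python
# Illustrative sketch of the ADFAR-style inference flow described in the abstract.
# Assumptions (not taken from the released code): the PrLM exposes a task head and
# an auxiliary binary anomaly head; `get_synonyms` and `word_freq` are hypothetical lookups.
import random

def frequency_aware_randomize(tokens, word_freq, get_synonyms, swap_prob=0.3):
    """Randomly replace low-frequency words with more frequent synonyms."""
    randomized = []
    for tok in tokens:
        syns = get_synonyms(tok)
        # Keep only synonyms that are more frequent than the original word,
        # since adversarial substitutions tend to introduce rarer words.
        more_frequent = [s for s in syns if word_freq.get(s, 0) > word_freq.get(tok, 0)]
        if more_frequent and random.random() < swap_prob:
            randomized.append(random.choice(more_frequent))
        else:
            randomized.append(tok)
    return randomized

def adfar_style_predict(tokens, prlm, word_freq, get_synonyms):
    """Two-pass inference: detect anomalies, then re-predict on the randomized input."""
    task_logits, anomaly_logits = prlm(tokens)   # multi-task PrLM forward pass
    if anomaly_logits.argmax() == 1:             # sample flagged as adversarial
        tokens = frequency_aware_randomize(tokens, word_freq, get_synonyms)
        task_logits, _ = prlm(tokens)            # predict on the defended input
    return task_logits.argmax()
```

The two-pass design also suggests why overall performance can be preserved: inputs that the anomaly head judges to be clean never pass through the randomization step.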
Related papers
- CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models [12.386141652094999]
Existing certified robustness based on random smoothing has shown considerable promise in certifying input-specific text perturbations.
A naive method is to simply increase the masking ratio and the likelihood of masking attack tokens, but it leads to a significant reduction in both certified accuracy and the certified radius.
We introduce a novel approach, designed to identify a superior prompt that maintains higher certified accuracy under extensive masking.
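The masking-based smoothing discussed in this entry can be illustrated with a generic sketch: many randomly masked copies of the input are classified and a majority vote is taken, and raising the masking ratio trades vote margin for a higher chance of masking attack tokens. The `classify` callable and `MASK` token below are placeholders, not the CR-UTP implementation, and the certified-radius derivation is omitted.

```python
# Generic sketch of randomized masking smoothing for text (not the CR-UTP code):
# classify many randomly masked copies of the input and take a majority vote.
import random
from collections import Counter

MASK = "[MASK]"

def smoothed_predict(tokens, classify, mask_ratio=0.3, num_samples=100, rng=None):
    rng = rng or random.Random()
    votes = Counter()
    for _ in range(num_samples):
        masked = [MASK if rng.random() < mask_ratio else tok for tok in tokens]
        votes[classify(masked)] += 1
    # The majority label is the smoothed prediction; its vote share is the quantity
    # that certification bounds, which is why a larger mask_ratio can hurt the
    # certified accuracy and radius, as noted above.
    label, count = votes.most_common(1)[0]
    return label, count / num_samples
```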
arXiv Detail & Related papers (2024-06-04T01:02:22Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
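As a rough illustration of the token-level, perplexity-based detection this entry describes (the cited paper additionally uses contextual information), the following sketch flags tokens whose negative log-likelihood under a causal LM is unusually high. The model choice (GPT-2 via HuggingFace transformers) and the z-score threshold are assumptions, not details from the paper.

```python
# Hedged sketch: flag tokens whose negative log-likelihood under a causal LM is
# unusually high relative to the rest of the sequence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def suspicious_tokens(text, z_threshold=2.0):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits                     # (1, seq_len, vocab)
    # Negative log-likelihood of each token given its left context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target_ids = enc.input_ids[:, 1:]
    nll = -log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)[0]
    # Flag tokens whose NLL is far above the sequence mean (simple z-score rule).
    z = (nll - nll.mean()) / (nll.std() + 1e-8)
    tokens = tokenizer.convert_ids_to_tokens(target_ids[0].tolist())
    return [(tok, float(s)) for tok, s in zip(tokens, z) if s > z_threshold]
```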
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a Simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z) - ADEPT: A DEbiasing PrompT Framework [49.582497203415855]
Finetuning is an applicable approach for debiasing contextualized word embeddings.
Discrete prompts with semantic meanings have been shown to be effective in debiasing tasks.
We propose ADEPT, a method to debias PLMs using prompt tuning while maintaining the delicate balance between removing biases and ensuring representation ability.
arXiv Detail & Related papers (2022-11-10T08:41:40Z) - Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial
Attack Framework [17.17479625646699]
We propose a unified framework to craft textual adversarial samples.
In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD).
arXiv Detail & Related papers (2021-10-28T17:31:51Z) - Learning to Ask Conversational Questions by Optimizing Levenshtein
Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-06-30T08:44:19Z) - Disentangled Contrastive Learning for Learning Robust Textual
Representations [13.880693856907037]
We introduce the concept of momentum representation consistency to align features and leverage power normalization while preserving uniformity.
Experimental results on NLP benchmarks demonstrate that our approach obtains better results than the baselines.
arXiv Detail & Related papers (2021-04-11T03:32:49Z) - Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood
Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defend against substitution-based attacks.
DNE forms virtual sentences by sampling an embedding vector for each word in an input sentence from the convex hull spanned by the word and its synonyms, and uses them to augment the training data.
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
arXiv Detail & Related papers (2020-06-20T18:01:16Z)