MSDT: Masked Language Model Scoring Defense in Text Domain
- URL: http://arxiv.org/abs/2211.05371v1
- Date: Thu, 10 Nov 2022 06:46:47 GMT
- Title: MSDT: Masked Language Model Scoring Defense in Text Domain
- Authors: Jaechul Roh, Minhao Cheng, Yajun Fang
- Abstract summary: We will introduce a novel improved textual backdoor defense method, named MSDT, that outperforms the current existing defensive algorithms in specific datasets.
experimental results illustrate that our method can be effective and constructive in terms of defending against backdoor attack in text domain.
- Score: 16.182765935007254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models allowed us to process downstream tasks with the
help of fine-tuning, which aids the model to achieve fairly high accuracy in
various Natural Language Processing (NLP) tasks. Such easily-downloaded
language models from various websites empowered the public users as well as
some major institutions to give a momentum to their real-life application.
However, it was recently proven that models become extremely vulnerable when
they are backdoor attacked with trigger-inserted poisoned datasets by malicious
users. The attackers then redistribute the victim models to the public to
attract other users to use them, where the models tend to misclassify when
certain triggers are detected within the training sample. In this paper, we
will introduce a novel improved textual backdoor defense method, named MSDT,
that outperforms the current existing defensive algorithms in specific
datasets. The experimental results illustrate that our method can be effective
and constructive in terms of defending against backdoor attack in text domain.
Code is available at https://github.com/jcroh0508/MSDT.
Related papers
- Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [14.011140902511135]
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks.
Despite being widely applied, in-context learning is vulnerable to malicious attacks.
We design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning.
arXiv Detail & Related papers (2024-01-11T14:38:19Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Large Language Models Are Better Adversaries: Exploring Generative
Clean-Label Backdoor Attacks Against Text Classifiers [25.94356063000699]
Backdoor attacks manipulate model predictions by inserting innocuous triggers into training and test data.
We focus on more realistic and more challenging clean-label attacks where the adversarial training examples are correctly labeled.
Our attack, LLMBkd, leverages language models to automatically insert diverse style-based triggers into texts.
arXiv Detail & Related papers (2023-10-28T06:11:07Z) - Training-free Lexical Backdoor Attacks on Language Models [30.91728116238065]
We propose Training-Free Lexical Backdoor Attack (TFLexAttack) as the first training-free backdoor attack on language models.
Our attack is achieved by injecting lexical triggers into the tokenizer of a language model via manipulating its embedding dictionary.
We conduct extensive experiments on three dominant NLP tasks based on nine language models to demonstrate the effectiveness and universality of our attack.
arXiv Detail & Related papers (2023-02-08T15:18:51Z) - MOVE: Effective and Harmless Ownership Verification via Embedded
External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.
We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z) - BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation
Models [25.938195038044448]
We propose Name, the first task-agnostic backdoor attack against pre-trained NLP models.
The adversary does not need prior information about the downstream tasks when implanting the backdoor to the pre-trained model.
Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
arXiv Detail & Related papers (2021-10-06T02:48:58Z) - Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability
of the Embedding Layers in NLP Models [27.100909068228813]
Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack.
In this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector.
Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier.
arXiv Detail & Related papers (2021-03-29T12:19:45Z) - Black-box Detection of Backdoor Attacks with Limited Information and
Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The emphbackdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the emphfine-grained attack, where we treat the target label from the object-level instead of the image-level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z) - Learning to Attack: Towards Textual Adversarial Attacking in Real-world
Situations [81.82518920087175]
Adversarial attacking aims to fool deep neural networks with adversarial examples.
We propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently.
arXiv Detail & Related papers (2020-09-19T09:12:24Z) - Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood
Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defense substitution-based attacks.
DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and it augments them with the training data.
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
arXiv Detail & Related papers (2020-06-20T18:01:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.