Model Extraction and Adversarial Transferability, Your BERT is
Vulnerable!
- URL: http://arxiv.org/abs/2103.10013v1
- Date: Thu, 18 Mar 2021 04:23:21 GMT
- Title: Model Extraction and Adversarial Transferability, Your BERT is
Vulnerable!
- Authors: Xuanli He and Lingjuan Lyu and Qiongkai Xu and Lichao Sun
- Abstract summary: We show how an adversary can steal a BERT-based API service on multiple benchmark datasets with limited prior knowledge and queries.
We also show that the extracted model can lead to highly transferable adversarial attacks against the victim model.
Our studies indicate that the potential vulnerabilities of BERT-based API services still hold, even when there is an architectural mismatch between the victim model and the attack model.
- Score: 11.425692676973332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language processing (NLP) tasks, ranging from text classification to
text generation, have been revolutionised by the pre-trained language models,
such as BERT. This allows corporations to easily build powerful APIs by
encapsulating fine-tuned BERT models for downstream tasks. However, when a
fine-tuned BERT model is deployed as a service, it may suffer from different
attacks launched by malicious users. In this work, we first present how an
adversary can steal a BERT-based API service (the victim/target model) on
multiple benchmark datasets with limited prior knowledge and queries. We
further show that the extracted model can lead to highly transferable
adversarial attacks against the victim model. Our studies indicate that the
potential vulnerabilities of BERT-based API services still hold, even when
there is an architectural mismatch between the victim model and the attack
model. Finally, we investigate two defence strategies to protect the victim
model and find that unless the performance of the victim model is sacrificed,
both model extraction and adversarial transferability can effectively
compromise the target models.
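To make the threat model concrete, below is a minimal sketch of the extraction-then-transfer pipeline described in the abstract. It is not the paper's exact setup: the victim endpoint, query set, surrogate architecture (DistilBERT), label count, and training loop are all illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


def query_victim(texts):
    """Hypothetical stand-in for black-box calls to the victim BERT API.

    In a real attack this would send each text to the deployed service and
    collect its predicted labels; here it returns dummy labels so the sketch
    runs end to end.
    """
    return [0 for _ in texts]


# Stage 1: model extraction. Label attacker-chosen queries with the victim's
# outputs and fine-tune a local "extracted" model on the (query, label) pairs.
# The surrogate architecture need not match the victim's (cf. the abstract).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
extracted = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

queries = ["the film was a complete waste of time", "an absolute delight"]
victim_labels = torch.tensor(query_victim(queries))

batch = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(extracted.parameters(), lr=2e-5)
extracted.train()
for _ in range(3):  # a few epochs of imitation training on the stolen labels
    loss = extracted(**batch, labels=victim_labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Stage 2: adversarial transferability. Craft adversarial examples against the
# now white-box extracted model (e.g. with a word-substitution attack) and
# submit them to the victim; the abstract reports that many such examples
# transfer even under an architectural mismatch.
```

In practice the attacker's query budget and the source of the queries (task-related versus out-of-domain text) determine how faithful the extracted model is; the two toy queries above only stand in for that budget.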
Related papers
- MisGUIDE : Defense Against Data-Free Deep Learning Model Extraction [0.8437187555622164]
"MisGUIDE" is a two-step defense framework for Deep Learning models that disrupts the adversarial sample generation process.
The aim of the proposed defense method is to reduce the accuracy of the cloned model while maintaining accuracy on authentic queries.
arXiv Detail & Related papers (2024-03-27T13:59:21Z)
- Query-Based Adversarial Prompt Generation [67.238873588125]
We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z)
- Arabic Synonym BERT-based Adversarial Examples for Text Classification [0.0]
This paper introduces the first word-level study of adversarial attacks in Arabic.
We assess the robustness of the state-of-the-art text classification models to adversarial attacks in Arabic.
We study the transferability of these newly produced Arabic adversarial examples to various models and investigate the effectiveness of defense mechanisms.
arXiv Detail & Related papers (2024-02-05T19:39:07Z)
- MSDT: Masked Language Model Scoring Defense in Text Domain [16.182765935007254]
We introduce a novel textual backdoor defense method, named MSDT, that outperforms existing defensive algorithms on specific datasets.
Experimental results show that our method is effective in defending against backdoor attacks in the text domain.
arXiv Detail & Related papers (2022-11-10T06:46:47Z)
- MOVE: Effective and Harmless Ownership Verification via Embedded External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.
We conduct ownership verification by checking whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z)
- Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
- Killing Two Birds with One Stone: Stealing Model and Inferring Attribute from BERT-based APIs [26.38350928431939]
We present an effective model extraction attack, where the adversary can practically steal a BERT-based API.
We develop an effective inference attack to expose the sensitive attribute of the training data used by the BERT-based APIs.
arXiv Detail & Related papers (2021-05-23T10:38:23Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations [81.82518920087175]
Adversarial attacking aims to fool deep neural networks with adversarial examples.
We propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently.
arXiv Detail & Related papers (2020-09-19T09:12:24Z)
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than those on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples; a minimal sketch of this masked-language-model substitution idea is given after the list below.
Our method outperforms state-of-the-art attack strategies in both success rate and perturbation percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
- DaST: Data-free Substitute Training for Adversarial Attacks [55.76371274622313]
We propose a data-free substitute training method (DaST) to obtain substitute models for adversarial black-box attacks.
To achieve this, DaST utilizes specially designed generative adversarial networks (GANs) to train the substitute models.
Experiments demonstrate the substitute models can achieve competitive performance compared with the baseline models.
arXiv Detail & Related papers (2020-03-28T04:28:13Z)
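The BERT-ATTACK entry above names a concrete mechanism: using BERT itself, as a masked language model, to propose word replacements. Below is a minimal sketch of that substitution step only, assuming bert-base-uncased and a toy sentence; the word-importance ranking and victim-model scoring used by the full BERT-Attack procedure are omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

sentence = "the movie was a brilliant piece of work"  # toy example
words = sentence.split()

for i, word in enumerate(words):
    # Mask one word at a time and ask BERT for likely replacements.
    masked = " ".join(words[:i] + [tokenizer.mask_token] + words[i + 1:])
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    # Locate the [MASK] position in the encoded sequence.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    top_ids = logits[0, mask_pos].topk(5).indices
    # Candidates may include subword pieces or the original word; the full
    # attack filters these and keeps only replacements that flip the victim's
    # prediction while the sentence remains fluent.
    candidates = tokenizer.convert_ids_to_tokens(top_ids.tolist())
    print(word, "->", candidates)
```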
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.