Combing for Credentials: Active Pattern Extraction from Smart Reply
- URL: http://arxiv.org/abs/2207.10802v3
- Date: Sat, 2 Sep 2023 22:33:09 GMT
- Title: Combing for Credentials: Active Pattern Extraction from Smart Reply
- Authors: Bargav Jayaraman, Esha Ghosh, Melissa Chase, Sambuddha Roy, Wei Dai, David Evans
- Abstract summary: We investigate potential information leakage vulnerabilities in a typical Smart Reply pipeline.
We introduce a new type of active extraction attack that exploits canonical patterns in text containing sensitive data.
We show experimentally that it is possible for an adversary to extract sensitive user information present in the training data, even in realistic settings.
- Score: 15.097010165958027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained large language models, such as GPT-2 and BERT, are
often fine-tuned to achieve state-of-the-art performance on a downstream task.
One natural example is the "Smart Reply" application where a pre-trained
model is tuned to provide suggested responses for a given query message. Since
the tuning data is often sensitive data such as emails or chat transcripts, it
is important to understand and mitigate the risk that the model leaks its
tuning data. We investigate potential information leakage vulnerabilities in a
typical Smart Reply pipeline. We consider a realistic setting where the
adversary can only interact with the underlying model through a front-end
interface that constrains what types of queries can be sent to the model.
Previous attacks do not work in these settings, but require the ability to send
unconstrained queries directly to the model. Even when there are no constraints
on the queries, previous attacks typically require thousands, or even millions,
of queries to extract useful information, while our attacks can extract
sensitive data in just a handful of queries. We introduce a new type of active
extraction attack that exploits canonical patterns in text containing sensitive
data. We show experimentally that it is possible for an adversary to extract
sensitive user information present in the training data, even in realistic
settings where all interactions with the model must go through a front-end that
limits the types of queries. We explore potential mitigation strategies and
demonstrate empirically how differential privacy appears to be a reasonably
effective defense mechanism to such pattern extraction attacks.
Related papers
- MisGUIDE : Defense Against Data-Free Deep Learning Model Extraction [0.8437187555622164]
"MisGUIDE" is a two-step defense framework for Deep Learning models that disrupts the adversarial sample generation process.
The aim of the proposed defense method is to reduce the accuracy of the cloned model while maintaining accuracy on authentic queries.
arXiv Detail & Related papers (2024-03-27T13:59:21Z)
- DTA: Distribution Transform-based Attack for Query-Limited Scenario [11.874670564015789]
In generating adversarial examples, the conventional black-box attack methods rely on sufficient feedback from the to-be-attacked models.
This paper proposes a hard-label attack for the scenario in which the attacker is permitted to conduct only a limited number of queries.
Experiments validate the effectiveness of the proposed idea and the superiority of DTA over the state-of-the-art.
arXiv Detail & Related papers (2023-12-12T13:21:03Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z)
- Detecting Pretraining Data from Large Language Models [90.12037980837738]
We study the pretraining data detection problem.
Given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text?
We introduce a new detection method Min-K% Prob based on a simple hypothesis.
arXiv Detail & Related papers (2023-10-25T17:21:23Z)
- MeaeQ: Mount Model Extraction Attacks with Efficient Queries [6.1106195466129485]
We study model extraction attacks in natural language processing (NLP).
We propose MeaeQ, a straightforward yet effective method to address these issues.
MeaeQ achieves higher functional similarity to the victim model than baselines while requiring fewer queries.
arXiv Detail & Related papers (2023-10-21T16:07:16Z)
- Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers [71.70205894168039]
We consider instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model.
We propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information.
arXiv Detail & Related papers (2023-01-27T07:53:50Z)
- Generalizable Black-Box Adversarial Attack with Meta Learning [54.196613395045595]
In black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful perturbation based on query feedback under a query budget.
We propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability.
The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance.
arXiv Detail & Related papers (2023-01-01T07:24:12Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
- Exploring the Universal Vulnerability of Prompt-based Learning Paradigm [21.113683206722207]
We find that prompt-based learning bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting.
However, we find that this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text.
We explore this universal vulnerability by either injecting backdoor triggers or searching for adversarial triggers on pre-trained language models using only plain text.
arXiv Detail & Related papers (2022-04-11T16:34:10Z)
- Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability [18.92690624514601]
Research has shown that down-stream models can be easily fooled with adversarial inputs that look like the training data, but slightly perturbed, in a way imperceptible to humans.
In this paper, we propose Explain2Attack, a black-box adversarial attack on text classification task.
We show that our framework either achieves or out-performs attack rates of the state-of-the-art models, yet with lower queries cost and higher efficiency.
arXiv Detail & Related papers (2020-10-14T04:56:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.