MoPe: Model Perturbation-based Privacy Attacks on Language Models
- URL: http://arxiv.org/abs/2310.14369v1
- Date: Sun, 22 Oct 2023 17:33:19 GMT
- Title: MoPe: Model Perturbation-based Privacy Attacks on Language Models
- Authors: Marvin Li, Jason Wang, Jeffrey Wang, Seth Neel
- Abstract summary: Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data.
We present Model Perturbations (MoPe), a new method to identify with high confidence if a given text is in the training data of a pre-trained language model.
- Score: 4.4746931463927835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that Large Language Models (LLMs) can unintentionally
leak sensitive information present in their training data. In this paper, we
present Model Perturbations (MoPe), a new method to identify with high
confidence if a given text is in the training data of a pre-trained language
model, given white-box access to the model's parameters. MoPe adds noise to the
model in parameter space and measures the drop in log-likelihood at a given
point $x$, a statistic we show approximates the trace of the Hessian matrix
with respect to model parameters. Across language models ranging from $70$M to
$12$B parameters, we show that MoPe is more effective than existing loss-based
attacks and recently proposed perturbation-based methods. We also examine the
role of training point order and model size in attack success, and empirically
demonstrate that MoPe accurately approximates the trace of the Hessian in
practice. Our results show that the loss of a point alone is insufficient to
determine extractability -- there are training points we can recover using our
method that have average loss. This casts some doubt on prior works that use
the loss of a point as evidence of memorization or unlearning.
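The statistic described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: the function name `mope_statistic`, the noise scale `sigma`, the number of perturbations, and the quadratic stand-in loss are all assumptions chosen for illustration. It adds isotropic Gaussian noise to the parameters, averages the resulting increase in loss (equivalently, the drop in log-likelihood) at a fixed point, and checks that rescaling by $2/\sigma^2$ gives roughly the Hessian trace, as a second-order Taylor expansion suggests.

```python
import numpy as np

def mope_statistic(loss_fn, theta, sigma=0.05, n_perturbations=20000, seed=0):
    """Mean increase in loss at one point after Gaussian noise is added to theta.

    loss_fn is a hypothetical callable mapping a parameter vector to the loss
    (negative log-likelihood) of a single fixed input x.
    """
    rng = np.random.default_rng(seed)
    base_loss = loss_fn(theta)
    perturbed_losses = [
        loss_fn(theta + sigma * rng.standard_normal(theta.shape))
        for _ in range(n_perturbations)
    ]
    return float(np.mean(perturbed_losses) - base_loss)

# Toy stand-in for a language-model loss: a quadratic with known Hessian A.
A = np.diag([1.0, 2.0, 3.0])  # trace(A) = 6

def quadratic_loss(theta):
    return 0.5 * theta @ A @ theta

theta = np.array([0.03, -0.02, 0.01])
sigma = 0.05
stat = mope_statistic(quadratic_loss, theta, sigma=sigma)

# Second-order expansion: E[L(theta + sigma * eps)] - L(theta) ~ (sigma^2 / 2) * tr(H)
trace_estimate = 2.0 * stat / sigma**2
print(f"MoPe statistic: {stat:.5f}, implied Hessian trace: {trace_estimate:.2f}")
```

For a real language model, `loss_fn` would evaluate the model's negative log-likelihood of the candidate text under a perturbed copy of the weights, which requires the white-box access the abstract mentions.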
Related papers
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models [4.081098869497239]
We develop state-of-the-art privacy attacks against Large Language Models (LLMs).
New membership inference attacks (MIAs) against pretrained LLMs perform hundreds of times better than baseline attacks.
In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance.
arXiv Detail & Related papers (2024-02-26T20:41:50Z)
- Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z)
- Detecting Pretraining Data from Large Language Models [90.12037980837738]
We study the pretraining data detection problem.
Given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text?
We introduce a new detection method Min-K% Prob based on a simple hypothesis.
arXiv Detail & Related papers (2023-10-25T17:21:23Z)
- In-Context Unlearning: Language Models as Few Shot Unlearners [27.962361828354716]
We propose a new class of unlearning methods for Large Language Models (LLMs).
This method unlearns instances from the model by simply providing specific kinds of inputs in context, without the need to update model parameters.
Our experimental results demonstrate that in-context unlearning performs on par with, or in some cases outperforms other state-of-the-art methods that require access to model parameters.
arXiv Detail & Related papers (2023-10-11T15:19:31Z)
- Beyond Labeling Oracles: What does it mean to steal ML models? [52.63413852460003]
Model extraction attacks are designed to steal trained models with only query access.
We investigate factors influencing the success of model extraction attacks.
Our findings urge the community to redefine the adversarial goals of ME attacks.
arXiv Detail & Related papers (2023-10-03T11:10:21Z)
- Defense-Prefix for Preventing Typographic Attacks on CLIP [14.832208701208414]
Some adversarial attacks fool a model into false or absurd classifications.
We introduce our simple yet effective method: Defense-Prefix (DP), which inserts the DP token before a class name to make words "robust" against typographic attacks.
Our method significantly improves the accuracy of classification tasks for typographic attack datasets, while maintaining the zero-shot capabilities of the model.
arXiv Detail & Related papers (2023-04-10T11:05:20Z)
- Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA).
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z)
- Model Extraction Attack against Self-supervised Speech Models [52.81330435990717]
Self-supervised learning (SSL) speech models generate meaningful representations of given clips.
Model extraction attack (MEA) often refers to an adversary stealing the functionality of the victim model with only query access.
We study the MEA problem against SSL speech model with a small number of queries.
arXiv Detail & Related papers (2022-11-29T09:28:05Z)
- Training Data Leakage Analysis in Language Models [6.843491191969066]
We introduce a methodology that investigates identifying the user content in the training data that could be leaked under a strong and realistic threat model.
We propose two metrics to quantify user-level data leakage by measuring a model's ability to produce unique sentence fragments within training data.
arXiv Detail & Related papers (2021-01-14T00:57:32Z)
- Cold-start Active Learning through Self-supervised Language Modeling [15.551710499866239]
Active learning aims to reduce annotation costs by choosing the most critical examples to label.
With BERT, we develop a simple strategy based on the masked language modeling loss.
Compared to other baselines, our approach reaches higher accuracy within less sampling iterations and time.
arXiv Detail & Related papers (2020-10-19T14:09:17Z)